
A novel approach for variable star classification based on imbalanced learning

Published online by Cambridge University Press:  09 August 2023

Jingyi Zhang
Affiliation:
Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Yanxia Zhang*
Affiliation:
Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Zihan Kang
Affiliation:
Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Changhua Li
Affiliation:
National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Yihan Tao
Affiliation:
National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Yongheng Zhao
Affiliation:
Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
Xue-Bing Wu
Affiliation:
Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing, China
* Corresponding author: Yanxia Zhang; Email: zyx@bao.ac.cn

Abstract

The advent of time-domain sky surveys has generated a vast amount of light-variation data, enabling astronomers to investigate variable stars with large-scale samples. This brings new opportunities to time-domain research, but also new challenges. In this paper, we focus on the classification of variable stars from the Catalina Surveys Data Release 2 and propose an imbalanced-learning classifier based on the Self-paced Ensemble (SPE) method. Compared with the work of Hosenie et al. (2020), our approach significantly enhances the classification Recall of Blazhko RR Lyrae stars from 12% to 85%, mixed-mode RR Lyrae variables from 29% to 64%, detached binaries from 68% to 97%, and long-period variables (LPVs) from 87% to 99%. SPE performs well on most of the variable classes, with the exceptions of RRab, RRc, and contact and semi-detached binaries. Moreover, the results suggest that SPE tends to target the minority classes, while Random Forest is more effective at finding the majority classes. To balance the overall classification accuracy, we construct a Voting Classifier that combines the strengths of SPE and Random Forest. The results show that the Voting Classifier achieves a balanced performance across all classes with minimal loss of accuracy. In summary, the SPE algorithm and the Voting Classifier outperform traditional machine-learning methods and are well suited to classifying periodic variable stars. This work contributes to current research on imbalanced learning in astronomy and can be extended to the time-domain data of other, larger sky surveys (e.g., LSST).
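
For readers who want a starting point for reproduction, the following is a minimal sketch of training an SPE classifier on a toy imbalanced binary problem. It assumes an sklearn-compatible SPE implementation such as the self-paced-ensemble package released with Liu et al. (2020); the import path is an assumption, and make_classification stands in for the CRTS feature table.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
# Assumed import path; use whichever sklearn-compatible SPE implementation is installed.
from self_paced_ensemble import SelfPacedEnsembleClassifier

# Toy two-class imbalanced data standing in for the CRTS light-curve features.
X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

spe = SelfPacedEnsembleClassifier(n_estimators=50, random_state=0)
spe.fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, spe.predict(X_te)))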

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Astronomical Society of Australia

Table 1. The number of variables in each class in the CRTS dataset.

Table 2. Comparison of different imbalance methods.

Figure 1. Folded light curves of different kinds of variable stars, in magnitudes as a function of phase. The data points are shown as light blue dots with error bars, and the fitted light curves as purple lines.
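
As a concrete illustration of the folding in Figure 1: given observation times t and a period P, each point is mapped to phase (t mod P)/P. The sketch below uses made-up data (an RR-Lyrae-like sinusoid with an illustrative period), not the CRTS photometry:

import numpy as np
import matplotlib.pyplot as plt

# Toy light curve: an RR-Lyrae-like sinusoid sampled irregularly, with noise.
rng = np.random.default_rng(0)
period = 0.55                      # days (illustrative value, not a fitted period)
t = np.sort(rng.uniform(0, 100, 300))
mag = 15.0 + 0.4 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.05, t.size)

# Fold: map each observation time onto [0, 1) in units of the period.
phase = (t % period) / period

plt.errorbar(phase, mag, yerr=0.05, fmt='.', alpha=0.5)
plt.gca().invert_yaxis()           # brighter (smaller magnitude) plotted upward
plt.xlabel('Phase')
plt.ylabel('Magnitude')
plt.show()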

Figure 2. The pipeline of Self-paced Ensemble, from Liu et al. (2020). Instead of simply balancing the data or directly adjusting class weights, classification hardness is computed over the dataset, and the most informative majority-class samples are iteratively selected according to the hardness distribution. The under-sampling strategy is controlled by a self-paced procedure, which lets SPE gradually focus on the harder samples while still retaining information from the majority class to prevent over-fitting.
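
The self-paced under-sampling of Figure 2 is compact enough to sketch. The function below is a simplified reading of Liu et al. (2020) for the binary case (hardness = the current ensemble's error on each majority sample, k hardness bins, a tangent self-paced factor); it is illustrative, not the reference implementation, and the bin-weighting details are an assumption.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_paced_ensemble(X, y, n_estimators=10, k_bins=10, seed=0):
    """Simplified binary SPE (class 1 = minority), loosely after Liu et al. (2020)."""
    rng = np.random.default_rng(seed)
    X_maj, X_min = X[y == 0], X[y == 1]
    ensemble = []
    for i in range(n_estimators):
        if ensemble:
            # Hardness of a majority sample = the ensemble's error on it,
            # i.e. its current predicted probability of being minority.
            hardness = np.mean([c.predict_proba(X_maj)[:, 1] for c in ensemble], axis=0)
        else:
            # No model yet: random hardness, i.e. plain random under-sampling.
            hardness = rng.uniform(size=len(X_maj))
        # Split the majority class into k bins of increasing hardness.
        edges = np.linspace(hardness.min(), hardness.max() + 1e-12, k_bins + 1)
        bin_of = np.clip(np.digitize(hardness, edges[1:-1]), 0, k_bins - 1)
        # Self-paced factor: grows with i, moving bin weights from inverse-hardness
        # (favour easy, populous bins) towards uniform (favour hard, sparse bins).
        alpha = np.tan(np.pi * i / (2 * n_estimators))
        avg_h = np.array([hardness[bin_of == l].mean() if np.any(bin_of == l) else np.inf
                          for l in range(k_bins)])
        w = 1.0 / (avg_h + alpha + 1e-12)          # empty bins get zero weight
        quota = np.round(w / w.sum() * len(X_min)).astype(int)
        # Draw the per-bin quota of majority samples, then train a base learner
        # on the (roughly) balanced subset.
        picked = []
        for l in range(k_bins):
            members = np.flatnonzero(bin_of == l)
            take = min(quota[l], members.size)
            if take > 0:
                picked.append(rng.choice(members, take, replace=False))
        picked = np.concatenate(picked)
        Xb = np.vstack([X_maj[picked], X_min])
        yb = np.concatenate([np.zeros(len(picked), int), np.ones(len(X_min), int)])
        ensemble.append(DecisionTreeClassifier(max_depth=6).fit(Xb, yb))
    return ensemble

def spe_predict_proba(ensemble, X):
    """Average the base learners' minority-class probabilities."""
    return np.mean([c.predict_proba(X)[:, 1] for c in ensemble], axis=0)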

Table 3. Confusion matrix of binary classification.
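
For reference, the quantities reported in Tables 4-6 follow from the entries of Table 3 by the standard definitions (written here for the binary case; for the multi-class problem the per-class recalls are averaged or combined geometrically, respectively):

$$\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN+FP},$$
$$Balanced\,Accuracy = \frac{\mathrm{Recall}+\mathrm{Specificity}}{2}, \qquad G\,Mean = \sqrt{\mathrm{Recall}\times\mathrm{Specificity}}.$$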

Figure 3. The AUCROC of SPE for different classes.
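
A per-class AUCROC like Figure 3 is conventionally computed one-vs-rest. A minimal sketch with scikit-learn, where clf, X_test, and y_test are placeholders for a fitted probabilistic classifier and a held-out set (this is the standard recipe, not necessarily the authors' exact procedure):

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

classes = np.unique(y_test)
Y = label_binarize(y_test, classes=classes)   # one indicator column per class
proba = clf.predict_proba(X_test)             # columns follow clf.classes_
for k, c in enumerate(classes):
    # One-vs-rest AUC: this class as positive, all other classes as negative.
    print(c, roc_auc_score(Y[:, k], proba[:, k]))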

Figure 4. Comparison of the confusion matrices of SPE and the first-layer classifier of Ho20.

Figure 5. Confusion matrices of the second-layer and third-layer classifiers in Ho20.

Table 4. Mean $Balanced\,Accuracy$ and $G\,Mean$ for SPE.

Table 5. Mean $Balanced\,Accuracy$ and $G\,Mean$ for the work of Ho20.

Figure 6. The flowchart of the Voting Classifier. In soft voting, the base classifiers are fed with the training data and each independently assigns a probability to each of the n possible classes. The per-class probabilities of the base classifiers are then averaged, and the final output is the class with the maximum average probability.
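
The averaging step in Figure 6 is one line in practice. A minimal sketch, assuming rf and spe are a fitted RandomForestClassifier and a fitted SPE model (placeholders, not the paper's tuned models):

import numpy as np

# Soft voting: average the two models' class-probability outputs, then
# take the class with the highest mean probability.
proba = (rf.predict_proba(X_test) + spe.predict_proba(X_test)) / 2.0
y_pred = rf.classes_[np.argmax(proba, axis=1)]

With scikit-learn estimators, VotingClassifier(estimators=[('rf', rf), ('spe', spe)], voting='soft') implements the same equal-weight rule, refitting the base models internally.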

Figure 7. Confusion matrix of SPE.

Figure 8. A Voting Classifier is built by combining Random Forest and SPE. In the same format as Fig. 7, the left and right panels show the confusion matrices of Random Forest and the Voting Classifier, respectively.

Table 6. Mean $Balanced\,Accuracy$ and $G\,Mean$ for Voting Classifier.

Figure 9. The Period versus SmallKurtosis distribution of RRab, RRc, RRd, and Blazhko stars.