Comparison of Three Statistical Classification Techniques for Maser Identification

Published online by Cambridge University Press: 14 April 2016

Ellen M. Manning
Affiliation:
School of Physical Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania 7001, Australia
Barbara R. Holland
Affiliation:
School of Physical Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania 7001, Australia
Simon P. Ellingsen*
Affiliation:
School of Physical Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania 7001, Australia
Shari L. Breen
Affiliation:
CSIRO Astronomy and Space Science, PO Box 76, Epping, NSW 1710, Australia
Xi Chen
Affiliation:
Key Laboratory for Research in Galaxies and Cosmology, Shanghai Astronomical Observatory, Chinese Academy of Sciences, Shanghai 200030, China; Key Laboratory of Radio Astronomy, Chinese Academy of Sciences, Beijing 100012, China
Melissa Humphries
Affiliation:
School of Physical Sciences, University of Tasmania, Private Bag 37, Hobart, Tasmania 7001, Australia

Abstract

We applied three statistical classification techniques, namely linear discriminant analysis (LDA), logistic regression, and random forests, to three astronomical datasets associated with searches for interstellar masers. We compared the performance of these methods in identifying whether specific mid-infrared or millimetre continuum sources are likely to have associated interstellar masers. We also discuss the interpretability of the results of each classification technique. Non-parametric methods have the potential to make accurate predictions when there are complex relationships between critical parameters. We found that for the small datasets the parametric methods, logistic regression and LDA, performed best, whereas for the largest dataset the non-parametric random forests method achieved accuracy comparable to that of the parametric techniques rather than any significant improvement. This suggests that, at least for the specific examples investigated here, the accuracy of the predictions is not limited by the use of parametric models. We also found that for LDA, transformation of the data to match a normal distribution led to a significant improvement in accuracy. The different classification techniques had significant overlap in their predictions; further astronomical observations will enable the accuracy of these predictions to be tested.

Information

Type
Research Article
Copyright
Copyright © Astronomical Society of Australia 2016 

Table 1. The relationship between the four possible classification results and the calculated values of the sensitivity and the specificity.
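The relationship summarised in Table 1 can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard definitions of sensitivity and specificity; the function name and the counts are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four classification counts.

    tp: sources with a maser that are predicted to have one (true positives)
    fn: sources with a maser that are predicted to have none (false negatives)
    tn: sources without a maser predicted to have none (true negatives)
    fp: sources without a maser predicted to have one (false positives)
    """
    # Sensitivity: fraction of actual maser sources that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of non-maser sources that are correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical counts chosen only for illustration.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=45, fp=5)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With few maser detections in a sample, sensitivity rests on a small denominator, so a single misclassified source can shift it noticeably.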

Figure 1. Boxplots comparing the variables from Data Set 1 between the sources with an associated water maser and those without. The outline of each box spans the range between the first and third quartiles, with the median shown as the solid horizontal line through the box. The vertical lines outside the box extend to the minimum and maximum values that are not outliers, with any outliers (values separated from the quartiles by more than one and a half times the interquartile range) shown separately as dots. Because very few samples in this data set are associated with a maser, the individual sample points are also plotted.
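The outlier rule described in this caption can be sketched as follows. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since plotting packages differ in how they interpolate quartiles; the function name and sample values are hypothetical.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    # Quartiles by linear interpolation; the convention is an assumption.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample, not data from the paper.
sample = [1, 2, 3, 4, 5, 100]
outliers = boxplot_outliers(sample)
print(outliers)
```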

Table 2. The predictor variables that increased the classification accuracy of the various methods for Data Set 1. Random forests provides an internal calculation of the mean decrease in accuracy (the higher the value, the more important the variable), and logistic regression provides p-values (the lower the value, the more significant the variable’s contribution to the model). LDA provides no internal measurement of the importance of each variable, so the table only notes which variables were used (see Section 2.4.1).

Figure 2. Receiver operating characteristic curves showing the results of the cross validation for Data Set 1. The diagonal line y = x represents randomly classifying the samples, with half predicted as positive and half as negative. For definitions of classification results, see Section 2.4.
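As a rough illustration of how a receiver operating characteristic curve is built from predicted probabilities, the sketch below sweeps a decision threshold and records the false positive rate and true positive rate at each step. The function names, labels, and scores are hypothetical, and this is not the authors' code; it assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold through each distinct score, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve; 0.5 corresponds to the line y = x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = maser present) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, round(auc, 3))
```

A curve that bows above the diagonal gives an area greater than 0.5, meaning the model ranks maser sources above non-maser sources more often than chance would.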

Table 3. The results of cross-validating random forests, logistic regression, and LDA (without and with transformation of the predictor variables) classification and prediction for Data Set 1.

Figure 3. MDS plot of the proximity values produced by the random forest classification (Data Set 1). The values on the axes are arbitrary; the graph only compares relative magnitudes. The closer two points are on the plot, the more similar their properties as determined by the random forest classification. ‘Border-line’ classifications were samples with a predicted maser association between 45 and 55%.
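The ‘border-line’ labelling used in this caption can be sketched as a simple band around 50%. Whether the 45% and 55% endpoints are inclusive is an assumption, and the function name and label strings are hypothetical.

```python
def label_prediction(probability, low=0.45, high=0.55):
    """Label a predicted maser probability (0 to 1) as in the MDS plots."""
    # Endpoints are treated as inclusive; the caption does not specify this.
    if low <= probability <= high:
        return "border-line"
    return "maser" if probability > high else "no maser"


# Hypothetical probabilities for three sources.
labels_out = [label_prediction(p) for p in (0.10, 0.50, 0.90)]
print(labels_out)
```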

Figure 4. Only the predictor variables from Data Set 2 that show noticeable differences between YSOs with and without an associated water maser are shown. Because the data set is very small, the individual sample points are also plotted. Some of the variables are on logarithmic scales to better illustrate the differences. For an explanation of boxplots, see Figure 1.

Table 4. The predictor variables that increased the classification accuracy of the various methods for Data Set 2. The value given for random forests is the mean decrease in accuracy, while logistic regression provides p-values. The most important variables in logistic regression and random forest models are shown in bold. For further explanation, see Table 2.

Table 5. The results of cross-validating random forests, logistic regression, and LDA classification and prediction for Data Set 2 (association of water masers with infrared YSOs in the LMC) using the full sample of 32 sources with known water maser association status as the training sample. For definitions of classification results, see Section 2.4.

Figure 5. Receiver operating characteristic curves showing the results of the cross validation for Data Set 2. The diagonal line y = x represents randomly classifying the samples, with half predicted as positive and half as negative. For a full description of a ROC curve, see Section 2.4.3.

Figure 6. The MDS plot for the random forest model used to predict potential YSOs with an associated water maser in the LMC (Data Set 2). ‘Border-line’ predictions were samples with a predicted maser association between 45 and 55%. For details on multidimensional plots in random forest analysis, see Figure 3.

Figure 7. The variables used in the classification and prediction of Data Set 3. Some of the variables are on logarithmic scales to better illustrate the differences. For an explanation of boxplots, see Figure 1.

Table 6. The predictor variables that increased the classification accuracy of the various methods for Data Set 3. The value given for random forests is the mean decrease in accuracy, while logistic regression provides p-values. The most important variables in logistic regression and random forest models are shown in bold. For further explanation, see Table 2.

Figure 8. Receiver operating characteristic curves showing the results of the cross validation for Data Set 3. The diagonal line y = x represents randomly classifying the samples, with half predicted as positive and half as negative.

Table 7. The results of cross-validating random forests, logistic regression, and LDA classification and prediction for Data Set 3 (class I methanol masers associated with GLIMPSE sources). Figure 8 shows the ROC curve for each of the models. For definitions of classification results, see Section 2.4.

Figure 9. The MDS plot for the random forest model used to predict potential millimetre dust-clumps with associated class I methanol masers (Data Set 3). ‘Border-line’ classifications were samples with a predicted maser association between 45 and 55%. For details on multidimensional plots in random forest analysis, see Figure 3.

Table 8. The classification results on the training data subset (where the maser presence is known), and the number of predicted masers from the 8 144 sources for which maser presence is unknown, using Data Set 3 (class I methanol masers associated with GLIMPSE sources). For definitions of classification results, see Section 2.4.

Table 9. Number of maser predictions on sources from Data Set 3 shared by two classification methods, with 242 sources predicted to be masers using all four methods.

Figure 10. The integrated flux density versus the beam averaged H₂ column density for the 8 144 BGPS sources not searched for class I methanol masers by Chen et al. (2012). Sources for which one or more of the classification models predicts the presence of a class I methanol maser are represented with red dots; other sources are represented with black dots. The blue line shows the criteria developed by Chen et al. (2012) to identify BGPS sources likely to have an associated class I methanol maser.

Table A1. Bolocam Galactic Plane sources for which one or more of the mathematical classification models predicted the presence of an associated class I methanol maser (probability of a maser > 0.5). The maser probability for each model is listed; values that exceed 0.5 are in bold type. This list contains a total of 739 sources that were predicted to be masers by at least one of the four methods (242 of which were predicted by all methods), from a total of 8 144 sources in version 1.0.1 of the Bolocam catalogue.
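The counts quoted in Table 9 and Table A1 (sources shared by pairs of methods, predicted by at least one method, and predicted by all methods) could be tallied along the following lines. This is a sketch under stated assumptions: the method names, source identifiers, and probabilities are hypothetical, and only the 0.5 threshold comes from the caption.

```python
from itertools import combinations


def predicted_sets(probabilities, threshold=0.5):
    """Map each method to the set of source IDs with probability > threshold."""
    return {method: {src for src, p in by_source.items() if p > threshold}
            for method, by_source in probabilities.items()}


def overlap_summary(sets):
    """Return pairwise shared counts, the at-least-one count, and the all-methods count."""
    pairwise = {(a, b): len(sets[a] & sets[b])
                for a, b in combinations(sorted(sets), 2)}
    any_method = set().union(*sets.values())
    all_methods = set.intersection(*sets.values())
    return pairwise, len(any_method), len(all_methods)


# Hypothetical maser probabilities per method and source.
probabilities = {
    "random_forest": {"src1": 0.90, "src2": 0.40, "src3": 0.70},
    "logistic": {"src1": 0.80, "src2": 0.60, "src3": 0.20},
    "lda": {"src1": 0.70, "src2": 0.30, "src3": 0.60},
    "lda_transformed": {"src1": 0.95, "src2": 0.10, "src3": 0.55},
}
pairwise, n_any, n_all = overlap_summary(predicted_sets(probabilities))
print(pairwise, n_any, n_all)
```

Applied to the real per-model probabilities, the at-least-one and all-methods counts would correspond to the 739 and 242 sources quoted above.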