Measuring photometric redshifts for high-redshift radio source surveys

Published online by Cambridge University Press:  19 July 2023

K. J. Luken*
Affiliation:
School of Science, Western Sydney University, Kingswood, NSW, Australia; Data61, CSIRO, Epping, NSW, Australia
R. P. Norris
Affiliation:
School of Science, Western Sydney University, Kingswood, NSW, Australia; CSIRO Space & Astronomy, Australia Telescope National Facility, Epping, NSW, Australia
X. R. Wang
Affiliation:
Data61, CSIRO, Epping, NSW, Australia; Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, Australia
L. A. F. Park
Affiliation:
Centre for Research in Mathematics and Data Science, Western Sydney University, Sydney, Australia
Y. Guo
Affiliation:
Data61, CSIRO, Epping, NSW, Australia
M. D. Filipović
Affiliation:
School of Science, Western Sydney University, Kingswood, NSW, Australia
*
Corresponding author: K. J. Luken; Email: kieran@luken.au

Abstract

With the advent of deep, all-sky radio surveys, the need for ancillary data to make the most of the new, high-quality radio data from surveys like the Evolutionary Map of the Universe (EMU), the GaLactic and Extragalactic All-sky Murchison Widefield Array survey eXtended, the Very Large Array Sky Survey, and the LOFAR Two-metre Sky Survey is growing rapidly. Radio surveys detect significant numbers of Active Galactic Nuclei (AGNs) and have a significantly higher average redshift than optical and infrared all-sky surveys. Traditional methods of estimating redshift are therefore challenged: spectroscopic surveys do not reach the redshift depth of radio surveys, and AGNs make it difficult for template-fitting methods to model sources accurately. Machine Learning (ML) methods have been used, but efforts have typically been directed towards optically selected samples, or samples at significantly lower redshift than expected from upcoming radio surveys. This work compiles and homogenises a radio-selected dataset from both the northern hemisphere (using Sloan Digital Sky Survey optical photometry) and the southern hemisphere (using Dark Energy Survey optical photometry). We then test commonly used ML algorithms such as k-Nearest Neighbours (kNN), Random Forest, ANNz, and GPz on this monolithic radio-selected sample. We show that kNN has the lowest percentage of catastrophic outliers, providing the best match for the majority of science cases in the EMU survey. The wider redshift range of the combined dataset allows redshifts to be estimated for sources up to $z = 3$ before random scatter begins to dominate. When binning the data into redshift bins and treating the problem as a classification problem, we correctly identify $\approx$76% of the highest-redshift sources (those at $z > 2.51$) as belonging to either the highest bin ($z > 2.51$) or the second highest ($z = 2.25$).
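
As a concrete illustration of the regression approach tested in this work, a kNN photometric-redshift estimator can be sketched in a few lines: predict each source's redshift as the mean redshift of its $k$ nearest neighbours in feature (colour/magnitude) space. The feature choice, value of $k$, and synthetic data below are purely illustrative, not the paper's actual configuration.

```python
import numpy as np

def knn_redshift(train_X, train_z, query_X, k=3):
    """Predict redshift as the mean of the k nearest training sources
    in feature space -- a minimal sketch of kNN regression."""
    preds = []
    for q in query_X:
        d = np.linalg.norm(train_X - q, axis=1)   # Euclidean feature distance
        nearest = np.argsort(d)[:k]               # indices of the k closest sources
        preds.append(train_z[nearest].mean())     # average their redshifts
    return np.array(preds)

# Toy demonstration: redshift correlates with a single synthetic "colour"
rng = np.random.default_rng(0)
train_X = rng.uniform(0, 3, size=(200, 1))           # mock colour feature
train_z = train_X[:, 0] + rng.normal(0, 0.05, 200)   # z ~ colour + noise
query_X = np.array([[1.0], [2.5]])
z_pred = knn_redshift(train_X, train_z, query_X, k=5)
```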

Information

Type
Research Article
Creative Commons
Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Astronomical Society of Australia

Figure 1. Histogram showing the density of sources at different redshifts in the SDSS Galaxy Sample (blue), SDSS QSO Sample (green), and the Square Kilometre Array Design Survey (SKADS, Levrier et al. 2009) simulation trimmed to expected EMU depth (Norris et al. 2011).

Figure 2. Cross-match completed between the NVSS Radio sample and the AllWISE Infrared sample. The blue line represents the straight nearest-neighbour cross-match between the two datasets, and the orange line represents the nearest-neighbour cross-match where 1$^{\prime}$ has been added to the declination of every radio source. The vertical black line denotes the chosen cutoff.
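
The offset-catalogue trick shown in this figure can be sketched numerically: cross-match the real positions, repeat the match after adding 1 arcmin to every declination so that only chance alignments remain, and compare the fraction of matches within the cutoff. The mock catalogues, source densities, and cutoff below are illustrative only, and the geometry is flat-sky with no spherical corrections.

```python
import numpy as np

def nearest_sep(cat_a, cat_b):
    """Separation (flat-sky approximation, degrees) from each source in
    cat_a to its nearest neighbour in cat_b."""
    seps = []
    for ra, dec in cat_a:
        d = np.hypot(cat_b[:, 0] - ra, cat_b[:, 1] - dec)
        seps.append(d.min())
    return np.array(seps)

rng = np.random.default_rng(1)
radio = rng.uniform(0, 10, size=(300, 2))                  # mock radio positions (deg)
ir = np.vstack([radio + rng.normal(0, 1e-3, radio.shape),  # true counterparts
                rng.uniform(0, 10, size=(3000, 2))])       # unrelated IR sources

real = nearest_sep(radio, ir)          # genuine cross-match separations
shifted = radio.copy()
shifted[:, 1] += 1 / 60.0              # add 1 arcmin to every declination
random_only = nearest_sep(shifted, ir) # chance-alignment separations

cutoff = 5e-3  # deg; matches closer than this treated as genuine
frac_real = (real < cutoff).mean()
frac_random = (random_only < cutoff).mean()
```

Comparing `frac_real` against `frac_random` indicates how contaminated the match is at a given cutoff, which is the reasoning behind the vertical line in the figure.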

Table 1. The source count in each sample compiled in Section 2.

Figure 3. Similar to Fig. 2, cross-matching the Northern Sky sample based on the NVSS and RGZ radio catalogues and AllWISE Infrared, with SDSS optical photometry and spectroscopic redshift. Note the different scale of the right-side y-axis.

Figure 4. Similar to Fig. 2, cross-matching the southern sky ATLAS sample with the AllWISE catalogue, matching the SWIRE positions with the AllWISE positions.

Figure 5. Similar to Fig. 2, cross-matching the DES optical photometry with the SDSS optical photometry and spectroscopic redshift catalogues.

Figure 6. Similar to Fig. 2, cross-matching the DES positions with the AllWISE Infrared catalogue. Note the different scale of the right-side y-axis.

Figure 7. Similar to Fig. 2, cross-matching the AllWISE positions with the Hodge et al. (2011) Radio catalogue. Note the different scale of the right-side y-axis.

Figure 8. Similar to Fig. 2, cross-matching the AllWISE positions with the Prescott et al. (2018) Radio catalogue. Note the different scale of the right-side y-axis.

Figure 9. Histogram showing the density of sources at different redshifts in the combined sample of RGZ (north), Stripe 82 (equatorial), and ATLAS (south) data (blue), and in the Square Kilometre Array Design Survey (SKADS, Levrier et al. 2009) simulation trimmed to the expected EMU depth (Norris et al. 2011).

Figure 10. A comparison of the g, r, i, and z filter responses, used by the DES (top), and the SDSS (bottom).

Figure 11. Plot showing the effects of homogenisation on the optical photometry. Each panel shows the original difference between the DES and SDSS photometry for a given band (noted in the subplot title) as a function of $g - z$ colour. The orange scatterplots are the original data, with the orange line showing a third-order polynomial fit to the pre-corrected data. The blue scatterplots are the corrected data, with the blue line showing the post-correction fit, highlighting the improvement the corrections bring.
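
The correction illustrated here can be sketched as follows: fit the per-band DES $-$ SDSS magnitude offset as a third-order polynomial in $g - z$ colour and subtract the fitted trend. The synthetic offsets and coefficients below are illustrative only, not the paper's fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
g_z = rng.uniform(-0.5, 3.0, 1000)                       # g - z colour
true_offset = 0.02 + 0.05 * g_z - 0.01 * g_z**2          # colour-dependent term
delta_mag = true_offset + rng.normal(0, 0.02, g_z.size)  # mock DES - SDSS + noise

coeffs = np.polyfit(g_z, delta_mag, deg=3)               # third-order polynomial fit
corrected = delta_mag - np.polyval(coeffs, g_z)          # subtract the fitted trend

rms_before = delta_mag.std()                             # scatter before correction
rms_after = corrected.std()                              # scatter after correction
```

After the correction, the residual scatter is set by the photometric noise alone rather than by the colour-dependent filter differences.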

Table 2. Example redshift bin boundaries used in the classification tests, calculated with the first random seed used. We show the bin index, the upper and lower bounds, and the predicted value for the bin.
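
A minimal sketch of constructing such bins: split the training redshifts into equally populated bins at quantile boundaries, and take the median redshift of each bin as its predicted value. The bin count and mock redshift distribution below are illustrative; the paper's actual boundaries are those listed in the table.

```python
import numpy as np

def make_bins(z, n_bins=5):
    """Equally populated redshift bins with a median predicted value each."""
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))      # bin boundaries
    labels = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bins - 1)
    preds = np.array([np.median(z[labels == i]) for i in range(n_bins)])
    return edges, labels, preds

rng = np.random.default_rng(3)
z = rng.gamma(2.0, 0.5, 2000)          # mock redshift distribution
edges, labels, preds = make_bins(z, n_bins=5)
```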

Figure 12. Optimisation of the Bayesian Information Criterion (top), test size (middle), and outlier rate (bottom) across a range of components.

Figure 13. Redshift distribution of the 30 GMM components. In each case the vertical axis shows the count.

Table 3. Regression results table comparing the different algorithms across the different error metrics (listed in the table footnotes). The best values for each error metric are highlighted in bold.

Table 4. Classification results table comparing the different algorithms across the different error metrics (listed in the table footnotes). The best values for each error metric are highlighted in bold.

Figure 14. Comparison of spectroscopic and predicted values using kNN Regression. The x-axis shows the spectroscopic redshift, with the y-axis (Top) showing the redshift estimated by the ML model. The y-axis (Bottom) shows the normalised residual between spectroscopic and predicted values as a function of redshift. The turquoise dash-dotted line shows a perfect correlation, and the blue dashed lines show the boundaries accepted by the $\eta_{0.15}$ outlier rate. The colour bar shows the density of points per coloured point.

Figure 15. Same as Fig. 14 but for RF Regression.

Figure 16. Same as Fig. 14 but for ANNz Regression.

Figure 17. Same as Fig. 14 but for GPz Regression.

Figure 18. Confusion matrix showing the results using the kNN classification algorithm. The size of the boxes is approximately scaled (with the exception of the final, highest redshift boxes) with the width of the classification bin. The x-axis shows the spectroscopic redshift, and the y-axis shows the predicted redshift. The left panel is an exploded subsection of the overall right panel.

Figure 19. Same as Fig. 18, but for RF Classification.

Figure 20. Same as Fig. 18, but for ANNz Classification.

Figure 21. Same as Fig. 18, but for GPz Classification.

Figure 22. Comparison of the different algorithms in their regression and classification modes. In all cases, the lower the value the better, with the lowest value for each metric shown with a horizontal dotted line.

Table 5. Classification results table comparing the different algorithms across the different error metrics (listed in the table footnotes), after combining the two highest redshift bins as discussed in Section 5.1. The best values for each error metric are highlighted in bold.

Figure 23. Comparing Regression with Classification over all methods, and all metrics.

Figure 24. Comparing Regression with Classification over all methods, and all metrics, after combining the highest redshift bin with the second highest, as discussed in Section 5.1.

Table 6. Outlier rates for each algorithm in its regression mode, showing the original outlier rate, the outlier rate of the subset of sources deemed ‘certain’, and the outlier rate of the remaining sources.

Figure A.1. Training on SDSS, Testing on DES, with the same axes and notation as Fig. 14.

Figure A.2. Training on SDSS, testing on DES (corrected), with the same axes and notation as Fig. 14.

Table A.1. Comparison between predictions using the kNN algorithm, trained on one subset of data (either the northern SDSS photometry or the southern DES photometry) and tested on the other.

Figure A.3. Scaled confusion matrices in similar style to Fig. 18, with the subfigures showing the effect of the photometry correction.

Figure A.4. Training on DES, testing on SDSS.

Figure A.5. Training on DES, testing on SDSS—corrected.

Figure A.6. Same as Fig. A.3, showing a different training/test set combination.

Figure A.7. Comparison of the $\eta$ outlier rates when trained on corrected and uncorrected data.

Figure B.1. Similar to the top panel of Fig. 14, comparing all predictions (left) with the predictions deemed ‘certain’ (middle) and those deemed ‘uncertain’ (right) using the kNN algorithm.

Figure B.2. As with Fig. B.1, using the RF algorithm.

Figure B.3. As with Fig. B.1, using the ANNz algorithm.

Figure B.4. As with Fig. B.1, using the GPz algorithm.