
EMUSE: Evolutionary Map of the Universe Search Engine

Published online by Cambridge University Press:  01 July 2025

Nikhel Gupta*
Affiliation:
Australia Telescope National Facility, CSIRO, Space & Astronomy, Bentley, WA, Australia
Zeeshan Hayder
Affiliation:
CSIRO Data61, Black Mountain, ACT, Australia
Minh Huynh
Affiliation:
Australia Telescope National Facility, CSIRO, Space & Astronomy, Bentley, WA, Australia
International Centre for Radio Astronomy Research (ICRAR), M468, The University of Western Australia, Crawley, WA, Australia
Ray Norris
Affiliation:
Western Sydney University, Penrith, NSW, Australia
Australia Telescope National Facility, CSIRO Space & Astronomy, Epping, NSW, Australia
Lars Petersson
Affiliation:
CSIRO Data61, Black Mountain, ACT, Australia
Andrew Hopkins
Affiliation:
School of Mathematical and Physical Sciences, 12 Wally’s Walk, Macquarie University, Sydney, NSW, Australia
Simone Riggi
Affiliation:
INAF-Osservatorio Astrofisico di Catania, Catania, Italy
Bärbel Silvia Koribalski
Affiliation:
Western Sydney University, Penrith, NSW, Australia
Australia Telescope National Facility, CSIRO Space & Astronomy, Epping, NSW, Australia
Miroslav D. Filipović
Affiliation:
Western Sydney University, Penrith, NSW, Australia
*Author for correspondence: Nikhel Gupta, Email: Nikhel.Gupta@csiro.au.

Abstract

We present Evolutionary Map of the Universe Search Engine (EMUSE), a tool designed for searching specific radio sources within the extensive datasets of the Evolutionary Map of the Universe (EMU) survey, with potential applications to other Big Data challenges in astronomy. Built on a multimodal approach to radio source classification and retrieval, EMUSE fine-tunes the OpenCLIP model on curated radio galaxy datasets. Leveraging the power of foundation models, our work integrates visual and textual embeddings to enable efficient and flexible searches within large radio astronomical datasets. We fine-tune OpenCLIP using a dataset of 2 900 radio galaxies, encompassing various morphological classes, including FR-I, FR-II, FR-x, R-type, and other rare and peculiar sources. The model is optimised using adapter-based fine-tuning, ensuring computational efficiency while capturing the unique characteristics of radio sources. The fine-tuned model is then deployed in EMUSE, allowing for seamless image and text-based queries over the EMU survey dataset. Our results demonstrate the model’s effectiveness in retrieving and classifying radio sources, particularly in recognising distinct morphological features. However, challenges remain in identifying rare or previously unseen radio sources, highlighting the need for expanded datasets and continuous refinement. This study showcases the potential of multimodal machine learning in radio astronomy, paving the way for more scalable and accurate search tools in the field. The search engine is accessible at https://askap-emuse.streamlit.app/ and can be used locally by cloning the repository at https://github.com/Nikhel1/EMUSE.

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Astronomical Society of Australia

1. Introduction

The Evolutionary Map of the Universe (EMU; Hopkins et al. Reference Hopkins2025) survey, conducted with the Australian Square Kilometre Array Pathfinder (ASKAP; Johnston et al. Reference Johnston2007; DeBoer et al. Reference DeBoer2009; Hotan et al. Reference Hotan2021), highlights the transformative role of modern radio interferometers in cosmic exploration. Over its five-year duration, the survey aims to detect more than 20 million compact and extended radio galaxies, providing an unprecedented dataset that will significantly enhance our understanding of galaxy evolution and the Universe’s history. Additionally, such extensive data are expected to unveil new astrophysical phenomena and offer deeper insights into the origins of radio emissions. However, achieving these scientific objectives requires moving beyond conventional data mining techniques. Instead, innovative approaches are needed to analyse, organise, and classify the vast amounts of radio galaxy data, leveraging multiwavelength observations to unlock the survey’s full potential.

In recent years, machine learning has become a powerful tool for analysing data from the next generation of radio telescopes (e.g. Mostert et al. Reference Mostert2021; Gupta et al. Reference Gupta2022; Walmsley et al. Reference Walmsley2022; Segal et al. Reference Segal2023; Alegre et al. Reference Alegre2022; Gupta et al. Reference Gupta2023; Lochner et al. Reference Lochner, Rudnick, Heywood, Knowles and Shabala2023; Gupta et al. Reference Gupta2023; Slijepcevic et al. Reference Slijepcevic2024; Mohale & Lochner Reference Mohale and Lochner2024; Gupta et al. Reference Gupta, Hayder, Norris, Huynh and Petersson2024a; Lastufka et al. Reference Lastufka2024; Gupta et al. Reference Gupta2024b; Riggi et al. Reference Riggi2024; Lochner & Rudnick Reference Lochner and Rudnick2025; Mostert et al. Reference Mostert2024; Lao et al. Reference Lao2025; Gupta et al. Reference Gupta2025). These techniques have significantly accelerated both the discovery of new radio morphologies and the detection, classification, and cataloguing of radio sources. Beyond the approaches employed in these studies, emerging models with multimodal capabilities offer new opportunities to enhance the analysis of Big Data from radio telescopes. For instance, foundation models, which are large-scale deep learning architectures pre-trained on diverse datasets, can be adapted for radio astronomy tasks. These models, such as Generative Pre-training Transformer (GPT; Brown et al. Reference Brown2020), Contrastive Language-Image Pre-training (CLIP; Radford et al. Reference Radford2021), and vision-language models like Gemini (Team et al. Reference Team2023), have demonstrated remarkable capabilities in cross-modal understanding and pattern recognition. By leveraging foundation models, we can further improve the detection, classification, and retrieval of radio sky data. Their ability to integrate information from multiple data modalities (e.g. radio, infrared, optical) enables more robust source identification and classification (e.g. Jia et al. Reference Jia2021; Alayrac et al. Reference Alayrac2022; Radford et al. Reference Radford2021; Ramesh et al. Reference Ramesh, Dhariwal, Nichol, Chu and Chen2022; Rombach et al. Reference Rombach, Blattmann, Lorenz, Esser and Ommer2022). Additionally, their adaptability through fine-tuning and zero-shot learning (Footnote a; e.g., Bommasani et al. Reference Bommasani2021; Yu et al. Reference Yu2022; Touvron et al. Reference Touvron2023) allows for more efficient exploration of large-scale surveys, making them valuable tools for future radio astronomy research.

Pre-training multimodal foundation models requires vast image-text datasets and significant computational resources. The lack of open-source models in this domain further hinders progress. Recently, Parker et al. (Reference Parker2024) pre-trained a multimodal model on galaxy data using optical imaging and spectral information, applying it to downstream tasks. Similarly, Riggi et al. (Reference Riggi2025) pre-trained a small vision language model on radio images and image-caption pairs with a focus on downstream generative tasks. However, research on multimodal model pretraining suggests that while pretraining strategies influence downstream performance, the primary objective of pre-training should be to develop robust, generalisable features rather than domain-specific ones. Domain adaptation is generally more effective when achieved through fine-tuning on task-specific datasets (see, e.g., Fayou et al. Reference Fayou, Ngo, Sek and Meng2024; Manzoor et al. Reference Manzoor2023). Notably, Tanoglidis & Jain (Reference Tanoglidis and Jain2024) employed GPT-4o and LLaVA-NeXT pre-trained models for zero-shot classification of low-surface-brightness galaxies and artifacts, as well as for morphological galaxy classification. Their findings indicate that, with natural language prompts, these models achieved high classification accuracy (typically above 80%) without additional fine-tuning. Thus, leveraging a pre-trained model trained on general real-world data is a promising approach for fine-tuning domain-specific tasks while eliminating pre-training costs. In a recent work, Cherti et al. (Reference Cherti2023) trained CLIP using the public LAION dataset (Schuhmann et al. Reference Schuhmann2022), which includes an English image-text subset of 2.32 billion real-world samples, to produce OpenCLIP – a large, publicly available image-text model – using approximately 1 520 NVIDIA A100 GPUs. This enables the design of downstream tasks using OpenCLIP as a foundation model pre-trained on a vast image-text dataset.

In this work, we develop a framework to fine-tune the OpenCLIP model on the RadioGalaxyNET dataset (Gupta et al. Reference Gupta, Hayder, Norris, Huynh and Petersson2024a) derived from the Evolutionary Map of the Universe first pilot survey (EMU-PS1; Norris et al. Reference Norris2021a) using a single H100 GPU. We then leverage the fine-tuned model to develop EMUSE (Evolutionary Map of the Universe Search Engine; Footnote b), an application that performs similarity search on the first-year observations of the EMU main survey (Hopkins et al. Reference Hopkins2025). EMUSE enables users to explore data and identify similar radio sources through image or text-based queries, allowing for rapid searches of specific radio source classes. This capability is crucial for building statistically robust samples of well-known categories, such as FR-I and FR-II galaxies, as well as for discovering additional examples of rare and peculiar systems. Such samples are essential for investigating population properties, analysing the distribution of morphological types, and tracing their evolution across cosmic time. Additionally, EMUSE lays the groundwork for developing advanced tools that can rapidly extract meaningful insights and discover new phenomena from the Big Data produced by next-generation multiwavelength surveys.

The paper is organised as follows. In Section 2, we provide details on the EMU survey, infrared observations and object detection-based EMU catalogues. Section 3 is dedicated to the foundation models and our fine-tuning approach. Section 4 provides comprehensive information about the EMUSE application. Our findings are summarised in Section 5, where we also outline directions for future research.

Figure 1. Overview of EMUSE (Evolutionary Map of the Universe Search Engine). Starting with the open-source OpenCLIP model, which is pre-trained on approximately 2.3 billion image-text pairs from the LAION dataset, we further fine-tuned it using an image-text dataset of extended radio sources in the EMU-PS1 survey. The fine-tuned model is then used to generate image embeddings of EMU sources based on PNG images from the EMU and AllWISE surveys at the positions of extended radio sources identified in the RG-CAT catalogue. The fine-tuned model, along with the generated image embeddings and catalogue metadata – which includes sky position, integrated flux, and host galaxy information – is integrated into the EMUSE application framework to retrieve similar sources. EMUSE facilitates the search of the embedding database and outputs a table of EMU survey radio sources that are similar to a given image or text prompt. The search engine is accessible at https://askap-emuse.streamlit.app/ and can be used locally by cloning https://github.com/Nikhel1/EMUSE.

2. Data

This section presents an overview of the EMU survey, infrared observations, and the catalogues generated through object detection used in this study.

2.1. EMU observations

The Evolutionary Map of the Universe (EMU; Footnote c; Hopkins et al. Reference Hopkins2025) is a large-scale radio survey being conducted with the Australian Square Kilometre Array Pathfinder (ASKAP; Hotan et al. Reference Hotan2021) to map the southern sky. ASKAP, located at Inyarrimanha Ilgari Bundara, the Murchison Radio-astronomy Observatory (MRO), consists of 36 antennas, with most within a 2.3 km diameter and six extending to 6.4 km baselines. The survey includes 853 tile footprints from 1 014 observations, with 692 tiles having 10-h integrations and 161 tiles observed twice for 5-h integrations. EMU covers declinations from $-11^{\circ}.4$ to the south celestial pole and selected equatorial regions up to $\delta = +7^{\circ}.0$, observing in the 800–1 088 MHz band, centred at 944 MHz. The RMS noise ranges from 25 to $55~\mu$Jy/beam, with a $13^{\prime\prime} \times 11^{\prime\prime}$ beamwidth. By 2028, EMU aims to detect up to 20 million radio sources over $2\pi$ sr of the sky. This study uses data from EMU’s first-year observations (see Gupta et al. Reference Gupta2025, for details), covering 160 tiles (4 500 square degrees). Data collection commenced in late 2022, with validated data arriving between February 2023 and March 2024. The dataset, accessed via the CSIRO Data Access Portal (CASDA; Footnote d), consists of image tiles and Selavy-based catalogues (Whiting & Humphreys Reference Whiting and Humphreys2012) with Scheduling Block IDs (SBID) from 45 638 to 59 612. We use restored images at a uniform $15^{\prime\prime}$ resolution per beam (identified by the ‘conv’ filename suffix in CASDA). For the 160 tiles in the first-year dataset, this amounts to approximately 3 million detected radio sources. Each tile is analysed independently rather than combined into super mosaics, which may lead to duplicate detections in overlapping regions.

2.2. Infrared observations

In addition to the EMU observations, we generate 160 corresponding tiles for the AllWISE dataset from the Wide-field Infrared Survey Explorer (WISE; Wright et al. Reference Wright2010; Cutri et al. Reference Cutri2021) using the Montage image mosaic software (Footnote e). WISE conducted an all-sky infrared survey across four bands (W1, W2, W3, and W4) at wavelengths of 3.4, 4.6, 12, and 22 $\mu$m, respectively. This study focuses on the W1 band from AllWISE, which provides a 5$\sigma$ point source detection limit of 28 $\mu$Jy and an angular resolution of $8.5^{\prime\prime}$.

2.3. Catalogues from RG-CAT pipeline

We use the RG-CAT catalogue construction pipeline (Gupta et al. Reference Gupta2024b), which integrates the Gal-DINO (Footnote f) object detection framework (Gupta et al. Reference Gupta, Hayder, Norris, Huynh and Petersson2024a) to catalogue radio sources systematically. Gal-DINO is designed to detect radio galaxies and identify their probable infrared hosts. It is trained on 5 000 radio galaxies, including 2 800 from the RadioGalaxyNET dataset (Gupta et al. Reference Gupta, Hayder, Norris, Huynh and Petersson2024a), spanning FR-I, FR-II, FR-x, and R-type classifications based on peak separation and total extent (Fanaroff & Riley Reference Fanaroff and Riley1974). FR-I galaxies have a peak-to-extent ratio below 0.45, FR-II above 0.55, FR-x between 0.45 and 0.55, and R-type sources show resolved double jet emission with a single visible central peak (ratio = 0; Norris et al. submitted). The dataset is further expanded in Gupta et al. (Reference Gupta2024b) with 2 100 compact/unresolved galaxies and 100 rare morphologies, including bent-tailed galaxies, cluster halo emissions, and Odd Radio Circles (ORCs; Norris et al. Reference Norris2021b). Gal-DINO refines bounding box and keypoint predictions for identifying radio sources and their infrared hosts. The performance evaluation yields an average precision with 50% intersection over union (IoU), i.e., AP$_{50}$, of 73.2% for bounding boxes and 71.7% for keypoints, with 99% of central bounding boxes achieving IoU > 0.5 and 98% of keypoints located within $<3^{\prime\prime}$ of their true host positions (see Gupta et al. Reference Gupta2024b). We extend RG-CAT from EMU-PS1 to the first-year EMU main survey tiles, generating $8^{\prime} \times 8^{\prime}$ cutouts for approximately 3 million Selavy-based sources. Each cutout is analysed with Gal-DINO to extract bounding boxes, categories, and confidence scores, assembling a catalogue per tile. Compact sources are catalogued individually, while extended galaxies are grouped. A detailed catalogue of radio sources and host galaxies will be presented in Gupta et al. (in preparation), while this study focuses on extended radio sources including rare morphologies.
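The ratio-based boundaries above map directly onto a simple classification rule. The following sketch encodes those thresholds; the function name and signature are our own and are shown purely for illustration.

```python
def fr_class(peak_separation: float, total_extent: float) -> str:
    """Assign a morphological class from the peak-to-extent ratio.

    Thresholds follow the convention described above: FR-I below 0.45,
    FR-x between 0.45 and 0.55, FR-II above 0.55, and R-type when only a
    single central peak is visible (ratio of 0).
    """
    if total_extent <= 0:
        raise ValueError("total_extent must be positive")
    ratio = peak_separation / total_extent
    if ratio == 0:
        return "R"      # resolved double jets with a single visible central peak
    if ratio < 0.45:
        return "FR-I"
    if ratio > 0.55:
        return "FR-II"
    return "FR-x"       # intermediate cases
```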

3. Foundation models and fine-tuning

Foundation models capture broad, transferable knowledge and can be fine-tuned to perform specific tasks in astronomy using relatively small amounts of labelled data. In this work, we fine-tune OpenCLIP, a multimodal foundation model, using radio source images and their corresponding textual descriptions. This enables the model to learn the unique visual and semantic features of radio sources. As a result, it can support downstream tasks such as retrieving similar images based on a query image or a text prompt. In this section, we discuss multimodal foundation models and provide details on fine-tuning OpenCLIP for the radio source dataset. Figure 1 provides an overview of our framework.

3.1. Multimodal foundation models

Foundation models have recently gained significant attention for their ability to integrate and process multiple modalities, such as images and text, within a unified framework. Multimodal image-text foundation models, in particular, have demonstrated remarkable capabilities in bridging the gap between vision and language, enabling applications like image captioning and visual question answering (e.g., Ramesh et al. Reference Ramesh, Dhariwal, Nichol, Chu and Chen2022; Rombach et al. Reference Rombach, Blattmann, Lorenz, Esser and Ommer2022). These models are typically pre-trained on large-scale datasets containing paired image-text data, such as captions or descriptions, using self-supervised learning techniques (e.g., Wang et al. Reference Wang2021; Cherti et al. Reference Cherti2023). The self-supervised training paradigm leverages the inherent alignment between images and their corresponding textual descriptions to learn rich, joint representations without requiring explicit human annotations for every task. For instance, models like CLIP (Contrastive Language–Image Pre-training; Radford et al. Reference Radford2021) and ALIGN (Jia et al. Reference Jia2021) employ contrastive learning objectives, where the model learns to maximise the similarity between embeddings of matching image-text pairs while minimising it for non-matching pairs.

In contrast, GPT-based multimodal models extend the autoregressive language modelling paradigm of GPT to incorporate visual inputs (e.g., Alayrac et al. Reference Alayrac2022). These models are trained to predict the next token in a sequence, enabling them to generate coherent text conditioned on both textual and visual inputs. Unlike CLIP, which focuses on alignment, GPT-based models emphasise the generation of text based on multimodal inputs. Gemini represents a unified architecture that aims to seamlessly integrate multiple modalities into a single cohesive model (Team et al. Reference Team2023). Unlike CLIP, which separates vision and language encoders, and GPT-based models, which primarily extend language models to handle visual inputs, Gemini is designed to natively process multiple modalities (e.g., text, images, audio, video) within a single architecture. Similarly, models like MultiMAE (Multi-modal Multi-task Masked Autoencoders Bachmann et al. Reference Bachmann, Mizrahi, Atanov and Zamir2022) use masked reconstruction tasks, where parts of the input (e.g., patches of an image or words in a sentence) are masked, and the model is trained to reconstruct them based on the remaining context.

3.2. Fine-tuning foundation model

The success of multimodal image-text foundation models lies in their ability to generalise across diverse tasks and domains by leveraging the complementary information in both modalities. By pre-training on vast amounts of image-text pairs, these models capture intricate cross-modal relationships, enabling them to excel in downstream tasks with minimal fine-tuning (e.g., Yu et al. Reference Yu2022; Cherti et al. Reference Cherti2023; Touvron et al. Reference Touvron2023). Furthermore, the self-supervised nature of their training allows them to scale effectively with increasing data and computational resources, leading to emergent capabilities such as zero-shot or few-shot generalisation (e.g., Bommasani et al. Reference Bommasani2021; Jia et al. Reference Jia2021; Wang et al. Reference Wang2021; Alayrac et al. Reference Alayrac2022). Despite their successes, several challenges persist. These include the need for high-quality, diverse datasets for pre-training and the substantial computational resources required to train and deploy large-scale models. The limited availability of open-source multimodal foundation models has also hindered their adoption in specialised fields like astronomy. However, recent collaborative efforts have led to the release of open-source multimodal pre-trained models, making them accessible to the broader research community.

In this study, we use OpenCLIP (Cherti et al. Reference Cherti2023), an open-source multimodal foundation model, trained on 2.32 billion real-world image-text pairs sourced from the publicly accessible LAION dataset (Schuhmann et al. Reference Schuhmann2022). OpenCLIP is based on the CLIP architecture (Radford et al. Reference Radford2021) and employs a Contrastive-Captioning (CoCa; Yu et al. Reference Yu2022) framework that combines contrastive learning and generative captioning into a single unified model. Contrastive learning aligns image and text embeddings in a shared latent space, while generative captioning produces descriptive captions for images. This dual-objective approach allows OpenCLIP to serve as a strong foundation model for both discriminative and generative multimodal tasks. LAION is one of the largest open datasets for vision-language research, containing diverse and noisy web-scraped data that enable the model to learn robust cross-modal representations. By leveraging this vast amount of paired data, OpenCLIP achieves strong performance across a variety of tasks, including zero-shot image classification, cross-modal retrieval, and visual question answering. Fine-tuning OpenCLIP for specific downstream tasks is facilitated by its modular architecture and compatibility with widely used deep learning frameworks such as PyTorch. Users can refine the model by updating all parameters or employing parameter-efficient approaches, such as linear probing or adapter-based fine-tuning. In linear probing, only a task-specific classification head is trained while keeping the pre-trained weights fixed. This makes it a computationally efficient strategy, particularly for applications with limited labelled data. For more complex tasks, full fine-tuning enables the model to adapt its learned representations to the specific characteristics of the target domain. Furthermore, OpenCLIP allows for customisation through modifications to its training pipeline, providing flexibility to explore alternative objectives, optimisers, and data augmentation techniques.
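As an illustration of the parameter-efficient route described above, the sketch below loads a pre-trained OpenCLIP checkpoint, freezes its weights, and attaches small trainable adapters to the image and text embeddings. The backbone name, checkpoint tag, adapter layout, and learning rate are assumptions made for illustration; they are not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import open_clip

# Load a pre-trained OpenCLIP model and its preprocessing transform.
# The LAION-2B ViT-B/32 checkpoint is assumed here purely for illustration.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Parameter-efficient fine-tuning: freeze the pre-trained weights and train
# only small adapter heads on top of the frozen encoders.
for p in model.parameters():
    p.requires_grad = False

class Adapter(nn.Module):
    """Residual bottleneck adapter (dimensions are illustrative)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                 nn.Linear(bottleneck, dim))
    def forward(self, x):
        return x + self.net(x)

embed_dim = 512  # embedding size of ViT-B/32; adjust for other backbones
image_adapter, text_adapter = Adapter(embed_dim), Adapter(embed_dim)
optimizer = torch.optim.AdamW(
    list(image_adapter.parameters()) + list(text_adapter.parameters()), lr=1e-4)
```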

We use the RadioGalaxyNET dataset (Gupta et al. Reference Gupta, Hayder, Norris, Huynh and Petersson2024a) to fine-tune the pre-trained OpenCLIP model. The dataset includes 2 800 FR-I, FR-II, FR-x, and R-type radio galaxies, along with their corresponding infrared hosts. Following Gupta et al. (Reference Gupta2024b), we incorporate an additional category containing 100 peculiar sources and other rare morphologies. For each of these radio sources, we generate $4^{\prime} \times 4^{\prime}$ image cutouts from the EMU-PS1 survey and corresponding cutouts from the AllWISE survey. The host galaxy position is used as the cutout centre, ensuring that the full extent of the radio emission is captured. These cutouts are saved as PNG (Portable Network Graphics) images, with the first two channels containing radio cutouts. Data clipping is applied between the 50th percentile level and the maximum values of the 99th and 99.9th percentiles for the first and second channels, respectively. The third channel contains the AllWISE W1 band image. We expand the labels for these radio galaxies by incorporating morphological descriptions and textual variations (see examples in Table A5), and by adding additional information based on their subcategories (Norris et al. submitted). For instance, an FR-II radio galaxy that exhibits a bent-tailed structure is labelled as: ‘An image of an FR-II or Fanaroff-Riley type II radio galaxy with edge-brightened lobes bent at an angle.’ Similarly, an ORC (an extragalactic, edge-brightened ring-like radio structure surrounding a distant host galaxy, which typically lacks detectable emission at other wavelengths beyond its host but can exhibit diffuse radio emission within the bright ring; Norris et al. Reference Norris2025) is labelled as: ‘An image of a peculiar radio galaxy classified as an Odd Radio Circle.’ Additional sub-categories include HyMORS (hybrid morphology radio sources), which exhibit an FR-I appearance on one side of the core and an FR-II appearance on the other; DDRGs (double-double radio galaxies), often interpreted as ‘restarted’ radio galaxies; resolved star-forming radio galaxies; and core-dominated radio galaxies, where the radio emission associated with the host galaxy is significantly brighter than the lobes.
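The cutout preparation can be sketched as follows: the first two PNG channels hold the radio cutout clipped at the 50th–99th and 50th–99.9th percentiles, and the third channel holds the AllWISE W1 cutout. File paths, helper names, and the W1 scaling are assumptions; cutout extraction around the host position is assumed to have been done upstream.

```python
import numpy as np
from astropy.io import fits
from PIL import Image

def to_uint8(data, lo_pct, hi_pct):
    """Clip between the given percentiles and rescale to the 0-255 range."""
    lo, hi = np.nanpercentile(data, [lo_pct, hi_pct])
    clipped = np.clip(np.nan_to_num(data, nan=lo), lo, hi)
    return ((clipped - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)

def make_cutout_png(radio_cutout_fits, wise_w1_cutout_fits, out_png):
    """Build a radio-radio-infrared PNG from pre-extracted FITS cutouts."""
    radio = fits.getdata(radio_cutout_fits).squeeze()
    w1 = fits.getdata(wise_w1_cutout_fits).squeeze()
    ch1 = to_uint8(radio, 50, 99.0)   # radio, clipped at the 99th percentile
    ch2 = to_uint8(radio, 50, 99.9)   # radio, clipped at the 99.9th percentile
    ch3 = to_uint8(w1, 1, 99.5)       # AllWISE W1; scaling here is assumed
    Image.fromarray(np.dstack([ch1, ch2, ch3])).save(out_png)
```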

Figure 2. Model accuracy evaluated on the test set after each epoch. Error bars represent the variance, calculated by fine-tuning and testing the model 10 times with randomly drawn training and test sets.

Figure 3. The top panel shows the confusion matrix comparing ground truth labels to predicted labels for each main category. The displayed values are averaged over 10 training iterations. The bottom panel shows the UMAP projection generated from the image embeddings produced by the model’s image encoder, illustrating that different ground truth categories cluster in distinct regions. The plotted points include test sets from all 10 training iterations.

Using the radio and infrared image cutouts of sources along with the expanded text descriptions, we fine-tune the pre-trained OpenCLIP model on a single NVIDIA H100 GPU for 100 epochs, which takes approximately 1.5 h. We employ adapter-based fine-tuning, which allows the model to adapt its learned representations to the characteristics of radio sources. Given that OpenCLIP combines both the contrastive and generative sides into a single unified architecture, we focus solely on the contrastive side during fine-tuning. This approach encourages embeddings of matching image-text pairs to be close together while pushing non-matching pairs apart, thereby enabling zero-shot retrieval tasks for EMU data. To evaluate the model’s performance, we split the radio source dataset into an 80:20 ratio for training and testing. The training and testing data are randomly sampled from the full set 10 times, and the OpenCLIP model is trained separately on each iteration of the randomly selected training data. The trained models are then tested on independently selected test data, also drawn randomly 10 times. Figure 2 presents the accuracy over 100 training epochs. The error bars reflect the variance in test results across the 10 training iterations. The figure indicates that accuracy exceeds 50% after a single epoch and gradually increases to $84\pm3$ % after 100 epochs. Notably, while the model is trained on images paired with their expanded text descriptions, we assess its accuracy using only the main categories – FR-I, FR-II, FR-x, R, and Peculiar – during testing. Top panel of Figure 3 shows the confusion matrix for these main categories. The values shown are averaged across 10 training iterations. The results demonstrate that the fine-tuned model predicts these categories with high accuracy overall, although there is greater confusion between FR-I and FR-x sources. This is expected, as the primary distinction between these two categories lies in the peak-to-extent ratio (as described in Section 2.3). In contrast, confusion is much lower for the Peculiar category, despite it having the smallest training sample size. Bottom panel of Figure 3 displays the Uniform Manifold Approximation and Projection (UMAP, McInnes, Healy, & Melville Reference McInnes, Healy and Melville2018) projection of image embeddings from the model, with points representing sources in test sets across all 10 training runs. This highlights how different ground truth categories form distinct clusters, while also revealing overlaps that align with the patterns seen in the confusion matrix. Additionally, although the accuracy and confusion matrix evaluations are based on training with 80% of the data, we fine-tune the final model using 100% of the radio source dataset. This ensures that all available image-text pairs are utilised to train the final model used for the EMU search engine.
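A single contrastive fine-tuning step consistent with this description is sketched below: matching image-text pairs are pulled together and non-matching pairs pushed apart through the standard symmetric cross-entropy over cosine-similarity logits. The sketch reuses the frozen model and adapters from the earlier snippet; batching, augmentation, and the exact optimisation settings are assumptions rather than the configuration used in this work.

```python
import torch
import torch.nn.functional as F

def contrastive_step(model, image_adapter, text_adapter, images, texts,
                     optimizer, device="cuda"):
    """One adapter-training step on the contrastive side only.

    `images` are preprocessed image tensors and `texts` tokenised captions;
    the adapters are assumed to already be on `device`."""
    images, texts = images.to(device), texts.to(device)
    with torch.no_grad():                      # backbone stays frozen
        img = model.encode_image(images).float()
        txt = model.encode_text(texts).float()
    img = F.normalize(image_adapter(img), dim=-1)
    txt = F.normalize(text_adapter(txt), dim=-1)
    logits = model.logit_scale.exp().detach() * img @ txt.t()
    labels = torch.arange(len(images), device=device)
    loss = 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```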

4. EMUSE application

We develop EMUSE (Evolutionary Map of the Universe Search Engine), a tool that employs similarity search using the fine-tuned model described in the previous section. We use catalogues generated by the RG-CAT pipeline (see Section 2.3), which employs the Gal-DINO object detection model to process each EMU tile. We filter extended radio sources classified as FR-I, FR-II, FR-x, R, and Peculiar from the catalogues. From the 160 tiles observed during the first year of the EMU survey, we identify approximately 170 000 such extended radio sources where the prediction confidence score exceeds the minimum estimated threshold of the Gal-DINO model. Using the sky positions from the catalogues, we generate cutouts from the EMU and AllWISE surveys, which are saved as radio-radio-infrared channel PNG images. The fine-tuned model is then used to generate image embeddings for each PNG. Additionally, we store the corresponding catalogue metadata for each image embedding, including source positions, integrated radio flux, and the potential host name from the CatWISE catalogue (Marocco et al. Reference Marocco2021), as provided by the RG-CAT pipeline. Note that the potential host details provided here are based on estimates from the Gal-DINO model within the RG-CAT pipeline and have not been verified through visual inspection.
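A sketch of how such an embedding database might be assembled is shown below: every cutout PNG is passed through the fine-tuned image encoder, the L2-normalised embeddings are stored as a single array, and the matching RG-CAT metadata rows are kept alongside. The directory layout, file names, and the assumption that the catalogue rows are ordered like the PNGs are all illustrative.

```python
import glob
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from PIL import Image

@torch.no_grad()
def build_embedding_db(model, preprocess, png_dir, catalogue_csv,
                       out_npy="emu_embeddings.npy", device="cuda",
                       batch_size=256):
    """Encode every cutout PNG and save unit-normalised image embeddings."""
    paths = sorted(glob.glob(f"{png_dir}/*.png"))
    metadata = pd.read_csv(catalogue_csv)   # assumed: one row per PNG, same order
    model = model.to(device).eval()
    chunks = []
    for start in range(0, len(paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in paths[start:start + batch_size]]).to(device)
        emb = F.normalize(model.encode_image(batch).float(), dim=-1)
        chunks.append(emb.cpu().numpy())
    embeddings = np.concatenate(chunks)
    np.save(out_npy, embeddings)            # a few hundred MB instead of >150 GB of images
    return embeddings, metadata
```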

EMUSE implements a zero-shot retrieval framework, enabling the model to generalise its knowledge to unseen classes or tasks without explicit training on those specific classes. In this work, we use the fine-tuned OpenCLIP multimodal model, which has been trained to produce aligned embeddings for images and text. Specifically, we generate embeddings for approximately 170 000 EMU survey radio sources from PNGs with radio and infrared channels, using the fine-tuned model. These embeddings replace the original images, which require over 150 GB of storage and are difficult to search efficiently for multiple queries. In contrast, the embeddings occupy only a few hundred megabytes, making the search engine viable. These embeddings are stored in a database and can be queried using either text queries (e.g., ‘radio galaxy with jets’) or image queries (e.g., a sample image of a radio source). The zero-shot capability arises from the model’s ability to retrieve similar sources based on the semantic alignment of embeddings in the shared latent space, without requiring additional training on specific classes or queries.

For a given text query, the input is first tokenised using the OpenCLIP tokeniser, and its embedding is obtained through the fine-tuned model’s text encoder. For an image query, the input image undergoes preprocessing using OpenCLIP’s standard pipeline, which includes resizing to $224\times224$ pixels, conversion to RGB and then to a PyTorch tensor, and normalisation with the model’s predefined mean and standard deviation values. The resulting image is then passed through the fine-tuned model’s image encoder to generate its embedding. To search for similar sources, we compute the similarity between the query embedding (either derived from a text or an image query) and the precomputed embeddings of the EMU survey source images as

(1) \begin{equation}S(\mathbf{q}, \mathbf{e}_i) = \frac{\mathbf{q} \cdot \mathbf{e}_i}{||\mathbf{q}||~||\mathbf{e}_i||},\end{equation}

where:

  • $ \mathbf{q} \in \mathbb{R}^d $ : The embedding of the query (text or image) in the shared latent space.

  • $ \mathbf{e}_i \in \mathbb{R}^d $ : The embedding of the i -th image in the database ( $ i = 1, 2, \ldots, N $ ).

  • $ S(\mathbf{q}, \mathbf{e}_i) $ : The cosine similarity function measures the alignment between the query and image embeddings, normalised between 0 and 1.

The top-k most similar image embeddings are retrieved as

(2) \begin{equation}\text{top-}k = \mathop{\mathrm{arg\,max}}_{i \in \{1, 2, \ldots, N\}}\, S(\mathbf{q}, \mathbf{e}_i).\end{equation}

The information corresponding to these top-k embeddings is then fetched from the RG-CAT catalogue metadata. This includes the EMU tile SBID where the source is located, its RA ( $\deg$ ), Dec ( $\deg$ ), integrated flux density (mJy), and potential host galaxy names from the CatWISE catalogue, along with the probability describing the estimated similarity between the query embedding $\mathbf{q}$ and the image embedding $\mathbf{e}_i$ . The following sections discuss examples of text and image queries.
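Putting Equations (1) and (2) together, the retrieval step can be sketched as follows; the function takes either a text string or a PIL image as the query and returns the top-k catalogue rows. The names and the pandas-based metadata store are assumptions for illustration, not the exact implementation in EMUSE.

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def emuse_search(query, model, tokenizer, preprocess, embeddings, metadata,
                 top_k=50, device="cuda"):
    """Zero-shot retrieval: cosine similarity (Eq. 1) and top-k selection (Eq. 2).

    `embeddings` is the (N, d) array of unit-normalised image embeddings and
    `metadata` the matching RG-CAT rows (a pandas DataFrame)."""
    model = model.to(device).eval()
    if isinstance(query, str):                       # text query
        q = model.encode_text(tokenizer([query]).to(device))
    else:                                            # image query (PIL image)
        q = model.encode_image(preprocess(query.convert("RGB"))
                               .unsqueeze(0).to(device))
    q = F.normalize(q.float(), dim=-1).cpu().numpy()[0]

    scores = embeddings @ q                          # Equation (1), unit vectors
    order = np.argsort(-scores)[:top_k]              # Equation (2), k best matches
    results = metadata.iloc[order].copy()
    results["similarity"] = scores[order]
    return results
```

A call such as emuse_search('A bent-tailed radio galaxy', ...) would then return a table analogous to Table A1.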

4.1. Text queries

We evaluate the zero-shot retrieval capability of the fine-tuned OpenCLIP model using various queries, presenting two examples for brevity. The application is publicly available, allowing readers to submit their queries. For instance, we search for ‘A bent-tailed radio galaxy’. Table A1 displays the EMUSE output, listing the top 50 most similar radio sources along with their potential host galaxies from RG-CAT. The number of displayed sources can be adjusted by modifying the minimum probability threshold and the desired number of results in the interface. Using the positions in Table A1, we present all 50 corresponding images in Figure A1, demonstrating that the fine-tuned model can efficiently retrieve bent-tailed radio sources across the EMU survey. For the second query, ‘Resolved star-forming radio galaxies’, the EMUSE results are shown in Table A2 and Figure A2, further highlighting the model’s ability to identify and classify such morphologies. While these examples showcase the model’s capability to interpret text queries and retrieve relevant image data, this performance is directly attributed to the fine-tuning applied in this work. Sources absent from the fine-tuning dataset – such as cluster relics and supernova remnants – may not be retrieved effectively.

Additionally, text-based queries in EMUSE currently underperform compared to image-based queries. For example, a simple search for ‘odd radio circle’ returns no results above a probability threshold of 0.9, while a more descriptive prompt, such as ‘An image of a peculiar radio galaxy classified as an Odd Radio Circle’, successfully retrieves relevant sources. Conversely, concise text like ‘FR-II’ yields meaningful matches, whereas longer, more complex phrases, such as ‘An image of an FR-II or Fanaroff-Riley type II radio galaxy with edge-brightened lobes bent at an angle’, often result in inconsistent or unrelated outputs. This inconsistency stems from the sensitivity of the model to phrasing and its reliance on the limited and sparse textual descriptions used during fine-tuning. Since the alignment between text and image embeddings depends heavily on how descriptions are written, the model struggles to interpret astronomy-specific language without sufficient contextual variety. While adding a broader range of textual descriptions could help, this approach is constrained by variability in human annotation styles. A more scalable and effective solution may involve augmenting the training data with language rewrites (Fan et al. Reference Fan, Krishnan, Isola, Katabi and Tian2023) and paraphrasing techniques (Kim et al. Reference Kim2024) or by leveraging large language models to generate richer and more diverse textual descriptions (e.g., Nguyen et al. Reference Nguyen, Gadre, Ilharco, Oh and Schmidt2023; Yu et al. Reference Yu2024; Chen et al. Reference Chen2024). These strategies could enhance the model’s ability to interpret different forms of scientific language and better align them with corresponding visual features, and should be explored in future work.

4.2. Image queries

For image-based queries, we demonstrate two examples: an FR-II radio galaxy and ORC J2103-6200 (Norris et al. Reference Norris2021b). We use EMU-PS1 images, open them in CARTA (Footnote g), and capture screenshots of these sources (see Figure 4). These screenshots are then used as query inputs to search the EMU survey. For the FR-II source shown in the left panel of Figure 4, the corresponding EMUSE results are presented in Table A3 and Figure A3. Notably, most of the retrieved sources exhibit emission from the core, which is consistent with the query image. Additionally, their sky orientation closely matches that of the input query, further demonstrating the model’s effectiveness in retrieving morphologically similar sources.

Figure 4. Example image queries for EMUSE. These figures are screenshots from the EMU-PS1 image, taken while being viewed in CARTA. The left panel shows an FR-II radio galaxy, while the right panel displays ORC J2103-6200 (Norris et al. Reference Norris2021b).

The EMUSE results for the ORC J2103-6200 image query are shown in Table A4, and in Figure A4. The first four sources include a starburst radio ring galaxy (SRRG), an ORC candidate, another SRRG, and a radio source without a plausible host galaxy, as also identified in Gupta et al. (Reference Gupta2025). Although the training set for fine-tuning included only two ORCs, the model successfully retrieves a known ORC candidate, several half-ring-like structures, and potential GLAREs (Galaxies with Large-scale Ambient Radio Emission; Gupta et al. Reference Gupta2025), which may represent an evolutionary stage of ORCs. This demonstrates the potential of EMUSE for discovering such rare radio sources, which will be enhanced by incorporating a larger training sample of these sources in future updates to the model. Further multi-wavelength visual inspections are needed to categorise the remaining sources in the figure. Due to the limited training data for ORCs, the model also retrieves resolved star-forming radio galaxies and other radio sources occupying similar embedding spaces to the image query. However, it also identifies Wide Angle Tailed (WAT) sources and other diffuse emissions, highlighting the need for more ORC examples in the training data.

Note that when a screenshot is used as a query input to a model trained on 3-channel images, the information in the image is typically replicated across all three channels to match the expected input format. Although the screenshot may lack the multi-channel radio and infrared details present in the training data, the model often still performs reasonably well. This is likely because high-level structural features, such as morphology and spatial patterns, are still available. While the resulting embeddings may not capture the full richness of the original data, such as distinguishing between resolved spirals and ORCs, they can still yield meaningful similarity results. Additionally, we find that different image queries – such as screenshots of this ORC taken from various sources (e.g., academic papers) or images of other previously identified ORCs and ORC candidates – yield different sets of sources in the similarity space. A comprehensive future study of similar sources obtained from various queries will help expand the catalogue of such rare systems.
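For a greyscale screenshot, this replication is simply what a standard RGB conversion performs before the OpenCLIP preprocessing pipeline; a minimal illustration (the file path is a placeholder):

```python
from PIL import Image

# A single-channel screenshot is replicated across the three input channels
# by a plain RGB conversion before being passed to the image encoder.
screenshot = Image.open("orc_screenshot.png").convert("L")  # placeholder path
query_image = screenshot.convert("RGB")                     # same values in R, G, B
# query_image can then be used as an image query, e.g. with the
# emuse_search sketch shown earlier.
```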

5. Conclusions

We explore the application of multimodal foundation models in the field of radio astronomy, specifically leveraging the power of OpenCLIP, an open-source pre-trained multimodal model, to classify and retrieve radio sources from the EMU survey. Radio astronomy, with its vast and complex datasets, benefits from advanced machine learning techniques that can efficiently process large amounts of data and provide insights into the nature of celestial objects. This paper aims to enhance the identification and retrieval of different types of radio galaxies by using the OpenCLIP model, which integrates both visual and textual information in a shared embedding space. The motivation behind this study is to bridge the gap between machine learning and astronomy, allowing for more accurate and efficient searches within large radio source databases.

In this work, we fine-tune the OpenCLIP model on a dataset of 2 900 radio galaxies from the RadioGalaxyNET dataset, which includes various morphological classes, such as FR-I, FR-II, FR-x, R-type, and peculiar radio sources. The fine-tuning is performed using adapter-based methods, ensuring that the model adapts effectively to the specific characteristics of radio sources while maintaining computational efficiency. The model is trained to map radio and infrared images to a shared latent space alongside their associated textual descriptions. Through this process, the model learns the complex relationships between image features and text, making it capable of performing zero-shot retrieval tasks without the need for additional task-specific training.

The fine-tuned OpenCLIP model is then integrated into the EMUSE (Evolutionary Map of the Universe Search Engine) application, enabling the efficient search and retrieval of radio sources from the EMU survey. By converting the images of radio sources into compact embeddings, the model reduces the data storage requirements and makes searching across large datasets feasible. The application allows users to query the database using both text and image-based inputs, providing a flexible and powerful tool for identifying and classifying radio galaxies. Notably, the zero-shot retrieval capabilities of the model allow it to generalise to new types of radio sources, making it adaptable to future discoveries without the need for retraining.

The results from the evaluation of the model demonstrate its effectiveness in retrieving radio sources based on both text and image queries. In particular, the model performs well in retrieving sources with specific morphological features. Additionally, the image query functionality highlights the model’s ability to recognise and retrieve similar sources with matching morphological features, even for complex objects like Odd Radio Circles. However, certain categories of radio sources that were absent from the fine-tuning dataset, such as supernova remnants, planetary nebulae, and cluster relics, may not be retrieved as accurately. This limitation highlights the importance of continuously expanding the training data to include a wider range of radio source types.

Future work should focus on extending the model to accommodate more complex datasets, enhancing its performance on rare or previously unseen radio sources, and integrating it with other astronomical databases to further expand its capabilities. Future work should also focus on improving the accessibility of the EMUSE application by displaying the source images from the catalogue generated through image and text queries. This functionality can be implemented by retrieving images via the cutout service, which is currently being integrated into the CASDA server. While this study demonstrates the model’s application using the first-year data from the EMU survey, future efforts should incorporate observations from the ongoing survey in the coming years. In addition, incorporating more multiwavelength datasets will help refine the classification of rare radio sources, improving the model’s accuracy and applicability. The current approach relies on RG-CAT catalogues, which in turn are derived from Selavy-based catalogues. Consequently, sources missed by Selavy–such as very faint objects–are also absent from our results. Future research should explore catalogue-agnostic approaches to mitigate this limitation. Furthermore, with the increasing availability of open-source pre-trained models, whether trained on astronomical or real-world data, future studies should investigate the adoption of newer architectures that may enhance fine-tuning beyond OpenCLIP. By providing an efficient and scalable solution for radio astronomy, this approach paves the way for researchers to explore and classify the ever-growing volume of radio data more effectively, ultimately advancing our understanding of complex radio sources.

Data availability statement

The OpenCLIP model with fine-tuning settings is available at https://github.com/Nikhel1/Finetune_OpenCLIP. The radio source images and labels used for fine-tuning are available at https://doi.org/10.25919/btk3-vx79, while the exact images and expanded text descriptions are available upon request. The search engine is accessible at https://askap-emuse.streamlit.app/ and can also be used locally by cloning the repository and following the steps provided at https://github.com/Nikhel1/EMUSE, i.e., by running the command ‘streamlit run main.py’. The fine-tuned models, EMU survey radio source embeddings, and catalogue metadata are accessible within ‘main.py’.

Acknowledgements

NG acknowledges support from CSIRO’s Machine Learning and Artificial Intelligence Future Science Impossible Without You (MLAI FSP IWY) Platform. This scientific work uses data obtained from Inyarrimanha Ilgari Bundara/the Murchison Radio-astronomy Observatory. We acknowledge the Wajarri Yamaji People as the Traditional Owners and native title holders of the Observatory site. The Australian SKA Pathfinder is part of the Australia Telescope National Facility (https://ror.org/05qajvd42) which is managed by CSIRO. Operation of ASKAP is funded by the Australian Government with support from the National Collaborative Research Infrastructure Strategy. ASKAP uses the resources of the Pawsey Supercomputing Centre. The establishment of ASKAP, the Murchison Radio-astronomy Observatory and the Pawsey Supercomputing Centre are initiatives of the Australian Government, with support from the Government of Western Australia and the Science and Industry Endowment Fund. This paper includes archived data obtained through the CSIRO ASKAP Science Data Archive, CASDA (http://data.csiro.au).

This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, and NEOWISE, which is a project of the Jet Propulsion Laboratory/California Institute of Technology. WISE and NEOWISE are funded by the National Aeronautics and Space Administration.

We acknowledge the use of several open-source Python packages that facilitated this research, including (but not limited to) PyTorch (Paszke et al. Reference Paszke2017), scikit-learn (Pedregosa et al. Reference Pedregosa2011), pandas (McKinney Reference McKinney2010), and Astropy (Astropy Collaboration et al. Reference Collaboration2013, Reference Collaboration2018, Reference Collaboration2022).

Appendix

Table A1. Top-50 EMUSE output for text query, ‘A bent-tailed radio galaxy’.

Figure A1. Top-50 EMUSE output for the text query, ‘A bent-tailed radio galaxy’. Positions in Table A1 are used here for $5^{\prime}\times5^{\prime}$ cutout images with radio-radio-infrared (RGB) channels.

Table A2. Top-50 EMUSE output for text query, ‘Resolved star forming radio galaxy’.

Figure A2. Top-50 EMUSE output for the text query, ‘Resolved star forming radio galaxy’. Positions in Table A2 are used here for $5^{\prime}\times5^{\prime}$ cutout images with radio-radio-infrared channels.

Table A3. Top-50 EMUSE output for image query shown on the left panel of Figure 4.

Figure A3. Top-50 EMUSE output for image query shown on the left panel of Figure 4. Positions in Table A3 are used here for $5^{\prime}\times5^{\prime}$ cutout images with radio-radio-infrared channels.

Table A4. Top-50 EMUSE output for image query shown on the right panel of Figure 4.

Table A5. Examples of the expanded text descriptions for the main radio source classes. These, along with similar variations based on subcategories and special features, are used to fine-tune the OpenCLIP model.

Figure A4. Top-50 EMUSE output for image query shown on the right panel of Figure 4. Positions in Table A4 are used here for $5^{\prime}\times5^{\prime}$ cutout images with radio-radio-infrared channels.

Footnotes

a An approach where a model is trained to recognise or classify objects, concepts, or tasks it has never seen during training.

e Implementation available at: https://github.com/Nikhel1/wise_mosaics.

References

Alayrac, J.-B., et al. 2022, Advances in Neural Information Processing Systems, 35, 23716
Alegre, L., et al. 2022, MNRAS, 516, 4716
Astropy Collaboration, et al. 2013, A&A, 558, A33
Astropy Collaboration, et al. 2018, AJ, 156, 123
Astropy Collaboration, et al. 2022, ApJ, 935, 167
Bachmann, R., Mizrahi, D., Atanov, A., & Zamir, A. 2022, in European Conference on Computer Vision (Springer), 348
Bommasani, R., et al. 2021, arXiv preprint arXiv:2108.07258
Brown, T., et al. 2020, Advances in Neural Information Processing Systems, 33, 1877
Chen, L., et al. 2024, in European Conference on Computer Vision (Springer), 370
Cherti, M., et al. 2023, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2818
Cutri, R. M., et al. 2021, VizieR Online Data Catalog, II/328
DeBoer, D. R., et al. 2009, IEEE Proc., 97, 1507
Fan, L., Krishnan, D., Isola, P., Katabi, D., & Tian, Y. 2023, Advances in Neural Information Processing Systems, 36, 35544
Fanaroff, B. L., & Riley, J. M. 1974, MNRAS, 167, 31P
Fayou, S., Ngo, H. C., Sek, Y. W., & Meng, Z. 2024, SciR, 14, 11879
Gupta, N., Hayder, Z., Norris, R. P., Huynh, M., & Petersson, L. 2024a, PASA, 41, e001
Gupta, N., Hayder, Z., Norris, R. P., Huynh, M., & Petersson, L. 2023, NeurIPS ML4PS 2023, arXiv:2312.06728
Gupta, N., et al. 2022, PASA, 39, e051
Gupta, N., et al. 2023, PASA, 40, e044
Gupta, N., et al. 2024b, PASA, 41, e027
Gupta, N., et al. 2025, arXiv e-prints, arXiv:2506.08439
Hopkins, A. M., et al. 2025, PASA, 1–32
Hotan, A. W., et al. 2021, PASA, 38, e009
Jia, C., et al. 2021, in International Conference on Machine Learning, PMLR, 4904
Johnston, S., et al. 2007, PASA, 24, 174
Kim, H., et al. 2024, arXiv e-prints, arXiv:2402.15120
Lao, B., et al. 2025, arXiv e-prints, arXiv:2501.09883
Lastufka, E., et al. 2024, A&A, 690, A310
Lochner, M., & Rudnick, L. 2025, AJ, 169, 121
Lochner, M., Rudnick, L., Heywood, I., Knowles, K., & Shabala, S. S. 2023, MNRAS, 520, 1439
Manzoor, M. A., et al. 2023, ACM TMCCA, 20, 1
Marocco, F., et al. 2021, ApJS, 253, 8
McInnes, L., Healy, J., & Melville, J. 2018, arXiv e-prints, arXiv:1802.03426
McKinney, W. 2010, Proceedings of the 9th Python in Science Conference, 445, 51
Mohale, K., & Lochner, M. 2024, MNRAS, 530, 1274
Mostert, R. I. J., et al. 2021, A&A, 645, A89
Mostert, R. I. J., et al. 2024, A&A, 691, A185
Nguyen, T., Gadre, S. Y., Ilharco, G., Oh, S., & Schmidt, L. 2023, Advances in Neural Information Processing Systems, 36, 22047
Norris, R. P., et al. 2025, MNRAS, 537, L42
Norris, R. P., et al. 2021a, PASA, 38, e046
Norris, R. P., et al. 2021b, PASA, 38, e003
Parker, L., et al. 2024, MNRAS, 531, 4990
Paszke, A., et al. 2017, in NIPS-W
Pedregosa, F., et al. 2011, JMLR, 12, 2825
Radford, A., et al. 2021, in International Conference on Machine Learning, PMLR, 8748
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. 2022, arXiv preprint arXiv:2204.06125
Riggi, S., et al. 2025, arXiv e-prints, arXiv:2503.23859
Riggi, S., et al. 2024, PASA, 41, e085
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. 2022, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684
Schuhmann, C., et al. 2022, Advances in Neural Information Processing Systems, 35, 25278
Segal, G., et al. 2023, MNRAS, 521, 1429
Slijepcevic, I. V., et al. 2024, RASTI, 3, 19
Tanoglidis, D., & Jain, B. 2024, RNAAS, 8, 265
Team, G., et al. 2023, arXiv preprint arXiv:2312.11805
Touvron, H., et al. 2023, arXiv preprint arXiv:2307.09288
Walmsley, M., et al. 2022, MNRAS, 513, 1581
Wang, Z., et al. 2021, arXiv preprint arXiv:2108.10904
Whiting, M., & Humphreys, B. 2012, PASA, 29, 371
Wright, E. L., et al. 2010, AJ, 140, 1868
Yu, J., et al. 2022, arXiv preprint arXiv:2205.01917
Yu, Q., et al. 2024, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14022