
radio-llava: Advancing vision-language models for radio astronomical source analysis

Published online by Cambridge University Press:  26 August 2025

Simone Riggi*
Affiliation:
INAF – Osservatorio Astrofisico di Catania, Catania, Italy
Thomas Cecconello
Affiliation:
INAF – Osservatorio Astrofisico di Catania, Catania, Italy; Department of Electrical, Electronic and Computer Engineering, University of Catania, Catania, Italy
Andrea Pilzer
Affiliation:
NVIDIA AI Technology Center, Bologna, Italy
Simone Palazzo
Affiliation:
Department of Electrical, Electronic and Computer Engineering, University of Catania, Catania, Italy
Nikhel Gupta
Affiliation:
CSIRO Space & Astronomy, Bentley, WA, Australia
Andrew Hopkins
Affiliation:
School of Mathematical and Physical Sciences, 12 Wally’s Walk, Macquarie University, Sydney, NSW, Australia
Corrado Trigilio
Affiliation:
INAF – Osservatorio Astrofisico di Catania, Catania, Italy
Grazia Umana
Affiliation:
INAF – Osservatorio Astrofisico di Catania, Catania, Italy
* Corresponding author: Simone Riggi; Email: simone.riggi@inaf.it.

Abstract

The advent of next-generation radio telescopes is set to transform radio astronomy by producing massive data volumes that challenge traditional processing methods. Deep learning techniques have shown strong potential in automating radio analysis tasks, yet they are often constrained by the limited availability of large annotated datasets. Recent progress in self-supervised learning has led to foundational radio vision models, but adapting them to new tasks typically requires coding expertise, limiting their accessibility to the broader astronomical community. Text-based AI interfaces offer a promising alternative by enabling task-specific queries and example-driven learning. In this context, large language models (LLMs), with their remarkable zero-shot capabilities, are increasingly used in scientific domains. However, deploying large-scale models remains resource-intensive, and there is a growing demand for AI systems that can reason over both visual and textual data in astronomical analysis. This study explores small-scale vision-language models (VLMs) as AI assistants for radio astronomy, combining LLM capabilities with vision transformers. We fine-tuned the LLaVA VLM on a dataset of 59k radio images from multiple surveys, enriched with 38k image-caption pairs from the literature. The fine-tuned models show clear improvements over the base models on radio-specific tasks, achieving $\sim$30% F1-score gains in extended source detection, but they underperform vision-only classifiers and exhibit a $\sim$20% drop on general multimodal tasks. Including caption data and using LoRA fine-tuning improves instruction following and helps recover $\sim$10% accuracy on multimodal benchmarks (e.g., ChartQA/DocVQA). This work lays the foundation for future advancements in radio VLMs, highlighting both their potential and their limitations, such as the need for better multimodal alignment, higher-quality datasets, and mitigation of catastrophic forgetting.
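
As a rough illustration of the fine-tuning strategy summarised above, the sketch below shows how LoRA adapters can be attached to a LLaVA-OneVision model for instruction tuning on radio image question-answer pairs. It is a minimal example assuming the Hugging Face transformers and peft libraries and the public llava-hf/llava-onevision-qwen2-7b-ov-hf checkpoint; the actual radio-llava training code, data loaders, prompt masking, and hyperparameters are not reproduced here.

```python
# Minimal sketch (not the authors' actual training code) of LoRA fine-tuning
# applied to a LLaVA-OneVision checkpoint. Assumes Hugging Face `transformers`
# and `peft` and the public `llava-hf/llava-onevision-qwen2-7b-ov-hf` model;
# the real radio-llava pipeline may differ.
import numpy as np
import torch
from PIL import Image
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed base checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters to modules named after the attention projections;
# only the adapter weights are trained, the base weights stay frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# One instruction-tuning example: a radio image (here a dummy array) paired
# with a question/answer turn, formatted with the model's chat template.
image = Image.fromarray(np.zeros((384, 384), dtype=np.uint8)).convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Does the image contain an extended radio source?"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "Yes"}]},
]
prompt = processor.apply_chat_template(conversation, tokenize=False)

inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# In a real training loop the prompt and image tokens would be masked with
# -100 so the loss is computed only on the assistant's answer.
labels = inputs["input_ids"].clone()
loss = model(**inputs, labels=labels).loss
loss.backward()  # an optimiser step over the LoRA parameters would follow
```

Because the base weights remain frozen under LoRA, this style of adaptation is one way to limit the loss of performance on general multimodal benchmarks that the abstract notes for fully fine-tuned models.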

Information

Type
Research Article
Creative Commons
CC BY-NC-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of the Astronomical Society of Australia
Figure 1. A schematic representation of the LLaVA model architecture.

Figure 2. Classification F1-scores obtained with VLMs of different sizes (0.5B, 2B, 3.1B, 7B, 8B, 72B) in zero-shot mode over the B1–B6 evaluation benchmarks. We report the F1-score for individual classes, as well as the class-averaged F1-score (labelled as ‘AVG’). LLaVA, TinyLLaVA, Qwen2VL, and InternVL models are shown with blue, green, red, and orange histograms, respectively. The OpenAI GPT-4.1 model is shown with black histograms.

Figure 3. Classification F1-scores obtained with the radio-llava model on B1–B6 radio benchmarks, comparing fine-tuning on the Q&A training dataset (blue histograms) and the combined Q&A and caption datasets (orange histograms). For each training set, results are reported for different training strategies (full vs. LoRA fine-tuning) and training depths (shallow vs. deep). Results from the base model are shown as filled red histograms. Results obtained with a fine-tuned vision-only model (siglip-so400m-patch14-384 encoder) are shown as filled black histograms. The class-averaged F1-scores are labelled as ‘AVG’.

Figure 4. Classification accuracy obtained with the radio-llava model on standard non-radio benchmarks (Section 4.1.2), comparing fine-tuning on the Q&A training dataset (blue histograms) and the combined Q&A and caption datasets (orange histograms). For each training set, results are reported for different training strategies (full vs. LoRA fine-tuning) and training depths (shallow vs. deep). Results from the base model are shown as filled red histograms.

Table 1. Summary of fine-tuned models with alternative hyperparameter configurations.

Table A1. The number of images in the radioimg-multilabel dataset that have been assigned each specific label. Multiple labels can be assigned to a single image, as they are not mutually exclusive.

Table C1. User-assistant conversations on sample radio images for base and fine-tuned LLaVA-OneVision models.

Figure C1. A screenshot displaying the Streamlit web application developed for radio-llava demo purposes.

Figure C2. Class-averaged classification F1-scores obtained with the radio-llava model on B1–B6 radio benchmarks, comparing fine-tuning on the Q&A training dataset (black solid histograms, labelled as ‘default’) with model variants (v1–v6, coloured histograms), fine-tuned on the same dataset using alternative parameters (see text). The dashed black histogram represents the standard model evaluated on B1–B6 benchmarks using an alternative prompt.