Towards efficient and accessible geoparsing of U.K. local media: A benchmark dataset and LLM-based approach

Published online by Cambridge University Press: 09 October 2025

Simona Bisiani*
Affiliation:
Institute for People-Centred AI, University of Surrey, Stag Hill, Guildford, UK
Agnes Gulyas
Affiliation:
School of Creative Arts and Industries, Canterbury Christ Church University, Canterbury, UK
Bahareh Heravi
Affiliation:
Institute for People-Centred AI, University of Surrey, Stag Hill, Guildford, UK
*Corresponding author: Simona Bisiani; Email: s.bisiani@surrey.ac.uk

Abstract

Location mentions in local news are crucial for examining issues like spatial inequalities, news deserts and the impact of media ownership on news diversity. However, while geoparsing – extracting and resolving location mentions – has advanced through statistical and deep learning methods, its use in local media studies remains limited and fragmented due to technical challenges and a lack of practical frameworks. To address these challenges, we identify key considerations for successful geoparsing and review spatially oriented local media studies, finding over-reliance on limited geospatial vocabularies, limited toponym disambiguation and inadequate validation of methods. These findings underscore the need for adaptable and robust solutions, and recent advancements in fine-tuned large language models (LLMs) for geoparsing offer a promising direction by simplifying technical implementation and excelling at understanding contextual nuances. However, their application to U.K. local media – marked by fine-grained geographies and colloquial place names – remains underexplored due to the absence of benchmark datasets. This gap hinders researchers’ ability to evaluate and refine geoparsing methods for this domain. To address this, we introduce the Local Media UK Geoparsing (LMUK-Geo) dataset, a hand-annotated corpus of U.K. local news articles designed to support the development and evaluation of geoparsing pipelines. We also propose an LLM-driven approach for toponym disambiguation that replaces fine-tuning with accessible prompt engineering. Using LMUK-Geo, we benchmark our approach against a fine-tuned method. Both perform well on the novel dataset: the fine-tuned model excels in minimising coordinate-error distances, while the prompt-based method offers a scalable alternative for district-level classification, particularly when relying on predictions agreed upon by multiple models. 
Our contributions establish a foundation for geoparsing local media, advancing methodological frameworks and practical tools to enable systematic and comparative research.
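
The two evaluation ideas named in the abstract, coordinate-error distance and keeping only district predictions on which multiple models agree, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_votes` threshold, the tie-handling rule and the example district labels are all assumptions made here for illustration.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0088  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def agreed_district(predictions, min_votes=2):
    """Return the district predicted by at least `min_votes` models.

    Hypothetical agreement rule (an assumption, not taken from the paper):
    missing predictions (None) are ignored, and a tie for first place
    counts as no agreement.
    """
    votes = Counter(p for p in predictions if p is not None)
    ranked = votes.most_common(2)
    if not ranked:
        return None
    top_label, top_count = ranked[0]
    if top_count < min_votes:
        return None
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between districts: treat as disagreement
    return top_label


# Error distance between a predicted and a gold coordinate (illustrative values).
error_km = haversine_km(51.0, 0.0, 52.0, 0.0)

# District labels from three hypothetical models for one toponym.
district = agreed_district(["Guildford", "Guildford", "Woking"])
```

Under this rule, a toponym for which the models disagree yields `None` and would be left unclassified rather than assigned a low-confidence district.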

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Methodological framework for evaluating and implementing geoparsing in local news studies.

Table 1. Summary of studies investigating geographic content in local media: Objectives and methodological approaches

Figure 2. Overview of candidate list: (a) distribution across documents based on the number of candidates per toponym, (b) candidate source distribution, and (c) number of articles categorised by the number of districts.

Figure 3. Overview of procedure for creating the dataset.

Figure 4. Schematic representation of the proposed LLM prompt-disambiguation approach.

Table 2. Descriptive statistics of the novel dataset LMUK-Geo

Figure 5. Performance of the LLMs across prompts and metadata configurations. Each dot represents a temperature.

Table 3. Evaluation results on LMUK-Geo with different handling of missing coordinates

Supplementary material: File

Bisiani et al. supplementary material (File, 72.4 KB)