
Retrieving information from unstructured historical sources using large language models

Published online by Cambridge University Press: 02 December 2025

Spencer Dean Stewart*
Affiliation:
Libraries and School of Information Studies, Purdue University, West Lafayette, USA
Sanskriti Sinha
Affiliation:
Department of Statistics, Purdue University, West Lafayette, USA
*Corresponding author: Spencer Dean Stewart; Email: stewa443@purdue.edu

Abstract

The volume of historical data locked in unstructured formats has long been a challenge for researchers in the computational humanities. While optical character recognition (OCR) and natural language processing have enabled large-scale text mining projects, irregular formatting, inconsistent terminology and evolving printing practices complicate automated parsing and information extraction for historical documents. This study explores the potential of large language models (LLMs) for processing and structuring irregular and non-standardized historical materials, using the U.S. Department of Agriculture’s Plant Inventory books (1898–2008) as a test case. Because these records changed frequently over time, we implemented a pipeline combining OCR, custom segmentation rules and LLMs to extract structured data from the scanned texts. The study provides an example of how incorporating LLMs into data-processing pipelines can enhance the accessibility and usability of historical and archival materials for scholars.
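The pipeline described in the abstract (OCR text, then rule-based segmentation, then LLM extraction) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the entry-marker pattern, the field names (inventory_number, taxon, origin) and the call_llm callable are all hypothetical assumptions, and the abstract does not specify which rules, prompts or models were used.

```python
import json
import re
from typing import Callable

# ASSUMPTION: each entry begins at the start of a line with a 4-6 digit
# inventory number followed by a period. The real segmentation rules are
# not given in the abstract and likely vary across volumes.
ENTRY_START = re.compile(r"^(\d{4,6})\.\s", re.MULTILINE)


def segment_entries(ocr_text: str) -> list[str]:
    """Split OCR text into one chunk per entry, cutting at each entry marker."""
    starts = [m.start() for m in ENTRY_START.finditer(ocr_text)]
    ends = starts[1:] + [len(ocr_text)]
    return [ocr_text[s:e].strip() for s, e in zip(starts, ends)]


def build_prompt(entry: str) -> str:
    """Build an extraction prompt; the field names are hypothetical."""
    return (
        "Extract the fields from the plant inventory entry below and reply "
        "with a single JSON object using the keys "
        '"inventory_number", "taxon" and "origin".\n\nEntry:\n' + entry
    )


def extract_records(ocr_text: str, call_llm: Callable[[str], str]) -> list[dict]:
    """Run every segmented entry through call_llm and parse the JSON reply.

    call_llm is a hypothetical stand-in for whichever model client is used;
    it takes a prompt string and returns the model's text reply.
    """
    records = []
    for entry in segment_entries(ocr_text):
        reply = call_llm(build_prompt(entry))
        try:
            records.append(json.loads(reply))
        except json.JSONDecodeError:
            # Keep the raw entry so unparseable replies can be reviewed by hand.
            records.append({"raw_entry": entry, "error": "unparseable reply"})
    return records
```

Passing the model client in as a plain callable keeps the segmentation and parsing steps testable without network access and makes it straightforward to swap models when comparing accuracy, speed and cost.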

Information

Type
Short Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Page examples from Plant Inventory booklets.

Figure 2. Pipeline for structured data extraction using OCR and LLM prompting.

Table 1. Comparison of large language models on extraction accuracy, speed and cost
