
Detecting Formatted Text: Data Collection Using Computer Vision

Published online by Cambridge University Press:  31 July 2025

Jonathan Colner*
Affiliation:
Center for Data Science, New York University, New York, NY, USA; Center for Urban Research, City University of New York, Graduate Center, New York, NY, USA

Abstract

Research in political science has begun to explore how to use large language and object detection models to analyze text and visual data. However, few studies have explored how to use these tools for data extraction. Instead, researchers interested in extracting text from poorly formatted sources typically rely on optical character recognition and regular expressions, or extract each item by hand. This letter describes a workflow for structured text extraction using free models and software. I discuss the type of data best suited to this method, its usefulness within political science, and the steps required to convert the text into a usable dataset. Finally, I demonstrate the method by extracting agenda items from city council meeting minutes. I find that the method can accurately extract subsections of text from a document and requires only a few hand-labeled documents for adequate training.

Information

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1. The workflow used to identify and extract agenda items from city council meeting records. In step 1, I use Label Studio to annotate a training set of meeting minutes, marking the segments of each page that contain agenda items. In step 2, I train an object detection model that identifies segments with matching formatting on a different page. In step 3, I use LayoutParser to extract text from those segments with OCR. In this example, both pages are from the city of Chula Vista’s June 10, 2014 meeting.
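The hand-off between steps 1 and 2 can be illustrated with a minimal sketch. It assumes the annotations were exported from Label Studio in its JSON format, where rectangle coordinates are stored as percentages of the page image size, and it converts them to pixel boxes of the kind an object detection trainer typically expects. The function name `label_studio_to_pixel_boxes` and the label `"agenda_item"` are hypothetical, chosen for illustration; this is not the article's own code.

```python
from typing import Dict, List


def label_studio_to_pixel_boxes(task: Dict) -> List[Dict]:
    """Convert Label Studio rectangle results (percent units) to pixel boxes.

    Assumes the Label Studio JSON export layout: each result carries
    `original_width`, `original_height`, and a `value` dict whose
    x, y, width, and height are percentages of the image dimensions.
    """
    boxes = []
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            # Skip anything that is not a labeled rectangle.
            if result.get("type") != "rectanglelabels":
                continue
            value = result["value"]
            img_w = result["original_width"]
            img_h = result["original_height"]
            # Percent -> pixel conversion for the top-left corner and size.
            x1 = value["x"] / 100.0 * img_w
            y1 = value["y"] / 100.0 * img_h
            x2 = x1 + value["width"] / 100.0 * img_w
            y2 = y1 + value["height"] / 100.0 * img_h
            boxes.append(
                {
                    "label": value["rectanglelabels"][0],
                    "bbox": (round(x1), round(y1), round(x2), round(y2)),
                }
            )
    return boxes


# Hypothetical single-page task, shaped like one entry of a JSON export.
example_task = {
    "annotations": [
        {
            "result": [
                {
                    "type": "rectanglelabels",
                    "original_width": 1000,
                    "original_height": 2000,
                    "value": {
                        "x": 10.0,
                        "y": 25.0,
                        "width": 50.0,
                        "height": 10.0,
                        "rectanglelabels": ["agenda_item"],
                    },
                }
            ]
        }
    ]
}

pixel_boxes = label_studio_to_pixel_boxes(example_task)
```

The resulting pixel boxes would still need to be written out in whatever annotation format the chosen detection framework reads; that step depends on the framework and is left out here.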


Figure 2. Compares the number of agenda items identified in meeting minutes with the number listed on Legistar for the same date and with the number identified by hand coding agenda items from the meeting records.


Table 1 Similarity between matched agenda items.
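This page does not state which similarity measure Table 1 reports. As a rough illustration of how extracted agenda items might be compared with reference items, the sketch below pairs each extracted item with its closest reference item using the character-level ratio from Python's standard `difflib` module. That metric is a stand-in chosen for illustration, not necessarily the one used in the article, and the helper names `normalize` and `match_items` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR spacing noise matters less."""
    return " ".join(text.lower().split())


def match_items(
    extracted: List[str], reference: List[str]
) -> List[Tuple[str, str, float]]:
    """Pair each extracted item with its most similar reference item.

    Returns (extracted item, best reference item, similarity in [0, 1]).
    Similarity is difflib's ratio, an assumed stand-in metric.
    """
    matches = []
    for item in extracted:
        best_ref, best_score = "", 0.0
        for ref in reference:
            score = SequenceMatcher(None, normalize(item), normalize(ref)).ratio()
            if score > best_score:
                best_ref, best_score = ref, score
        matches.append((item, best_ref, best_score))
    return matches


# Hypothetical items for illustration only.
extracted_items = ["Approval of   MINUTES", "Public hearng on zoning"]
reference_items = ["Approval of minutes", "Public hearing on zoning"]
matched = match_items(extracted_items, reference_items)
```

A greedy best-match like this allows two extracted items to map to the same reference item; a one-to-one assignment would need an extra step.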

Supplementary material: File

Colner supplementary material
