Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Published online by Cambridge University Press: 28 February 2023

Alejandro Beltran*
Affiliation:
The Alan Turing Institute, London, United Kingdom
*Corresponding author. E-mail: abeltran@turing.ac.uk

Abstract

Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
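The last two stages of the pipeline described above (filtering for relevant paragraphs, then extracting monetary values) can be illustrated with a minimal sketch. It is not the paper's method: a keyword filter stands in for the trained classification model, and a regular expression stands in for the trained named entity recognizer. The keyword list, the function names, and the sample paragraphs are all hypothetical, and the OCR step is assumed to have already produced plain text.

```python
import re
from decimal import Decimal
from typing import Iterable, List

# Hypothetical keywords standing in for the trained paragraph classifier.
RELEVANT_KEYWORDS = ("observación", "importe", "irregularidad")

# Matches peso amounts such as "$1,234,567.89" or "$ 950.00".
# A regex stand-in for the named entity recognizer, for illustration only.
AMOUNT_PATTERN = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def is_relevant(paragraph: str) -> bool:
    """Return True if the paragraph contains any of the assumed keywords."""
    text = paragraph.lower()
    return any(keyword in text for keyword in RELEVANT_KEYWORDS)


def extract_amounts(paragraph: str) -> List[Decimal]:
    """Return every monetary amount found in the paragraph as a Decimal."""
    return [
        Decimal(match.replace(",", ""))
        for match in AMOUNT_PATTERN.findall(paragraph)
    ]


def total_discrepancies(paragraphs: Iterable[str]) -> Decimal:
    """Sum the amounts found in paragraphs that pass the relevance filter."""
    total = Decimal("0")
    for paragraph in paragraphs:
        if is_relevant(paragraph):
            total += sum(extract_amounts(paragraph), Decimal("0"))
    return total


# Made-up sample paragraphs, not taken from any audit report.
p1 = "Se detectó una observación por un importe de $1,234,567.89 en el municipio."
p2 = "El presupuesto aprobado fue de $500,000.00."
p3 = "Irregularidad por $ 950.00 y por $2,000.10."

print(total_discrepancies([p1, p2, p3]))
```

Amounts are held as `Decimal` rather than `float` so that summed currency values are not affected by binary rounding. A trained classifier and recognizer, as the abstract describes, would replace `is_relevant` and `extract_amounts` while the aggregation step stays the same.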

Information

Type
Commentary
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Table 1. Number of ASEs that publish historical audits

Figure 1. Boilerplate paragraph example.

Figure 2. NER annotation in Prodigy.

Figure 3. NER identification in paragraphs.

Table 2. Total value of annual discrepancies in MXN$

Table 3. OLS Model 1: DV is total discrepancies per municipality

Table 4. OLS Model 2: DV is $\text{Discrepancies}_t$

Figure 4. Marginal effects of Model 2.
