A multi-stage approach to information extraction and text classification in large untranscribed manuscript collections

01 December 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

This poster presents a work-in-progress pipeline for Information Extraction and Text Classification in large, untranscribed manuscript collections. Our three-stage approach begins with collaborative Handwritten Text Recognition (HTR) for digitisation, continues with Normalisation and Segmentation to create searchable collections, and concludes with Text Classification and Information Extraction to identify Tibetan 'Pagan' religious texts concealed amongst Buddhist and Bön works. Because of the complexity of the materials (abbreviations, misspellings, niche terminology, etc.), Normalisation is not a trivial task. We take a semi-supervised approach: part of the corpus is manually checked twice, and the rest is processed automatically with optimised Normalisation rules. This yields searchable eTexts in two formats: diplomatic versions for historical linguists, philologists and palaeographers, and normalised versions for readers interested in the content of the texts and for downstream NLP tasks such as Information Extraction. To create a content catalogue, we piloted a classifier based on the similarity of text embeddings, visualised through a simple Singular Value Decomposition. This enables systematic text labelling, which facilitates identifying and cataloguing the texts in our large manuscript collection and finding those that are likely to contain 'Pagan' features and are therefore the most relevant for reconstructing the oldest religion of Tibet.
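The abstract does not specify the embedding model or the classifier, so the following is only a minimal sketch of the general idea of labelling texts by embedding similarity and projecting the embeddings to two dimensions with a Singular Value Decomposition. Everything in it is an assumption made for illustration: the hashed character-bigram `embed` function stands in for whatever embedding is actually used, the nearest-centroid rule with cosine similarity stands in for the piloted classifier, and the names `fit_centroids`, `classify` and `project_2d` are hypothetical.

```python
import numpy as np

def embed(text, dim=64):
    """Toy stand-in embedding (assumption): hashed character-bigram counts,
    scaled to unit length. A real pipeline would use a trained model."""
    vec = np.zeros(dim)
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def fit_centroids(texts, labels):
    """Average the embeddings of the manually labelled texts per class."""
    centroids = {}
    for label in sorted(set(labels)):
        vecs = np.stack([embed(t) for t, l in zip(texts, labels) if l == label])
        centroids[label] = vecs.mean(axis=0)
    return centroids

def classify(text, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    v = embed(text)

    def cosine(c):
        denom = np.linalg.norm(v) * np.linalg.norm(c)
        return float(v @ c / denom) if denom > 0 else 0.0

    return max(centroids, key=lambda label: cosine(centroids[label]))

def project_2d(vectors):
    """Centre the embeddings and keep the first two SVD components,
    giving one (x, y) point per text for plotting."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]
```

In use, a handful of manually labelled texts would be passed to `fit_centroids`, each unlabelled text to `classify`, and the stacked embeddings of the whole collection to `project_2d` to see whether the classes separate visually. The nearest-centroid rule is chosen here only because it is the simplest similarity-based classifier, not because the source confirms it.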

Keywords

HTR
Normalisation
Text Classification
Information Extraction
Tibetan
Religious Studies
Large Text Corpora
