Abstract
This poster presents a work-in-progress pipeline for Information Extraction and Text Classification from large, untranscribed manuscript collections. Our three-stage approach begins with collaborative Handwritten Text Recognition (HTR) for digitisation, continues with Normalisation and Segmentation to create searchable collections, and concludes with Text Classification and Information Extraction to identify Tibetan 'Pagan' religious texts concealed amongst Buddhist and Bön works. Because of the complexity of the materials (abbreviations, misspellings, niche terminology, etc.), Normalisation is not a trivial task. We take a semi-supervised approach: part of the corpus is manually checked twice, and the rest is processed automatically with optimised Normalisation rules. This yields searchable eTexts in two formats: diplomatic versions for historical linguists and for philological and palaeographical researchers, and normalised versions for anyone interested in the content of the texts or in downstream NLP tasks such as Information Extraction. To create a content catalogue, we piloted a classifier based on the similarity of text embeddings, visualised through a simple Singular Value Decomposition. This enables systematic text labelling, which facilitates the identification and cataloguing of texts within our large manuscript collection and helps establish which ones are likely to contain 'Pagan' features and are therefore the most relevant for reconstructing the oldest religion of Tibet.
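As a rough illustration of the two eText formats described above, the following minimal sketch keeps the diplomatic transcription untouched and derives a normalised version by applying an ordered list of rewrite rules. Everything in it is an assumption made for illustration: the `EText` class, the `normalise` and `make_etext` functions, and the toy rules (written over placeholder English tokens rather than Tibetan) are hypothetical and are not taken from the project's scripts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EText:
    """Hypothetical container pairing the two output formats."""
    diplomatic: str   # transcription exactly as recognised by HTR
    normalised: str   # transcription after the rewrite rules are applied


# Toy, hypothetical rules as (regex pattern, replacement) pairs.
# They are applied in order; real Normalisation rules would target
# Tibetan abbreviations and spelling variants instead.
TOY_RULES = [
    (r"\babbrv\b", "abbreviation"),      # expand a placeholder abbreviation
    (r"\bmispeling\b", "misspelling"),   # correct a placeholder misspelling
    (r"\s+", " "),                       # collapse runs of whitespace
]


def normalise(text, rules=TOY_RULES):
    """Apply each rule in sequence and trim surrounding whitespace."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


def make_etext(transcription, rules=TOY_RULES):
    """Keep the diplomatic text as-is and derive the normalised text."""
    return EText(diplomatic=transcription,
                 normalised=normalise(transcription, rules))


etext = make_etext("an  abbrv and a mispeling")
print(etext.diplomatic)
print(etext.normalised)
```

Keeping both strings side by side means a search hit in the normalised text can always be traced back to the diplomatic reading it came from.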
Supplementary weblinks
Title
PaganTibet supplementary materials
Description
Scripts for pre- and postprocessing handwritten Tibetan texts to create diplomatic and normalised eTexts.
Title
Zenodo community for the PaganTibet project
Description
Repository for HTR and other resources for the PaganTibet project, such as Ground Truth and training materials.