
Digitizing the Textual Heritage of the Premodern Islamicate World: Principles and Plans

Published online by Cambridge University Press:  31 January 2018

Matthew Thomas Miller
Roshan Institute for Persian Studies, University of Maryland, College Park, Md.
Maxim G. Romanov
Department of History, University of Vienna, Vienna, Austria
Sarah Bowen Savant
Aga Khan University, Institute for the Study of Muslim Civilisations, London


The varied textual traditions of the premodern Islamicate World represent an opportunity and a problem for the Digital Humanities (DH). The opportunity lies in the sheer extent of this textual heritage: if we combine the textual output of premodern Persian and Arabic authors (not to mention Turkish and other less well-represented Islamicate languages), this body of texts constitutes arguably the largest written repository of human culture. Analytical methods developed for other linguistic heritages can be repurposed to make use of this wealth of texts, and efforts are now underway to apply to them a series of computationally enhanced methods that derive from a variety of disciplines (e.g., corpus linguistics, computational linguistics, the social sciences, and statistics). The application of these forms of analysis to these large new corpora promises new insights on premodern Islamicate cultures and the improvement of existing digital tools and methodologies.

Copyright © Cambridge University Press 2018 




1 In alphabetical order.

2 For the guidelines, see See also Todd Presner, “How to Evaluate Digital Scholarship,” Journal of Digital Humanities 1 (2012), accessed 18 September 2017.

3 See, for example, the “Collaborators’ Bill of Rights” and the “Student Collaborators’ Bill of Rights” for important efforts to lay out foundational principles for equitable collaboration: Tanya Clement and Doug Reside, “Off the Tracks: Laying New Lines for Digital Humanities Scholars,” Media Commons Press, accessed 15 September 2017; Haley Di Pressi, Stephanie Gorman, Miriam Posner, Raphael Sasayama, and Tori Schmitt, with contributions from Roderic Crooks, Megan Driscoll, Amy Earhart, Spencer Keralis, Tiffany Naiman, and Todd Presner, “A Student Collaborators’ Bill of Rights,” UCLA Center for Digital Humanities, accessed 15 September 2017.

4 See al-Maktaba al-Shamila, accessed 15 September 2017.

5 See al-Maktaba al-Shiʿiyya, accessed 15 September 2017.

6 See A Digital Corpus for Graeco-Arabic Studies, accessed 15 September 2017.

7 See Arabic Commentaries on the Hippocratic Aphorisms, accessed 15 September 2017.

8 See Ganjoor, accessed 15 September 2017.

9 For more on the OpenITI mARkdown schema, see Maxim Romanov, “OpenITI mARkdown,” al-Raqmiyyat, accessed 15 September 2017. For more on CTS and specifically CapiTainS, see CapiTainS, accessed 15 September 2017. For more on TEI, see Text Encoding Initiative, accessed 15 September 2017.

10 The OpenITI repository is available at, accessed 15 September 2017. For more on OpenITI CTS URNs, see Maxim Romanov, “OpenITI,” al-Raqmiyyat, accessed 15 September 2017.

11 Traditional OCR approaches work by segmenting page images into lines, then each line into words, and then each word into characters. Since segmentation is extremely problematic when it comes to connected, ligature-rich scripts, performance is consistently poor on the last two steps. In contrast to this approach, Kraken completely eliminates the issue of word/character segmentation by instead employing a form of machine learning called a neural network. Neural networks mimic the way we learn, enabling Kraken to “learn” from transcriptions (training data) to recognize letters in the images of entire lines of text. This new approach to OCR makes Kraken uniquely able to handle the wide variety of ligatures in connected scripts such as Arabic and Persian.

12 Benjamin Kiessling, Matthew Thomas Miller, Maxim Romanov, and Sarah Bowen Savant, “Important New Developments in Arabographic Optical Character Recognition (OCR),” al-ʿUsur al-Wusta, accessed 20 November 2017.

13 Generalized models incorporate script features from multiple typefaces and thus are less typeface specific and better able to handle typefaces for which we have not trained a specific model.