
Digitizing the Textual Heritage of the Premodern Islamicate World: Principles and Plans

  • Matthew Thomas Miller, Maxim G. Romanov, and Sarah Bowen Savant


The varied textual traditions of the premodern Islamicate World represent an opportunity and a problem for the Digital Humanities (DH). The opportunity lies in the sheer extent of this textual heritage: if we combine the textual output of premodern Persian and Arabic authors (not to mention Turkish and other less well-represented Islamicate languages), this body of texts constitutes arguably the largest written repository of human culture. Analytical methods developed for other linguistic heritages can be repurposed to make use of this wealth of texts, and efforts are now underway to apply to them a series of computationally enhanced methods that derive from a variety of disciplines (e.g., corpus linguistics, computational linguistics, the social sciences, and statistics). The application of these forms of analysis to these large new corpora promises new insights on premodern Islamicate cultures and the improvement of existing digital tools and methodologies.





1 In alphabetical order.

2 For the guidelines, see See also Presner, Todd, “How to Evaluate Digital Scholarship.” Journal of Digital Humanities 1 (2012), accessed 18 September 2017,

3 See, for example, the “Collaborators’ Bill of Rights” and the “Student Collaborators’ Bill of Rights” for important efforts to lay out foundational principles for equitable collaboration: Tanya Clement and Doug Reside, “Off the Tracks: Laying New Lines for Digital Humanities Scholars,” Media Commons Press, accessed 15 September 2017; Haley Di Pressi, Stephanie Gorman, Miriam Posner, Raphael Sasayama, and Tori Schmitt, with contributions from Roderic Crooks, Megan Driscoll, Amy Earhart, Spencer Keralis, Tiffany Naiman, and Todd Presner, “A Student Collaborators’ Bill of Rights,” UCLA Center for Digital Humanities, accessed 15 September 2017.

4 See al-Maktaba al-Shamila, accessed 15 September 2017.

5 See al-Maktaba al-Shiʿiyya, accessed 15 September 2017.

6 See A Digital Corpus for Graeco-Arabic Studies, accessed 15 September 2017.

7 See Arabic Commentaries on the Hippocratic Aphorisms, accessed 15 September 2017.

8 See Ganjoor, accessed 15 September 2017.

9 For more on the OpenITI mARkdown schema, see Maxim Romanov, “OpenITI mARkdown,” al-Raqmiyyat, accessed 15 September 2017. For more on CTS and specifically CapiTainS, see CapiTainS, accessed 15 September 2017. For more on TEI, see Text Encoding Initiative, accessed 15 September 2017.

10 The OpenITI repository is available at, accessed 15 September 2017. For more on OpenITI CTS URNs, see Maxim Romanov, “OpenITI,” al-Raqmiyyat, accessed 15 September 2017,

11 Traditional OCR approaches work by segmenting page images into lines, then each line into words, and then each word into characters. Since segmentation is extremely problematic when it comes to connected, ligature-rich scripts, performance is consistently poor on the last two steps. In contrast to this approach, Kraken completely eliminates the issue of word/character segmentation by instead employing a form of machine learning called a neural network. Neural networks mimic the way we learn, enabling Kraken to “learn” from transcriptions (training data) to recognize letters in the images of entire lines of text. This new approach to OCR makes Kraken uniquely able to handle the wide variety of ligatures in connected scripts such as Arabic and Persian.

12 Benjamin Kiessling, Matthew Thomas Miller, Maxim Romanov, and Sarah Bowen Savant, “Important New Developments in Arabographic Optical Character Recognition (OCR),” al-ʿUsur al-Wusta, accessed 20 November 2017.

13 Generalized models incorporate script features from multiple typefaces and thus are less typeface specific and better able to handle typefaces for which we have not trained a specific model.
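
Note 11 contrasts character-by-character segmentation with recognition over whole lines. The sketch below illustrates, in a hedged way, how a line-level recognizer can emit characters without ever cutting the line into letters. It assumes a CTC-style output layer, a common design for line-based neural OCR; the notes above do not describe Kraken's internals, and nothing here is Kraken's API. The function name `greedy_ctc_decode`, the `BLANK` index, and the toy scores are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 means "no character at this column"


def greedy_ctc_decode(frame_scores, alphabet):
    """Turn per-column class scores for one text line into a string.

    frame_scores: list of score rows, one row per vertical slice of the line image.
    alphabet: dict mapping class index -> character (BLANK is not included).
    """
    # Pick the best-scoring class for every slice of the line image.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # Collapse runs of the same class, then drop the blank class.
    collapsed = [label for label, _ in groupby(best)]
    return "".join(alphabet[i] for i in collapsed if i != BLANK)


# Toy input: three classes (blank, "ب", "ا") scored over six slices of a line.
alphabet = {1: "ب", 2: "ا"}
scores = [
    [0.10, 0.80, 0.10],  # slice falls on the first letter
    [0.20, 0.70, 0.10],  # still on the first letter
    [0.90, 0.05, 0.05],  # blank between letters
    [0.10, 0.10, 0.80],  # second letter
    [0.10, 0.10, 0.80],  # still on the second letter
    [0.80, 0.10, 0.10],  # trailing blank
]
decoded = greedy_ctc_decode(scores, alphabet)
```

The decoder never needs to know where one letter ends and the next begins: the network scores every slice of the line, and repeated labels and blanks are merged afterwards. That is why connected, ligature-rich scripts pose less of a problem for this family of methods than for segmentation-based pipelines.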
