See, for example, the “Collaborators’ Bill of Rights” and the “Student Collaborators’ Bill of Rights” for important efforts to lay out foundational principles for equitable collaboration: Tanya Clement and Doug Reside, “Off the Tracks: Laying New Lines for Digital Humanities Scholars,” Media Commons Press, accessed 15 September 2017, http://mcpress.media-commons.org/offthetracks/part-one-models-for-collaboration-career-paths-acquiring-institutional-support-and-transformation-in-the-field/a-collaboration/collaborators%E2%80%99-bill-of-rights/; Haley Di Pressi, Stephanie Gorman, Miriam Posner, Raphael Sasayama, and Tori Schmitt, with contributions from Roderic Crooks, Megan Driscoll, Amy Earhart, Spencer Keralis, Tiffany Naiman, and Todd Presner, “A Student Collaborators’ Bill of Rights,” UCLA Center for Digital Humanities, accessed 15 September 2017, www.cdh.ucla.edu/news-events/a-student-collaborators-bill-of-rights/.
For more on the OpenITI mARkdown schema, see Maxim Romanov, “OpenITI mARkdown,” al-Raqmiyyat, accessed 15 September 2017, https://alraqmiyyat.github.io/mARkdown/. For more on CTS and specifically CapiTainS, see CapiTainS, accessed 15 September 2017, http://capitains.org/. For more on TEI, see the Text Encoding Initiative, accessed 15 September 2017, http://www.tei-c.org/index.xml.
Traditional OCR approaches work by segmenting page images into lines, then each line into words, and then each word into characters. Because segmentation is extremely error-prone for connected, ligature-rich scripts, performance on the last two steps is consistently poor. Kraken sidesteps the problem of word and character segmentation entirely by employing a form of machine learning called a neural network. Neural networks loosely mimic the way humans learn, enabling Kraken to “learn” from transcriptions (training data) to recognize letters in images of entire lines of text. This approach makes Kraken particularly well suited to the wide variety of ligatures in connected scripts such as Arabic and Persian.
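Line-level recognizers of this kind are commonly trained with a sequence-learning objective (such as Connectionist Temporal Classification, CTC) that lets the network emit a character string for a whole line image without ever locating individual characters. The following toy Python sketch, which is illustrative only and not Kraken’s actual code, shows the decoding idea: the network produces one label per thin vertical slice (“frame”) of the line image, and the decoder collapses repeats and drops a special “blank” symbol to recover the text.

```python
# Toy illustration of greedy CTC-style decoding, the mechanism that lets
# line-based neural OCR output text without first segmenting a line into
# words or characters. Purely illustrative; not Kraken's implementation.

BLANK = "_"  # special "no character" symbol the network may emit per frame

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame network outputs into a character string:
    1. merge consecutive repeats (one character can span many frames),
    2. drop blanks (emitted between characters or between repeats)."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# The network scans the line image (right to left for Arabic) and emits
# one label per frame; a ligature-heavy word needs no character boxes:
frames = ["_", "ك", "ك", "_", "ت", "_", "ا", "ا", "ب", "_"]
print(ctc_greedy_decode(frames))  # → كتاب ("book")
```

Because each character is inferred from the whole line’s visual context rather than from a pre-cut box, connected letterforms and ligatures pose no segmentation problem at this stage.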
Generalized models incorporate script features from multiple typefaces; they are therefore less typeface-specific and better able to handle typefaces for which no dedicated model has been trained.