A Brief History of PhiloLogic
Concordances are the oldest, simplest, and in many ways the most powerful tool for navigating and exploring texts.Footnote 1 Concordances provided the opportunity to compare the different uses of the same word in the context of surrounding words within a coherent set of texts. Centuries later, this same tool was at the origin of the digital humanities: in 1951, a Jesuit researcher, Roberto Busa, undertook to create a complete concordance of the entire work of Thomas Aquinas with the help of IBM, leading, decades later, to the publication of the Index Thomisticus in fifty-six volumes.Footnote 2 Based on the recognized power of this approach to text analysis, the ARTFL Project began in the late 1980s and early 1990s to work on PhiloLogic, an open-source full-text search, retrieval, and analysis system powered by concordances. Originally developed to serve the needs of the ARTFL ProjectFootnote 3 and its large user base, in particular the exploration of the “ARTFL-Frantext”Footnote 4 and “ARTFL-Encyclopédie”Footnote 5 collections, PhiloLogic has been extended to handle an ever-increasing array of collections, text encodings, and languages.Footnote 6 The ARTFL Project and the Textual Optics LabFootnote 7 rely on this software not only to deliver many different types of text collections and dictionaries to scholars, but also to host databases for collaborative projects such as the “Opera del Vocabolario Italiano,”Footnote 8 the “History of Black Writing,”Footnote 9 and the “Shanghai Library Republican Journal corpus.”Footnote 10 As mass digitization efforts such as HathiTrust, Google Books, the Internet Archive, and the Digital Public Library of America transform the conditions (and stakes) of humanities text research, PhiloLogic is designed to provide rigorous analysis and discovery tools for these larger and more complex collections while accommodating the growing demand for new kinds of text analysis and visualization of results.
A considerable amount of work has been devoted to designing a user-friendly interface, most notably by making all of the functionality of the software easily accessible. In many ways, PhiloLogic's user experience is meant to provide a workspace for the researcher, where hypotheses can be formulated and verified with just a few clicks, guiding scholars deeper and deeper into the collections as they leverage the frequency distributions of results displayed by the faceted browser, to finally arrive at the actual text. The seamless navigation between corpus-level queries and the text itself is an essential aspect of PhiloLogic: it gives the researcher the opportunity to change reading scales, from a more distant perspective appropriate for finding general patterns to the close reading experience, which is in turn enriched by the information gathered from the corpus-level queries.
PhiloLogic also provides a number of APIs and data outputs for further text-mining experiments. In particular, the TextPAIR sequence alignment tool can leverage the PhiloLogic index, text structure, and metadata gathered during the parsing stage to conduct a document-to-document comparison that uncovers text reuse within and between PhiloLogic databases. Future work on PhiloLogic will involve a closer integration of the results of such data-mining tasks into the main search and navigation interface, thereby allowing researchers to gain new insights into an ever-growing array of text collections.
PhiloLogic Concordance Search, KWIC, Collocations, and Time Series
The ability to perform targeted searches is a key component of any digital textual data repository. The PhiloLogic Concordance functionFootnote 11 is designed to search the corpus by harnessing the power of regular expression syntax, including full support for wildcards and complex sequences of search terms. Beyond the concordance search function, the use of facets, drawn from the metadata that accompanies each TEI-encoded document in the source corpus, provides an on-the-fly means to further constrain and explore the output; this becomes particularly valuable when thousands or millions of results are returned. Metadata categories within the facet browser are sorted by default, with the largest frequencies (both absolute and relative to the total size of each document) presented at the top of the list. Common facets include title, date, author, and section type, but any metadata tags can be employed as facets and thereby provide users with a highly individualized and configurable experience while browsing through search results.
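The interplay of regular-expression search and facets can be illustrated with a minimal sketch. It is not PhiloLogic's implementation or API: the toy `DOCS` list, the `facet_counts` function, and the choice to normalize per 1,000 words within each facet value are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real database would draw these fields from TEI headers.
DOCS = [
    {"author": "Diderot", "date": 1751, "text": "la raison et la liberté de la raison"},
    {"author": "Voltaire", "date": 1764, "text": "la liberté est un bien"},
    {"author": "Diderot", "date": 1765, "text": "les libertés publiques"},
]

def facet_counts(docs, pattern, facet):
    """Count regex hits per facet value, as absolute counts and per 1,000 words."""
    regex = re.compile(pattern)
    hits, words = Counter(), Counter()
    for doc in docs:
        tokens = doc["text"].split()
        words[doc[facet]] += len(tokens)
        # A token counts as a hit only if the whole token matches the pattern.
        hits[doc[facet]] += sum(1 for t in tokens if regex.fullmatch(t))
    rows = [(value, hits[value], 1000 * hits[value] / words[value])
            for value in hits if hits[value]]
    # Largest absolute frequency first, as in a facet browser.
    return sorted(rows, key=lambda row: row[1], reverse=True)

result = facet_counts(DOCS, r"libert.*", "author")
```

Here the wildcard pattern `libert.*` catches both "liberté" and "libertés", and passing `"date"` instead of `"author"` would regroup the same hits by year.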
By default, each search term returned is highlighted within its specific textual context of twenty-five words or graphs on each side; clicking on “View occurrences by line (KWIC, Key Word In Context)” changes the output to a vertical display with the primary search term in the center and its context on either side. This display gives the user the ability to quickly scan through and sort by the words directly preceding and following the highlighted search term, showing which sequences of words and phrases containing the search terms occur most frequently. Clicking on “Export results” in the upper right opens a new window with a Unicode-encoded JSON object containing the complete data for the concordance produced by the search.
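A keyword-in-context display of this kind can be sketched in a few lines. The functions `kwic` and `sort_by_following` are hypothetical names, not part of PhiloLogic, and the sketch assumes the text has already been split into tokens (words, or single graphs for Chinese).

```python
def kwic(tokens, keyword, width=25):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), token, " ".join(right)))
    return lines

def sort_by_following(lines):
    """Order KWIC lines alphabetically by the words that follow the keyword."""
    return sorted(lines, key=lambda line: line[2])

tokens = "the king spoke and the queen answered the king".split()
lines = sort_by_following(kwic(tokens, "the", width=2))
```

Sorting on the right-hand context groups recurring continuations together, which is what makes frequent phrases visible when scanning down the column.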
Beyond the concordance functions, PhiloLogic provides two additional algorithmically generated types of results: collocations and time series. The Collocations function generates a word cloud visualization of the top 100 most frequent words (or in the Chinese case, characters) that co-occur in the sentences retrieved; the terms in the word cloud are sized by their relative frequency so that the most frequent collocates are displayed largest. The time series visualization provides a simple horizontal timeline of the works returned from the concordance search, with vertical bars representing the absolute counts or relative frequency of the search terms in each document in the corpus.
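Both result types reduce to simple counting, as the following sketch suggests. It rests on several assumptions that the text does not confirm for PhiloLogic itself: whitespace tokenization, an optional stopword filter, and a time series bucketed by period of `interval` years rather than by individual document. The names `collocates` and `time_series` are illustrative.

```python
from collections import Counter

def collocates(sentences, keyword, stopwords=frozenset(), top=100):
    """Most frequent words co-occurring with `keyword` in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if keyword in tokens:
            counts.update(t for t in tokens
                          if t != keyword and t not in stopwords)
    return counts.most_common(top)

def time_series(docs, keyword, interval=10):
    """Map each period start to (absolute count, relative frequency).

    `docs` is an iterable of (year, text) pairs.
    """
    hits, totals = Counter(), Counter()
    for year, text in docs:
        tokens = text.lower().split()
        period = year - year % interval
        hits[period] += tokens.count(keyword)
        totals[period] += len(tokens)
    return {p: (hits[p], hits[p] / totals[p]) for p in sorted(totals)}

cloud = collocates(
    ["the law of nature", "nature and reason", "reason alone"],
    "nature",
    stopwords={"the", "of", "and"},
)
series = time_series(
    [(1751, "nature nature law"), (1758, "law"), (1764, "nature")],
    "nature",
)
```

The counts returned by `collocates` would drive the word sizes in a cloud, while each entry in `series` corresponds to one bar of the timeline in either absolute or relative mode.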

Figure 1. Screenshot of Concordance Search in PhiloLogic
TextPAIR: Large-Scale Intertextual Analysis (Sequence Alignment)
Algorithmically based intertextual analysis is an increasingly popular analytical method through which all exact or similar textual phrases or passages in a text or corpus can be detected using a simple matching algorithm that relies on an n-gram representation of the text with a sliding window for evaluation.Footnote 12 For a corpus like the Twenty-four Chinese Histories, algorithmic intertextual analysis allows the user to efficiently assemble and review graph-by-graph similarities and differences (changes, interpolations and deletions) between passages of any length greater than a sentence in each source.Footnote 13 As with the concordance functions, full corpus-level intertextual results can also be further constrained by facets, allowing a user to quickly assemble and review all passages shared between two or more sources, authors, or any other criteria included in the metadata.Footnote 14 These results can also be constrained by passage, title, passage length, or time period.
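The n-gram idea behind this kind of matching can be sketched as follows. This is a deliberately simplified illustration, not TextPAIR's algorithm: it finds only exact contiguous runs of shared n-grams, whereas a production aligner would also tolerate gaps and variants. The function names and the `min_ngrams` threshold are assumptions.

```python
def ngrams(tokens, n):
    """All overlapping n-grams of a token list, produced by a sliding window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_passages(source, target, n=3, min_ngrams=2):
    """Return (source_start, target_start, length_in_tokens) for each exact run."""
    source_grams = ngrams(source, n)
    target_grams = ngrams(target, n)
    index = {}
    for i, gram in enumerate(source_grams):
        index.setdefault(gram, []).append(i)

    runs, seen = [], set()
    for j, gram in enumerate(target_grams):
        for i in index.get(gram, []):
            if (i, j) in seen:
                continue  # already covered by a run found earlier
            k = 0
            # Extend the match while consecutive n-grams stay identical.
            while (i + k < len(source_grams) and j + k < len(target_grams)
                   and source_grams[i + k] == target_grams[j + k]):
                seen.add((i + k, j + k))
                k += 1
            if k >= min_ngrams:
                runs.append((i, j, k + n - 1))
    return runs

runs = shared_passages("a b c d e f".split(), "x b c d e y".split())
```

For a Chinese corpus the same functions can be applied at the level of graphs by passing `list(text)` instead of a list of words.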
In PhiloLogic's TextPAIR framework, under each pair of similar passages, clicking on “Show differences” provides the at-a-glance results of the similarity matrix mapped onto each source, with identical sections in light blue type, interpolations in dark blue boldface type, and deletions marked in boldface green type with strikethrough. Figure 2 shows a passage from the twenty-ninth chapter of the Records of the Historian (Shi ji 史記) compared with its counterpart from the twenty-ninth chapter of the Book of Han (Han shu 漢書).Footnote 15 It bears noting that this result does not necessarily indicate a direct linear connection between the two passages being compared (à la the stemma codicum model in redaction criticism); all that is necessary is that the pair meet the mathematical threshold of a minimum number of shared lexical units (words or graphs) in sequence. This type of intertextual analysis therefore equally indicates citations, allusions, and paraphrased portions in any combination of sources from any time period, genre, or style.
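A difference display of this sort can be approximated with Python's standard `difflib` module. The sketch below is an assumption about how such labels might be derived, not a description of how TextPAIR computes them; the function name `mark_differences` and the three labels are illustrative stand-ins for the color coding.

```python
from difflib import SequenceMatcher

def mark_differences(source, target):
    """Label spans as identical, deleted (source only), or interpolated (target only)."""
    spans = []
    # autojunk=False keeps frequent graphs from being ignored in long passages.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            spans.append(("identical", source[i1:i2]))
            continue
        # "replace" yields both a deletion and an interpolation.
        if i2 > i1:
            spans.append(("deleted", source[i1:i2]))
        if j2 > j1:
            spans.append(("interpolated", target[j1:j2]))
    return spans

spans = mark_differences("abcdef", "abXdefg")
```

Because strings are compared character by character, the same call works graph by graph on Chinese passages; a renderer would then map each label to its typeface and color.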

Figure 2. Screenshot of the TextPAIR Sequence Alignment in PhiloLogic