When the dictionary-publishing house of Chambers closed its lexicographic department in 2009, a sympathetic article by Allan Brown in the London Times lamented the passing of a venerable institution. For Brown, lexicographers are by nature ‘boffinish’ and ‘pedantic’ and, with the demise of Chambers, the world was losing a valued team of ‘white-haired, cardiganed index-carded old duffers’ whose responsibility it was to keep the language safe. Underlying Brown’s article is a set of popular misconceptions about the process and purpose of lexicography, with which most people in the dictionary community will be depressingly familiar. The telltale word here is ‘index-carded’, reflecting an era when lexicographers – in the tradition of James Murray – approached the task of entry-writing by rifling through citation slips. For Brown (and probably most of the general public), fifty-odd years of technological innovation might never have happened. The reality is that computers have had an essential role in lexicography since as long ago as the 1960s.
In this chapter, we review the transformative effects of technology on dictionary-making in four main areas. First, we look at the dictionary itself as a document: once continuous plain text, its content can now be organised as a structured database. Secondly, we review changes in the evidence base for dictionaries, which has moved from the aforementioned index cards (recording manually gathered citations) to computerised corpora. Thirdly, as corpora grew, the software tools for interrogating them became more sophisticated, thanks to the application of techniques developed in the natural language processing (NLP) community. Finally, we discuss the migration of dictionaries from print to digital media, and the implications of this change for both publishers and consumers.
Early Days of Computers and Lexicography
The first application of computer technology to dictionary-making came during the project which produced the Random House Dictionary of the English Language (1966). Managing Editor Laurence Urdang recognised that the text of a dictionary entry consisted of a finite set of recurrent and clearly delineated components (headword, part-of-speech label, pronunciation, definition, and so on) which were well suited to being configured as a computer database. This was a huge advance, significantly enhancing the value of a text which could now be stored, searched, and output in new ways. In an early indication of the benefits of technology, it also brought efficiency savings, enabling a large project to be completed in a relatively short time. From our twenty-first-century perspective, Urdang’s insight seems self-evident, but in the early 1960s this was a brilliantly far-sighted innovation. Indeed, Urdang was so far ahead of his time that the computerisation of the Random House operation stopped short at the typesetting stage. The ‘production department failed to find a typesetter able to undertake computer-driven typesetting from customer-generated tape … Astonishingly, therefore, the whole dictionary was dumped onto paper and re-keyboarded at the printing house, adding a year to the schedule’ (Hanks 2008, 468). Nevertheless, within a couple of decades of Urdang’s groundbreaking work, the approach he pioneered had become more or less routine.
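To make Urdang’s insight concrete, the sketch below shows – purely as an illustration, with invented field names and values – how a single entry decomposes into a structured record that can then be stored, searched, and output selectively (real dictionary databases use far richer schemas):

```python
# Illustrative only: a minimal, hypothetical record for one dictionary entry.
entry = {
    "headword": "lexicon",
    "pos": "noun",
    "pronunciation": "/ˈlɛksɪkən/",
    "senses": [
        {
            "definition": "the vocabulary of a person, language, or branch of knowledge",
            "example": "the size of the English lexicon",
        }
    ],
}

# Once entries are held as records rather than continuous text, new kinds of
# output become trivial, e.g. listing every noun headword in the database:
def noun_headwords(entries):
    return [e["headword"] for e in entries if e["pos"] == "noun"]

print(noun_headwords([entry]))  # ['lexicon']
```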
The Corpus Revolution
Around the same time, another of the great visionaries of computerised lexicography, John Sinclair, was taking his first steps in corpus development and corpus analysis. With his colleagues Susan Jones and Robert Daley, Sinclair led the Office for Scientific and Technical Information Project, known as the ‘OSTI Project’, in the first half of the 1960s (Daley et al. 2004). The project aimed to provide an empirical description of collocation in English, and at the heart of the research was a corpus of 135,000 words of transcribed conversation stored on magnetic tape. In the report on this project, Sinclair was building on ideas developed by the British linguist J. R. Firth, who believed that the complete meaning of a word is always contextual, and that no study of meaning apart from a complete context can be taken seriously. Sinclair’s focus was on recurrent patterns in text (especially collocation) and their relationship with meaning, and he recognised at an early stage that very large corpora would be needed in order to investigate these phenomena on the scale required for lexicography. Electronic corpora of English – notably the Brown Corpus and the Lancaster-Oslo/Bergen (‘LOB’) Corpus – were beginning to be used in the study of grammar. But the well-known Zipfian distribution of vocabulary in a language meant that a corpus like Brown – at one million words – was too small to support an empirical description of the lexicon of English.
Another decade or so would pass before Sinclair returned to serious corpus study, and this time he planned a corpus an order of magnitude larger than Brown and LOB. At the beginning of the 1980s, Sinclair inaugurated a joint project with the publisher Collins, the outcome of which was the Collins COBUILD English Language Dictionary (1987), the first English dictionary compiled through the systematic exploration of a corpus.
The COBUILD project broke new ground in a number of areas, and technology played a major part at every stage. As dictionary text was compiled, it was initially written (by hand) onto a series of paper slips which ‘were specially designed to hold the information in a format suitable for computer input to the dictionary database’ (Krishnamurthy 1987, 79). This data was then keyed into the computer, and as the database grew, ‘facilities for interrogating [it] using almost any item (e.g. a syntax pattern, a particular lexical item entered as a synonym, etc.) were made available … and these facilities were refined and extended as the need arose’ (ibid.). By this time, it was no longer unusual for dictionary text to be stored as a computer database. For example, the first edition of the Longman Dictionary of Contemporary English (1978) was compiled in a database, and included a number of added-value features (including an elaborate system of semantic coding) which – though never visible to the dictionary’s end users – proved an invaluable resource for people working in the field of NLP. But the COBUILD project saw greater interaction than ever before between lexicographers and a database, and this contributed to the internal consistency of the eventual dictionary.
What really set COBUILD apart, however, was its systematic use of corpus data as the evidence base for the dictionary. In the later stages of the compilation process, the Birmingham Collection of English Texts (as the corpus was known) ran to twenty million words, but for much of the project a corpus of around 7.3 million words was used. By today’s standards this looks small, but collecting a corpus of this size was a laborious task in the early 1980s. Many of the source texts for the corpus did not exist in digital form, and the conversion process depended on an early version of the optical scanner – which, like most emerging technologies, was expensive, slow, and error-prone. The raw data for the corpus had to be processed by the University of Birmingham’s mainframe computer in order to produce concordances, testing the available technology to its limits. The lexicographers themselves, it should be stressed, rarely had direct access to computers. Concordances for the words they were working on were printed off from microfiches, and these hard-copy versions formed the basis for every linguistic fact recorded in the dictionary. The dictionary entries themselves, as noted above, were handwritten onto forms before being keyed by a separate team of computer officers. In technological terms, then, COBUILD mostly built on earlier work; its truly revolutionary contribution was its corpus-driven approach to describing language, which ushered in the new discipline of corpus linguistics. Notwithstanding the limitations described above, it also represented a significant advance in the application of technology to the creation and management of dictionary text. (For an excellent introduction to the COBUILD project, see Moon 2009.)
The decade or so after the publication of the COBUILD dictionary coincided with rapid advances in computer technology. Increased processing power and data storage went hand in hand with dramatically lower costs, and, as personal computers became more widely available, working on computer (though not yet online) became the norm for lexicographers on most dictionary projects. Corpora, too, became progressively larger, and the business of corpus creation – though still a far from trivial task – was no longer quite the heroic undertaking it had been in earlier years. During this period, corpus-based lexicography became standard practice among British dictionary publishers.
The Role of Natural Language Processing
The concordances used in the COBUILD project differed from what we are familiar with today in two important respects. First, they came in the form of static printed pages, with concordances sorted to the right: what you saw was what you got, and options such as sampling or left-sorting were simply not available. Second, there was little in the way of linguistic annotation. The COBUILD corpus (at least for the duration of the original project) was neither part-of-speech-tagged (i.e. with each word token assigned a part-of-speech category) nor lemmatised (i.e. with each word assigned its corresponding basic form or ‘lemma’). Consequently, a lexicographer writing the dictionary entry for sound would be issued with separate concordances for each word form (sound, sounds, sounding, sounded), and would then have the job of disentangling the various adjective, verb, and noun uses; the concordance for the form sound, for example, would include instances of all three word classes.
This was an obvious bottleneck in the corpus analysis, but manual part-of-speech (or lemma) annotation has always been too time-consuming to be feasible for corpora of any reasonable size: annotating a corpus of ‘only’ 100,000 words would add months to the lexicographic process, as well as requiring a reliable annotation methodology to be established first. Yet having the annotation allowed the corpus query system to handle all the word forms of a lemma together, sparing the lexicographer a rather cumbersome synthesis of corpus results. Lexicographers therefore soon recognised the value of NLP systems which automate these tasks – specifically, part-of-speech taggers and lemmatisers. Fortunately, these were of primary interest to the NLP community too, as the first step (in terms of linguistic levels) of natural language analysis, and a great deal of effort was (and still is) put into developing high-accuracy tools for these tasks.
Technically speaking, part-of-speech tagging means assigning a PoS tag – a string of characters encoding morphological information about the word – to each word in the text. In the case of English, a language with rudimentary morphology, these tags are usually just atomic labels describing the part of speech, possibly with more fine-grained distinctions for some classes such as verbs (e.g. for tense, or for auxiliary verbs).
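As a minimal sketch of what this looks like in practice – using the spaCy toolkit purely as an example, not one of the tools used on the projects described here, and assuming its small English model has been installed – a tagger resolves the noun/verb ambiguity of a form like sound from its context:

```python
# Requires: pip install spacy
#           python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
for sentence in ("The sound of the bell faded away.", "These proposals sound sensible."):
    doc = nlp(sentence)
    print([(token.text, token.tag_) for token in doc])

# In the first sentence 'sound' should receive a noun tag (NN); in the second,
# a verb tag (VBP) - so a query for the verb no longer retrieves the noun uses.
```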
Every PoS tagger makes errors, so the information is never perfect: the accuracy of the earliest automatic taggers was around 70 per cent (meaning that three words in ten received a wrong tag), whereas today’s tools achieve around 98 per cent accuracy on English texts. This still cannot be regarded as ‘problem solved’ – 98 per cent per-token accuracy typically means that only around 50 per cent of sentences are tagged completely correctly, since even a 2 per cent error rate compounds over the thirty-odd words of a typical sentence, so follow-up systems that operate at sentence level still suffer from error propagation – but lexicography has greatly benefited from automatically PoS-annotated corpora.
Apart from making it possible to distinguish between sound-noun and sound-verb, PoS tagging enables the corpus to be queried for grammatical constructions, such as sound followed by an adverb. For this purpose, a technical language for querying corpora was designed at the University of Stuttgart in the early 1990s – the Corpus Query Language (CQL). Using this language, a corpus can be queried very precisely for a wide range of constructions and word combinations.
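To give a flavour of such a query (attribute names and tagsets vary from corpus to corpus, so this is only an illustration): in a corpus annotated with Penn Treebank-style tags, a CQL query along the lines of [lemma="sound" & tag="V.*"] [tag="RB.*"] would retrieve every instance of the verb sound immediately followed by an adverb, whatever word form actually appears in the text (the use of lemmas in queries is discussed below).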
A separate task is lemmatisation, which involves assigning a basic form (lemma) to each word in the corpus. Lemmatisation enables searches for all occurrences of a word (e.g. sleep), regardless of the particular word form (sleeps, slept, etc.). For English, lemmatisation is rather easy (there are only a few suffixes and little homonymy), but for morphologically rich languages it is a challenging task requiring the use of large morphological databases. Lemmatisation is usually performed after PoS tagging, so that the lemmatiser already knows whether a particular word ending in -s is, say, the third person singular of a verb rather than the plural of a noun.
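The deliberately naive sketch below (not a real lemmatiser – genuine tools rely on exception lists and large lexicons) illustrates why the PoS tag is normally assigned first: the same surface ending is stripped to a different lemma depending on the word class:

```python
# A toy lemmatiser for English, parameterised by part of speech.
IRREGULAR_VERBS = {"slept": "sleep", "was": "be", "were": "be"}

def lemmatise(word, pos):
    word = word.lower()
    if pos == "verb":
        if word in IRREGULAR_VERBS:
            return IRREGULAR_VERBS[word]
        if word.endswith("s"):
            return word[:-1]            # sleeps -> sleep (third person singular)
    if pos == "noun" and word.endswith("s"):
        return word[:-1]                # sounds -> sound (plural)
    return word

print(lemmatise("sleeps", "verb"))      # sleep
print(lemmatise("slept", "verb"))       # sleep
print(lemmatise("sounds", "noun"))      # sound
```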
PoS tagging and lemmatisation are the basic annotations present in most of today’s corpora. It is possible (and automatic tools are available) to add further layers of information, such as the output of a syntactic parser, or annotations for word senses or word families – but such annotations are rarer, because the accuracy of these tools is significantly lower and, except in special cases, they add relatively little to the querying options.
Mega-Corpora and Their Implications for Lexicography
The volume of corpus data available to lexicographers increased by two orders of magnitude in the fifteen years following the publication of the COBUILD dictionary. The British National Corpus (BNC), released in 1993, was created by a consortium of UK dictionary publishers (Oxford, Longman, Chambers) and academic institutions (Lancaster University, Oxford University Computing Services, and the British Library), with significant funding from the British government. Details of the design, creation, and linguistic annotation of the BNC can be found on the corpus website (www.natcorp.ox.ac.uk/corpus/). Its aim was to collect ‘samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the twentieth century, both spoken and written’. Annotation of the BNC, including PoS tagging and lemmatising, was carried out at UCREL, the Centre for Computer Corpus Research on Language at Lancaster University. Crucially, while the COBUILD corpus had reached twenty million words by 1987, the BNC was the first corpus of English to reach the 100-million-word mark.
For lexicographers and corpus linguists alike, the BNC represented a major advance. Yet within a further ten years, we begin to see corpora another order of magnitude larger. The first of these was the Oxford English Corpus, a billion-word corpus assembled by Oxford University Press and made up of texts dating from the year 2000 onwards. Larger corpora followed in quick succession, and at the time of writing, the largest available corpus of English runs to over thirty billion words – though effectively, there is no longer an upper limit on corpus size for English and other major languages. What drove these changes was the arrival of the Internet, which made assembling very large corpora a far less daunting enterprise than it had been in the 1980s and 1990s. And in another move towards working practices which we now take for granted, lexicographers began to work not only on-screen but also online – enabling large dictionary projects to be completed by geographically dispersed lexicographic teams.
For lexicographers working in English, the days of data-sparseness were well and truly over. But this presented its own challenges. Since the beginnings of corpus lexicography, the concordance had been the primary tool for language analysis. Scanning a couple of hundred instances of a word in order to identify its lexicographically relevant features requires skill and patience, but it is perfectly feasible. With the new mega-corpora, even relatively infrequent words come with thousands of concordance lines. For example, the word conceptual occurs 1,000 times in the 100-million-word BNC, which gives it a normalised frequency of ten hits per million words of text. On this basis, one would expect the COBUILD lexicographers to have had access to around seventy-five instances of this word. But users of a billion-word corpus would be confronted with perhaps 10,000 occurrences of conceptual (and most dictionary developers now use even larger corpora). And conceptual is only a mid-frequency word; for most really common English words, today’s corpora provide hundreds of thousands or even millions of examples. It is simply not practical to analyse that much data in the form of a concordance.
Random sampling was one response to this problem (take your 10,000-line concordance for conceptual and make a random sample of 500), but not a satisfactory one. What was needed was a technology which could ‘fully exploit the benefits of very large corpora, while preserving lexicographers from an excess of information’ (Kilgarriff and Rundell 2002, 808). A solution was on the horizon.
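The arithmetic behind these estimates, and the random-sampling workaround, can be sketched in a few lines (the figures are those quoted above; the concordance lines are placeholders):

```python
import random

def per_million(hits, corpus_size):
    """Normalised frequency: hits per million words of text."""
    return hits / corpus_size * 1_000_000

rate = per_million(1_000, 100_000_000)   # 'conceptual' in the BNC: 10 per million
print(rate * 7.3)                        # ~73 expected hits in a 7.3-million-word corpus
print(rate * 1_000)                      # ~10,000 expected hits in a billion-word corpus

# Random sampling: cut a 10,000-line concordance down to 500 lines.
concordance = [f"concordance line {i}" for i in range(10_000)]
print(len(random.sample(concordance, 500)))
```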
‘Lexical Profiling’ and the First Word Sketches
As early as 1990, it was recognised that the available tools for corpus analysis were no longer adequate for use with the larger corpora which were then emerging. A groundbreaking paper co-written by a computational linguist (Ken Church) and a lexicographer (Patrick Hanks) proposed the use of a word association metric from the field of information theory as a means of identifying statistically significant patterns of co-occurrence among words in a corpus. This opened up the possibility that a mathematical formula ‘could be applied to a very large corpus of text to produce a table of associations for tens of thousands of words’ (Church and Hanks 1990, 28). One of the potential applications of this approach was ‘enhancing the productivity of lexicographers in identifying normal and conventional usage’ (ibid.) by providing a simple summary (derived from the data in a corpus) of a word’s most salient collocates.
The metric used by Church and Hanks was the so-called ‘mutual information’ measure (MI), and corpus-querying tools of the time began to incorporate lists of frequent collocates as revealed both by MI and by another metric, the t-score. Initially these had little impact on lexicographers’ working practices, because the lists tended to be noisy and random, requiring a good deal of human effort to extract genuinely useful information. But Adam Kilgarriff and David Tugwell, working at the University of Brighton, hit on the idea of finding (and then grouping) collocates on the basis of an inventory of grammatical relations. These relations included combinations such as verb+object collocations (like forge+alliance, bond, link, partnership, etc.), adjective+noun combinations (sound, practical, useful, etc. +advice), adverb+adjective combinations, and many others. They also experimented with different word association metrics, in order to smooth out some of the biases in the lists generated by MI and the t-score. Their research led to the development of the Word Sketch, which provides an at-a-glance summary of the most important facts about a word’s combinatory preferences.
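The MI measure itself is simple to compute. A minimal sketch (the frequency counts below are invented for illustration) follows the Church and Hanks formulation, comparing the observed co-occurrence frequency of two words with what their independent frequencies would predict:

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI = log2( f(x,y) * N / (f(x) * f(y)) ), with N the corpus size in words."""
    return math.log2((f_xy * n) / (f_x * f_y))

# e.g. 'forge' + 'alliance' in a hypothetical 100-million-word corpus:
print(round(mutual_information(f_xy=120, f_x=2_500, f_y=8_000, n=100_000_000), 2))  # ~9.23
```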
The basis for Word Sketches is a set of queries in the CQL referred to earlier. Each of these queries describes a grammatical pattern, such as a verb and its object. Wherever the pattern is found in the corpus, the particular words (lemmas) filling the verb and object positions are extracted and recorded as a collocation candidate. The word pairs extracted by this procedure are then counted and scored statistically. The original Word Sketches used the MI score, but since 2006 another metric, the logDice coefficient, has been used, because it is not affected by the size of the corpus and has been found, through a great deal of trial and error, to deliver more satisfactory results.
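A sketch of the logDice coefficient as usually defined shows why corpus size drops out of the picture: only the joint and marginal frequencies of the two words are used, and only as a ratio (the counts below are the same invented ones as above):

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2( 2 * f(x,y) / (f(x) + f(y)) ); the maximum value is 14."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

print(round(log_dice(f_xy=120, f_x=2_500, f_y=8_000), 2))  # ~8.55, whatever the corpus size
```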
A primitive version of the Word Sketches was used in the late 1990s, during the development of the Macmillan English Dictionary (2002; now available at macmillandictionary.com). Around 8,000 sketches, using data from the BNC and covering the most frequent nouns, verbs, and adjectives in English, were loaded onto the computers used by Macmillan lexicographers (who at this stage still worked offline). Collocate lists were organised according to grammatical relations, and for each listed collocate, ten examples from the BNC were provided. The original purpose of using Word Sketches during the compilation stage was to complement the use of concordances and support a systematic account of collocation in English. But the Word Sketches turned out to have considerable diagnostic power. By delivering a user-friendly snapshot of the most important features of a word’s behaviour, Word Sketches ‘came to be the lexicographer’s preferred starting point for analyzing a given word’ (Kilgarriff and Rundell 2002, 817).
Further developments followed, as the same technology was applied to corpora in other languages. The Word Sketch function was harnessed to the fastest concordancer then available, resulting in the first version of Sketch Engine (Kilgarriff et al. 2014) – a software package which gradually acquired more functions (and many more corpora) to become the industry-standard corpus-querying system for lexicography in English.
Dictionary-Writing Systems
Following Laurence Urdang’s pioneering experiments in the 1960s, it gradually became normal for publishers to use computer databases to store and organise their dictionary content. One significant development was the digitisation of the Oxford English Dictionary (OED). The OED, published in twelve volumes in 1933, had been updated with a four-volume Supplement completed in 1986, by which point the dictionary ran to over 20,000 printed pages, all typeset using pre-digital technology. During the second half of the 1980s, it was converted into a digital database (in a joint project with the University of Waterloo, in Canada) – an enormous undertaking.
At this stage, each dictionary publisher would use its own home-grown software systems, developed in-house, and sometimes linked to general-purpose text-processing programs. This model began to be disrupted with the launch, in 1992, of a generic, off-the-shelf software package for dictionary production. Gestorlex, developed by the Danish software house TEXTware, was an ambitious product, consisting of a database module with a structured environment for writing and editing dictionary entries, and a corpus-querying system which generated concordances.
Gestorlex was too far ahead of its time to survive. Lexicographers were still working offline through the 1990s, and this applied to Gestorlex users too. The corpus and editing software had to be installed on individual PCs, and batches of work would be downloaded, edited, then uploaded to a central database. On top of this, the software was engineered to work on IBM’s new (at the time) operating system OS/2 – which never became a mainstream product. Although Gestorlex was used for a time by several dictionary publishers, including Longman, it proved to be a false start. But it set an impressively high bar for future developments.
Wide Internet accessibility led, finally, to the now-usual working model, in which lexicographers investigate corpus data and compile or edit dictionary text online, with the corpora and dictionary databases typically held on remote servers. To exploit this new environment, a number of dedicated dictionary-writing systems were launched in the early 2000s, including the South African TshwaneLex system and the Dictionary Production System (DPS) produced by Paris-based IDM, which (at the time of writing) is used by most UK dictionary publishers. These sophisticated packages generally include project-management modules and systems for publishing dictionary text in either print or digital media.
Automating Lexicographic Processes
Producing a dictionary is a complex and labour-intensive business. From the outset, the application of new technology has brought efficiency savings, and by the early 2000s, some of the processes involved in dictionary-creation had been substantially automated. With the advent of corpora, the creation of headword lists became simpler and more ‘scientific’. Broadly speaking, a dictionary which is planned to have 50,000 headwords will start from the 50,000 most frequent words in the corpus, plus (say) the next 10,000 words by frequency: this provides a candidate list from which the final headword list can be created.
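A highly simplified sketch of this frequency-based selection (assuming a pre-lemmatised corpus, and ignoring the filtering of proper names and other refinements a real project would apply) might look like this:

```python
from collections import Counter

def candidate_headwords(lemmas, target=50_000, buffer=10_000):
    """Rank lemmas by corpus frequency and keep the top target+buffer as candidates."""
    counts = Counter(lemmas)
    return [lemma for lemma, _ in counts.most_common(target + buffer)]

# Toy illustration with a tiny 'corpus' of lemmas:
toy_corpus = ["the", "sound", "of", "the", "bell", "sound", "advice", "the"]
print(candidate_headwords(toy_corpus, target=3, buffer=2))  # ['the', 'sound', 'of', 'bell', 'advice']
```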
The new generation of dictionary-writing systems has not only made life easier for lexicographers, but has also contributed to accuracy and systematicity. The internal ‘syntax’ of a dictionary entry is built into the software, and this facilitates the entry-writing process. For data fields with a finite set of options – such as style labels, grammar codes, or part-of-speech markers – lexicographers typically select from a drop-down list. This ensures that the content of these fields, and the order in which they appear in the entry, are controlled by the software. Thus, ‘human error is to a large extent engineered out of the writing process’ (Rundell and Kilgarriff 2011, 260). Meanwhile, various routine tasks are handled in the background, unobtrusively and without human intervention – notably the tedious and (for humans) error-prone business of ensuring that cross-references match up correctly.
Developments like these have relieved lexicographers of much of the ‘drudgery’ which Samuel Johnson famously saw as the lexicographer’s lot: the dull, intellectually undemanding jobs which nevertheless have to be done, and have to be done well. A more interesting challenge is to see how far one can automate the more complex parts of the editorial process. In response to a request from the UK publisher Macmillan, Adam Kilgarriff and his colleagues created an algorithm in 2008 to streamline the process of finding appropriate example sentences in a corpus (Kilgarriff et al. 2008). Examples are a feature of most dictionaries, especially pedagogical dictionaries, but the usual working practice – whereby lexicographers scan dozens or hundreds of concordance lines in order to identify suitable instances – is time-consuming and therefore expensive. The GDEX (‘good dictionary examples’) tool works from a set of criteria for what constitutes a ‘good’ example, such as sentence length, the presence or absence of rare words and proper names, and the number of pronouns a sentence contains. The corpus sentences which most closely match the criteria are then ‘promoted’ to the top of the concordance, providing the lexicographer with a shortlist of perhaps a dozen likely candidates. When the system works well, the task of picking examples for the dictionary becomes much simpler and can be performed far more quickly.
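By way of illustration only – this is not the actual GDEX algorithm, and the criteria and weights below are invented – a scorer built on exactly this kind of criterion might look as follows:

```python
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "this", "that"}

def score(sentence, common_words):
    """Reward mid-length sentences; penalise rare words, proper names, and pronouns."""
    tokens = sentence.rstrip(".").split()
    s = 1.0 if 8 <= len(tokens) <= 20 else 0.0
    s -= 0.5 * sum(1 for t in tokens if t.lower() not in common_words)   # 'rare' words
    s -= 0.5 * sum(1 for t in tokens[1:] if t[0].isupper())              # proper names
    s -= 0.25 * sum(1 for t in tokens if t.lower() in PRONOUNS)          # pronouns
    return s

def promote(candidates, common_words, top_n=12):
    """Return the best-scoring candidate sentences, i.e. the 'promoted' shortlist."""
    return sorted(candidates, key=lambda c: score(c, common_words), reverse=True)[:top_n]

common = {"the", "two", "parties", "forged", "an", "alliance", "he", "it"}
print(promote(["The two parties forged an alliance.", "He forged it."], common, top_n=1))
```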
Software tools such as Word Sketches and GDEX point the way to a new way of working. Traditionally, a lexicographer would carefully examine the corpus evidence in order to identify a word’s most significant features, such as its syntactic and collocational preferences, and to find good example sentences. With these new technologies, a different working model is emerging, where the computer presents the lexicographer with intelligently selected information, which can then be finalised by human editors – thus largely bypassing the laborious process of ‘manual’ corpus analysis.
Such automatically generated dictionary drafts exploit NLP technologies to derive headword lists, find corpus examples, induce word senses (by clustering collocates or by other means), and suggest collocation candidates, dictionary labels, or even definitions (which may take various forms, from explanations built around collocations or synonyms to multimedia). The focus then shifts to devising a solid methodology and tools for the post-editing workflow. One of the most recent steps in this direction is Lexonomy (Měchura 2017), a web-based dictionary-writing system designed for the post-editing of dictionary drafts. Its key features are functions which ease the ‘fixing’ of the drafts and which preserve the data’s connection to the underlying corpus evidence, so that lexicographers can quickly accept automatically generated content, consult the corpus data in case of doubt, and revise the drafted entries accordingly.
From Print to Digital
Most of our discussion has focused on the creation, editing, and management of dictionary text. But probably the most significant change since the year 2000 has been in the way that dictionaries are delivered to their users. The complete OED was released as a CD-ROM in 1989, and other publishers quickly followed Oxford’s lead. But at this stage, changes were largely cosmetic; these were not much more than ‘books in digital form’. The CD-ROM is in any case an obsolescent technology, and dictionaries in this form were soon superseded, as most publishers migrated their products to online platforms. (Many online dictionaries continue to exist in print editions, too, but this situation seems unlikely to continue for long.)
From a historical perspective, this is a very recent change, and it will be some time before its full implications become clear. But the concept ‘dictionary’ is in the process of being redefined. Most online dictionaries already include multimedia content, as publishers experiment with the opportunities which the new media offer. Features such as audio pronunciations, animations, and video clips are now common, and one can observe a general trend towards visual (rather than text-based) ways of presenting lexical information. Supplementary materials (blogs, Q&A features, quizzes, and the like) are becoming a normal part of a dictionary website, and the addition of ‘user-generated content’ (still in its early stages) is further broadening the scope of what users expect in a dictionary. Another obvious consequence is that dictionaries can now stay fully up to date. When dictionaries were printed books, they would typically be updated with new editions every four or five years. But in online media, new material (such as novel words or meanings) can be added at any time – and should be, as there is an expectation among users that they will find such data in their online dictionary.
With the move online, lexicography is entering a new era, and the consequences of this change are still being worked out. For example, the space constraints of a printed dictionary meant that the headword list had to be carefully selected, and publishers would have strict inclusion criteria. With no such limitations in digital media, how do we now decide which words should be included (or excluded)? Issues such as these are still under debate in the lexicographic community.
Conclusion
The automation of lexicographic tasks continues. Emerging technologies which will contribute to this include systems for detecting novel vocabulary (new words and new meanings of existing words), and tools for automatically applying subject-field labels (for example, where a word is predominantly used in medical or legal discourse). Developments like these reflect the interaction between lexicography and NLP, which has been so fruitful in streamlining many of the processes involved in creating a dictionary. There are still many aspects of the lexicographer’s job which require a great deal of skilled human effort, notably word sense disambiguation and definition-writing. But even these are unlikely to be intractable in the long term. In this chapter, we have shown how – over the last half-century or so – the application of technology, and collaborations with researchers in the NLP community, have transformed both lexicography and the dictionary. Further exciting developments are in the pipeline.