Data and Data Handling

Olga Fischer; Hendrik De Smet; Wim van der Wurff

doi:10.1017/9781139049559.003

2 Data and Data Handling

2.1 Introduction

In this chapter, we discuss several general issues in the use of data for the history of English syntax, having to do with data collection and the process of making initial sense of the material collected. We begin by considering the use of handwritten and printed texts (Section 2.2). Until the late 1980s, ‘texts on paper’ were virtually the only source of data available for historical studies of English syntax, and this means that many of the handbooks and standard treatments of the subject still in use today are based exclusively on such materials. Because paper texts also form the input to more recently developed materials, a proper understanding of their nature and characteristics remains vital for all historical syntactic work. After discussing the practice, problems and benefits of working with paper texts, we turn to the possibilities opened up by the use of more recent resources, in particular digital corpora (Section 2.3). We give a brief overview of the different types that are currently available for diachronic syntactic work, the advantages that their use can bring and also some of the problems associated with the use of corpora as data sources. Next, we consider two general issues that arise in the handling of historical syntactic data. The first is the issue of variation, in terms of dialect, social variety, text-type and – broadly speaking – style (Section 2.4). It is now generally recognized that there can be no change without variation, so this is an obvious focus of interest for all diachronic syntactic study. A second, partly overlapping issue is that of data patterning (Section 2.5). Once paper texts and/or corpora have been mined for relevant data and these have been grouped according to factors like dialect, social provenance and text-type, what are then the patterns that we could or should look for and what are some of the problems that might arise in identifying them?

It will be clear that there are many theoretical decisions that will influence the selection and handling of data in the study of any specific topic in the history of English syntax. Thus, if one’s aim is to characterize the internal grammars of speakers of English at different points in time, the lack of spoken records for all but the most recent periods makes it necessary to draw motivated inferences on the basis of necessarily incomplete data, whether these are in paper or digital form. Moreover, unlike in research on present-day syntax, there are – in spite of what Lightfoot (1979: 6) suggests – no native-speaker judgements for earlier English; hence it is impossible to be certain about the grammaticality of many types of sentences. In historical studies with a sociolinguistic focus, such considerations may be less of a problem, but it should not be thought that such work therefore stays ‘closer’ to the data in any real sense of the word. The attribution of geographical, social or stylistic properties to a text is the result of a complex process of reasoning, involving decisions about the overall system of categorization to be adopted, the establishing and weighing of criteria for assignment to specific categories, and comparison with other texts. If evidence on which to base these decisions is incomplete, as it usually is in historical work, the whole enterprise goes well beyond what is in the data as such. Clearly, the kind of patterns on which one wants to focus should have some connection with the data, but their nature will also inevitably be influenced by the general view of syntax and syntactic change that is adopted.

This chapter aims to make explicit certain methods and concepts on which there is a large degree of consensus and which are indispensable for developing an initial description of any set of diachronic syntactic data. The next chapter gives an overview of how such initial descriptions can be and have been interpreted further, in sometimes widely divergent ways. All of this, we hope, will give the reader a sense of the evidential and theoretical basis underlying the accounts of specific syntactic changes presented in Chapters 4 through 9.

2.2 Data from Handwritten and Printed Texts

Data in English from before the year 1473, when William Caxton – still in the Low Countries at the time – printed the first book in English, are all written by hand. They include some early inscriptions, such as the runes on the Franks Casket and the Ruthwell Cross, but – unlike scholars working on early North Germanic – historical syntacticians of English have not paid much attention to such materials, which are few and tend to be short anyway. This is more than compensated for, however, by the large number of manuscripts surviving from the medieval period, many containing substantial amounts of English text. For purposes of syntactic inquiry, it is actually rather unusual to work with these first-hand sources: the standard method is to use printed editions of the texts in which one is interested.¹ A rich collection of this type is formed by the volumes in the various series published by the Early English Text Society (EETS), set up in 1864 partly to facilitate historical lexical research for what later became the Oxford English Dictionary. But these editions, the total number of which now approaches 500, are obviously also eminently suitable as sources for historical syntactic work. In fact, until the development (around the year 1990) of computer corpora containing historical texts, the EETS volumes, together with editions in other series, such as the Scottish Text Society and the Camden Society, and one-off editions of single texts, formed the main source for any serious data work on medieval English syntax. For the post-medieval period, there are very large numbers of printed texts, in which additional data can be found. Of course, handwritten texts did not stop being produced after the fifteenth century, and indeed many early Modern English (eModE) and late Modern English (lModE) diaries, personal notebooks and collections of letters have been published in printed versions suitable for historical syntactic work.

Diligent and intelligent perusal of enormous amounts of material of these various kinds has led to the large-scale descriptions of the history of English syntax found in works like Jespersen (1909–49), Mustanoja (1960), Mitchell (1985) and Visser (1963–73), as well as a host of smaller-scale investigations of particular syntactic phenomena (many of them conveniently reviewed in Denison, 1993). The wealth of detailed descriptive studies of this type has meant that the history of English syntax has become a favourite testing ground for ideas and hypotheses concerning syntactic change in general. Several of the theoretical models discussed in the following chapter were originally developed with reference to changes in English, with data often being drawn from reference works like those mentioned here.

One of the issues that all historical linguistic studies have to grapple with is that of the dating of texts. For example, there has been an (implicit) tendency in the study of OE syntax to regard the period as being rather homogeneous. However, given that the OE materials suitable for syntactic study span more than three centuries, this cannot be right. Hence, it is necessary in collecting and reporting data to make at least initially a distinction between early OE (as attested mainly in the works associated with King Alfred, of the late ninth century) and late OE (as attested in a more diffuse body of work, a prime representative being Ælfric, who wrote around 1000 AD). It has also been demonstrated, in particular by Allen (1992), that many manuscripts containing OE works are actually late copies, produced in the early ME period, thus making them potentially unreliable witnesses to properties of OE grammar. In dealing with texts from the Modern period, more subtle questions about dating issues can sometimes be addressed, such as whether texts should be seen as representing the state of the language during their year of composition or during the formative years of their authors. Thus, Arnaud (1983, 1998) presents data on the progressive in the nineteenth century which suggest that some authors appear to show stable usage through time, while others exhibit what Sankoff (2005) calls ‘lifespan change’.

Although work based on paper texts has yielded impressive results, in the form of detailed descriptions of many changes in English syntax through the centuries, there are obvious problems that it faces due to the nature of its data sources. The main problem is that reading texts for the purpose of finding examples of specific syntactic phenomena can be enormously time-consuming. In earlier work, this led to the inclusion of statements about the frequency of constructions that were quite imprecise and/or impressionistic (Visser, 1963–73, is particularly notorious in this respect). Moreover, it is often impossible to achieve full accountability, in the sense of considering all potential variants that express a certain meaning or share a common formal property. For example, in a study based on a large number of paper texts, Foster and van der Wurff (1995) consider changes in the frequency of clauses with object-verb order in late Middle English but are unable to provide full comparable data for clauses with verb-object order, instead relying on smallish samples of such cases. Yet another problem is that data work of this type can often not realistically be replicated. This is particularly problematic when, after data collection, it is realized that some crucial variant or alternative has been mistakenly omitted from the search. Reliance on data from paper texts can thus be an obstacle in implementing the ‘virtuous’ cycle where increased insight triggers the collection of additional data, which in turn increases insight, and so on and so forth.
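The notion of full accountability can be made concrete with a small calculation. The following is a minimal sketch, not a description of any published procedure: it assumes that every relevant clause in a text has already been hand-coded as one of the competing variants, and the function name `variant_shares` and the coded sample are hypothetical. The point it illustrates is that the share of one variant can only be stated once all competing variants have been counted.

```python
from collections import Counter

def variant_shares(observations):
    """Return each variant's share (0.0 to 1.0) of all coded observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {variant: n / total for variant, n in counts.items()}

# Hypothetical hand-coded clauses from one text (illustration only):
# "OV" = object-verb order, "VO" = verb-object order.
coded_clauses = ["OV", "VO", "VO", "VO", "OV", "VO", "VO", "VO"]
print(variant_shares(coded_clauses))
```

If only the OV clauses had been collected, the denominator would be missing and no share could be computed, which is the limitation noted above for studies that rely on samples of the competing order.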

Nevertheless, there are also advantages to the use of paper texts in searching for data. For one thing, the careful reading that it necessitates ensures that the researcher gets a sense of the complete syntactic system of the language of the text, rather than limiting their attention solely to the one phenomenon under consideration. If one believes the structuralist tenet that language is a system where everything coheres (un système où tout se tient), this is obviously important. From a practical point of view, it can be helpful in identifying phenomena which have similar functions or forms. For example, while doing an exploratory search in some eModE texts for the NP the same used as an anaphoric marker, as in (1) (with antecedent given in italics), Leung and van der Wurff (2011) noted the frequent occurrence in the same texts of NPs of the type the said N with rather similar function, as in (2).

(1) The Cooks they wrought both day and night in many curious devises, where was no lacke of gold, silver, or any other costly thing: the Yeomen and Grooms of his Wardrobe were busied in hanging the Chambers with costly Hangings, and furnishing the same with beds of silke and other furniture for the same in every degree (1641, Cavendish, Th.Woolsey, 74).

(2) ere we came to Standingfield, the Cardinall of Lorraine a goodly young Gentleman gave my Lord a meeting, and received him with much joy and reverence, and so passed forth with my Lord in communication untill wee came neere the said Standingfield, which is a religious place standing betweene the English, French, and Imperiall Dominions (1641, Cavendish, Th.Woolsey, 42).

Accordingly, it was decided that developments affecting these two expressions should be compared, as part of a study of anaphoric devices that were frequent in eModE but that have since become rather minor variants – not an earth-shattering idea in itself, but it might not have arisen at all if attention during data searching had been limited to sentences of type (1) only. In working with material from older periods, there is a certain danger that expectations based on present-day English will create blind spots or colour the perception of earlier data. Reading longer stretches of text is no guarantee that this will not happen, but it may help avoid at least some of its negative effects.

2.3 Digital Data

Over the past two decades, there has been great activity in the creation of digital corpora of texts suitable for English historical linguistic work. As with corpora for PDE, various types of historical corpora can be distinguished, according to how they have been planned: balanced (i.e. containing specific proportions of material of different kinds) versus opportunistic (i.e. containing material within the period of interest chosen simply because it is easily available); general (covering a wide range of materials) versus specialized (focusing on texts that meet some very specific criteria); bare (not enriched with linguistic coding) versus tagged (enriched with part-of-speech labels) versus parsed (fully syntactically analysed). Such classifications, though, should not be taken as more than somewhat imprecise labels which can be convenient for identifying corpora of certain kinds but which may not be helpful in discussing other kinds (see e.g. Lüdeling and Kytö, 2008–09, Bennett et al., 2013, for full discussion of these and many other aspects of corpus use, planning, creation and classification).

The mother of all diachronic English corpora is the Helsinki corpus, a 1.6-million-word, largely balanced, general and initially bare collection of texts approximately spanning the years 730–1710, compiled at the University of Helsinki in the late 1980s and described fully in Kytö (1996). Data from this corpus have been used in a tremendous number of historical syntactic studies – some early examples can be seen in the papers in Rissanen et al. (1993, 1997). Several additional corpora have been created collaboratively by the University of Pennsylvania and the University of York. These are partly based on the Helsinki corpus materials but add full syntactic annotation to the texts. They cover the OE period (the York-Toronto-Helsinki Corpus of Old English Prose, which contains all major OE prose texts, and the York-Helsinki Corpus of Old English Poetry, containing all surviving OE verse) as well as the ME, eModE and lModE periods (the three Penn-Helsinki Parsed Corpora of these periods). All of these are general and balanced. They use the same system of syntactic annotation and the same query language (named CorpusSearch), which takes some learning effort but allows users to easily search for syntactic patterns through the entire history of English.

In addition, there are now dozens of specialized historical corpora, covering many specific text-types and/or subperiods. Several of these have also been compiled at Helsinki, such as the Corpus of Older Scots (with nearly 1 million words of Middle Scots) and the Corpus of Early English Correspondence, in its first version containing the text of c.6,000 letters written in the period 1410–1680 (Raumolin-Brunberg and Nevalainen, 2007). There is a corpus containing the proceedings of the Old Bailey in the eighteenth and nineteenth centuries (Huber, 2007). The Corpus of Late Modern English Texts (currently at version 3.1) contains 34 million words of British English prose from the years 1710–1920 (De Smet et al., 2015). A collection of early newspaper writing (1660–1800) can be found in the 1.6-million-word Zurich English Newspaper Corpus (Lehmann et al., 2006). Of particular note are also the corpora developed by Mark Davies at Brigham Young University, which include the Time Magazine Corpus (containing the full text of this magazine since its inception in 1923) and the Corpus of Historical American English (spanning the period 1810 until 2006 and comprising 400 million words), both with part-of-speech tagging and searchable with the same intuitive search engine. In addition, many more corpora exist and others are currently being compiled or planned (for fuller descriptions of some corpora, see e.g. Beal et al., 2007; for more recent corpora see Section 2.4).²

Some of these corpora have no linguistic tagging, but this need not be an obstacle to their exploitation for diachronic syntactic work. As long as the researcher is examining a pattern which has one or more specific lexical items uniquely (or nearly uniquely) associated with it, data for many syntactic phenomena can be collected. This can also be done in online text collections primarily created for the purpose of literary or historical rather than linguistic investigation, such as ‘Early English Books Online’ (containing all English books published before 1700) and the historical archives of The Times of London and The Guardian, which can all be treated as large digital sources of data.
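A lexically anchored search of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the interface of any particular corpus tool: the function name `kwic` and the sample string (pieced together from examples (1) and (2) above) are hypothetical, and the text is assumed to be available as a plain string. It retrieves keyword-in-context lines for the anaphoric markers the same and the said.

```python
import re

def kwic(text, phrase, width=30):
    """Return keyword-in-context lines for each whole-word match of `phrase`."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left}[{match.group(0)}]{right}")
    return lines

# Sample assembled from examples (1) and (2); illustration only.
sample = ("furnishing the same with beds of silke and other furniture "
          "for the same in every degree; untill wee came neere the said Standingfield")

for phrase in ("the same", "the said"):
    for line in kwic(sample, phrase):
        print(line)
```

Two caveats apply to any search of this type. Historical spelling varies, so several spellings of the same item usually have to be searched for, and every hit still has to be read in context, since a string match cannot by itself distinguish anaphoric the same from other uses of the phrase.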

The advantages of working with digital corpora will be obvious: because it is many times faster than the method of reading through paper texts, it is possible to inspect much larger amounts of text and, in some cases, to aim for full accountability to the data; it is also replicable. In addition, when one has access to the complete digital files of a corpus, there is the possibility of enriching it with additional linguistically relevant information. In terms of hypothesis generation and testing, if one has a hunch about an explanation or a pattern based on initial corpus data, it is usually fairly easy to do a quick search in the same corpus to see if the hunch might be worth exploring in further detail. Corpus work therefore accommodates itself well to the virtuous cycle of data triggering new ideas, to be checked against additional data.

But working with corpora also has some potential disadvantages. In the interviews with fourteen corpus linguists collected in Viana et al. (2011), one of the questions addressed was the possible weaknesses of corpus analysis. Although none of the interviewees specifically focused on the use of corpora for historical syntactic study, they identified the following areas of concern which are also relevant to our discussion:

(i) there can be tension between what is easily retrieved through corpus searches and what is thought to be linguistically most significant; a historical syntactic case in point involves patterns of co-reference of noun phrases, as in (1) and (2); these have been largely neglected because they involve information status, which is currently not part of any standard annotation scheme (but see Komen, 2009, for ideas on how to incorporate it);
(ii) when a data search yields large numbers of hits, there may be a temptation to interpret corpus results merely as numbers, which is a severely reductive approach; in cases of grammaticalization, for example, changes in frequency may act as tell-tale signs (see Hopper and Traugott, 2003: 126ff), but an exclusive quantitative focus will mean that one is ignoring the changes in meaning and context that form the core of the process;
(iii) the substantial amounts of data that can be collected from a corpus can also blind researchers to the dangers of making generalizations about the language as a whole on the basis of a partial view of it; this is a particularly relevant problem for diachronic research, because we only have very incomplete evidence for the state of the language in any historical period (see Section 2.6 for further discussion);
(iv) trying to achieve greater representativeness by collecting and comparing data from various corpora can also be tricky: principles guiding text inclusion vary widely, there is little standardization in user interfaces, and they can require a significant time investment to learn to operate.

2.4 Data and Variation

From whatever source one’s historical syntactic data are collected, a vital part of processing the findings for further interpretation is categorization in terms of a number of parameters of variation. The first of these – not surprisingly – is temporal, as expressed by date of writing of the texts and/or date of birth of the authors. Because syntactic change tends to be somewhat slow, there is usually little point in comparing data separated by a time span of only a few years. Hence, in many studies, the periods investigated cover from 50 up to 150 years. In the study of medieval English, the nature of the data sources sometimes makes it impossible to set very precise boundaries, but it would generally be felt that treating a period longer than two or three hundred years as one single data point (e.g. by taking the average frequency value of some phenomenon in this period) will result in too much loss of information. At the lower end of the scale, a period of about thirty years can already yield interesting syntactic change, as demonstrated by several studies of recent change in written English (e.g. Mair and Leech, 2006, Leech et al., 2009), using the so-called Brown family of corpora, currently consisting of the Brown corpus and the Lancaster-Oslo/Bergen (LOB) corpus (each containing 1 million words of English from the 1960s, from the United States and the UK, respectively), the Frown corpus and the FLOB corpus (with comparable materials from the 1990s) and the BLOB-1931 corpus (a BLOB-1901 corpus is in the making). The existence of such relatively fast change has also been reported in specialized studies of historical texts (e.g. Arnaud, 1983, 1998; Raumolin-Brunberg, 2009).
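The effect of the chosen time span can be illustrated with a small calculation. The sketch below rests on assumptions of our own: each attested example is represented only by the year of its text, the amount of text available per period is known, and the function name `rate_per_period` and all figures are hypothetical. It groups hits into periods of a fixed length and normalizes the counts per 10,000 words, so that periods with different amounts of surviving text can be compared.

```python
from collections import defaultdict

def rate_per_period(hit_years, words_by_period, span=50, per=10_000):
    """Group dated hits into `span`-year periods and normalize by text size.

    hit_years: iterable of years, one per attested example.
    words_by_period: dict mapping a period's start year to its word count.
    """
    counts = defaultdict(int)
    for year in hit_years:
        counts[(year // span) * span] += 1
    return {start: per * counts[start] / words
            for start, words in sorted(words_by_period.items())
            if words > 0}

# Hypothetical figures, for illustration only.
hit_years = [1512, 1530, 1548, 1561, 1575, 1580, 1599]
words_by_period = {1500: 200_000, 1550: 100_000}
print(rate_per_period(hit_years, words_by_period))
```

Rerunning the same data with a larger `span` shows how much detail is lost when a long stretch of time is collapsed into one data point, although with narrow spans the counts per period may become too small to interpret.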

Another obvious dimension of variation lies in the dialectal or regional nature of the material examined. On the whole, historical syntactic research has tended to adopt the dialect divisions developed in phonological and lexical studies and tried to establish whether there are also syntactic differences reflecting these divisions. A major problem in this is the strong dominance in much writing of standardized kinds of English. Even in OE, the West-Saxon variety was in use in Anglian areas as well, making it somewhat difficult to spot dialectal variation, certainly at the syntactic level. But there is some. One well-established case concerns incorporation of the negative marker ne into verbs (forms like nis ‘not is’ and ne-wat>nat ‘not knows’), a feature most typical of West-Saxon OE (see Hogg, 2004). In ME, there is much more dialectal diversity in the texts (Milroy, 1992), and although the syntactic side of this has been somewhat neglected until recently, it is now clear that there are, for example, North-South differences in the position of the finite verb in main clauses (Kroch et al., 2001), in subject-verb agreement patterns (de Haas, 2011) and in the use of relative markers (Suárez-Gómez, 2009). One possible avenue for further work might be to examine closely the prose texts in manuscripts localized by means of the ‘fit’ technique in work for the Linguistic Atlas of Late Middle English (McIntosh et al., 1985) and try to determine whether the phonological and spelling variation exploited in that project is matched by syntactic differences.

For the eModE period, syntactic investigation has until recently largely concentrated on printed works, which right from the start show rather uniform use of syntactic constructions. In fact, Görlach (1999b: 492) states that ‘evidence of Early Modern English dialect syntax is almost nil’. But progress in this area is now being made by careful exploration of hand-written texts, for example in the Corpus of Early English Correspondence. Nevertheless, the amount of dialectal grammatical variability detected so far is not very great and tends to be more morphological than syntactic in nature (for examples, see Nevalainen, 2011). For lModE, there is a certain amount of printed dialect writing available and there are some contemporary comments on dialectal features (see Ihalainen, 1994), but the work done so far on dialect syntax is limited and few results have been reported within British English (see Kortmann and Wagner, 2010, for an overview). The best prospects in this area seem to lie in first identifying differences in syntax between present-day regional varieties, as is done in Trousdale and Adger (2007; a good example of this can also be found in Anderwald, 2005, on the use of double negation in the North versus the South of England), and then examining earlier materials for their possible presence.

This method has been used to good effect in tracing the development of several syntactic differences between British and American English during the nineteenth and twentieth centuries. Present-day differences are reasonably well documented (in works like Tottie, 2002 and Algeo, 2006), and there are now several corpora that can be used to study their historical origins. Rohdenburg (2009), for example, argues that corpus data show American English increasingly favouring less explicit grammatical marking, as shown by the fact that it has led the way in changes like the loss of reflexive marking with verbs such as to straighten (oneself) and the loss of to be in passive complements of ‘order’ verbs, as in He ordered it (to be) done.³ Additional studies making such historical trans-Atlantic comparisons can be found in Rohdenburg and Schlüter (2009).

Syntactic variation conditioned by social factors is probably not recoverable for the OE and early ME periods, owing to the sparsity of texts for comparison and the total absence of information about many of the authors of this time. For late ME and eModE, however, some good results have been achieved. Bergs (2005) studies the effect of social networks in the language of the Paston Letters and, again, the Corpus of Early English Correspondence has been an important source of useful data of this type (see Nevalainen and Raumolin-Brunberg, 1996, 2003, where a range of morpho-syntactic features are examined). For the eighteenth century too there has been some work, with letters again being a favourite data source (see Tieken-Boon van Ostade, 1987, on social class and gender as factors influencing use of auxiliary do, Sairio, 2009, on the use of preposition stranding and of the progressive within one specific social network, and Laitinen, 2009, on gender variation in the use of you were and you was). For the nineteenth century, there are many incidental observations on socially conditioned grammatical variation (e.g. Phillips, 1984) and a few more systematic studies of certain topics (Pratt and Denison, 2000, on the diffusion of the passive progressive, Kytö and Romaine, 2006, on gender differences in adjectival comparison), but there is clearly scope for further work.

Yet another source of variation is differences in text-type or register. This is a factor that is taken into account in the creation of virtually all historical corpora, and it is usually not difficult to pinpoint the more obvious text-typical features of the material one is working with. The general finding in work that has considered the role of this factor is that it can cause very big differences in syntax between texts. An obvious case in point is the differences between poetry and prose. These have been studied in particular for the medieval period, for which much of the surviving textual evidence consists of verse. With regard to word order, for example, Foster and van der Wurff (1995) show that, in the fourteenth and fifteenth centuries, poetry and prose increasingly diverge in their use of pre-verbal objects, with prose showing a clear decline but poetry maintaining stable levels, a clear sign that this order is becoming a marked option for language users. Attempts have also been made to exploit the rhythmic organization of verse to draw inferences about the relation between word order and intonation characteristics of the language, something which is otherwise very difficult or impossible to do in historical material. Thus Pintzuk and Kroch (1989) argue on the basis of data from the poem Beowulf, which they take to represent a very early type of OE, that post-verbal position of objects in English was originally restricted to NPs separated from the verb by an intonation boundary. Comparing Beowulf with the rhythmical prose of Ælfric, Taylor (2005) argues that this restriction was gradually loosened in the course of the OE period, suggesting that post-verbal objects were becoming a more freely available option.

In spite of these and other ways in which data from verse can provide insight into processes of syntactic change, poetic usage must be treated with care because part of it may be the result of hard-to-pin-down literary conventions and the even more intangible effects of a poet’s attempt to create local meaning. Most studies of historical syntax therefore focus on prose. Within this category, there are also big differences in the use of syntactic options so that, normally, finer distinctions are made, often depending on what kind of material is available for different text-types in the period of interest. The Helsinki Corpus, for example, contains texts in the following, partly overlapping general categories: statutory (legal and official documents), secular instruction (handbooks, texts on astronomy, medicine, philosophy and education), religious instruction (treatises, homilies, sermons), expository (texts on astronomy, medicine and education), non-imaginative narration (history, biography, religious treatises, travelogues, diaries) and imaginative narration (fiction, romance, travelogues). Grouped in one way or another, these are naturally also the categories distinguished in the numerous studies based on data from the Helsinki corpus. The overview of historical corpora in Section 2.3 reveals some of the other text-types that have been singled out for special attention, such as letters, newspapers and court-room language.

Although the majority of studies focus on the development of a single syntactic phenomenon or a small set of related phenomena in one or more text-types, there have also been studies examining the textual distribution of clusters of features. For example, inspired by Douglas Biber’s (1988) work on text-types in PDE, Biber and Finegan (1989) show that, since eModE, the text-types of fiction, essays and letters have become steadily more oral, in the sense of having higher frequencies of features associated with a more involved, less elaborated and less abstract style. Concretely, this means they have come to use, for example, more present tense verbs, more second person pronouns, more that-deletion, more time adverbials, fewer passives and fewer adverbial clauses. The same method has since been used to trace the development of other text-types, such as scientific writing (Atkinson, 1999) and newspaper editorials (Westin and Geisler, 2002).

For some researchers, the holy grail of historical syntactic study is the identification of properties of everyday spontaneous speech in historical periods. We have only glimpses of what this text-type was like, but – on the basis of work on PDE – certain other text-types can reasonably be taken to provide approximations to naturalistic speech, in particular personal letters, court-records and (certain kinds of) drama.⁴ Work exploring such registers is now abundant (see Claridge and Walker, 2001, and Laitinen, 2008, for some examples, in addition to several mentioned earlier in this section). Where such texts converge in their use of syntactic constructions while diverging from other text-types, it does seem reasonable to say that they to some extent provide a window onto properties of earlier speech, including its role in the origin and propagation of innovations.

A method for investigating change in spoken data more directly is now also available: a Diachronic Corpus of Present-Day Spoken English has been developed at UCL. This contains transcribed versions of different categories of spoken British English dating from the 1950s till the 1990s. Study of such data is interesting also from the perspective of the idea that the grammar of speech is fundamentally different from that of written text, as claimed by researchers like Brazil (1995) and Miller and Weinert (1998). If they are correct, the question arises whether the relevant properties can be subject to diachronic change and how such change would take place. Obviously, only direct comparison of diachronic spoken data can lead to answers. But even if one believes that speech and writing do have shared fundamentals, there is much to be gained from studying changes in speech. Modern corpus work on everyday speech has firmly established that it is characterized by great amounts of repetition and very high proportions of fixed and semi-fixed phrases or lexical bundles (see Greaves and Warren, 2010, for an overview). It is natural to suppose that such pervasive features of speech have an effect on its development over time. Unfortunately, these features are underrepresented even in the historical registers mentioned earlier that come closest to speech. Yet if certain general pathways for syntactic change in speech driven by these features can be deduced from examination of contemporary material, it may be possible to project these back in time and make use of them in accounting for historical change. For some work along these lines, see Bybee and Torres Cacoullos (2009).

A rather special text-type which is abundantly attested during all stages of the English language is translations, in particular from Latin (during all periods) and French (in ME and eModE). The interpretation of data from such texts requires extra care, because specific features may partly reflect the syntactic system not of English but of the language of the original. One safeguard against faulty interpretations due to this is comparison with texts that have not been directly or indirectly influenced by a process of translation; comparison with the text in the source language is of course also essential. Using such methods, it has been established, for example, that some syntactic phenomena found in OE texts, such as raising-to-object, were not native to OE and occurred only as a result of literal translation from Latin (see Chapter 4). Note that such comparisons give the researcher something resembling negative evidence (i.e. information about syntactic constructions that were not licensed by the underlying grammar of the language at that point in time).

The foreign feature can also manifest itself in differences in relative frequency. An example is the use of object-verb word order in late ME: in most prose texts this is not frequent but there are a few translations from French with very high numbers of preverbal pronominal objects, closely reflecting usage in Old French (Foster and van der Wurff, 1995). Yet another way in which features of the source language can affect the language of the translation is through suppression of certain variants. Thus, Taylor (2008) notes that OE prepositional phrases are normally head-initial, but can have complement-P order if the complement is a pronoun, as in (3).

(3) Þa cwæþ se Hælend him to, Aris hal of ðam bedde
then said the Saviour him to arise whole from the bed (ÆHom.2,38)
‘Then the Saviour said to him, “Arise whole from the bed”’

Examination of non-translated and translated texts reveals that the word order in (3) is less frequent in texts translated from Latin.⁵ The reason appears to be that this order does not occur in Latin (except with the single preposition cum ‘with’); adoption of the Latin word order therefore has the effect that the translation has far fewer instances of what seems to have been an otherwise productive option in OE.

Of course, the (non-)use of certain syntactic constructions in translated texts due to influence of the source language can in the longer run lead to change, where such uses spread to other texts and may eventually become a general feature of English. Raising-to-object, for instance, eventually became part of the grammar of English. The topic of contact-induced syntactic change of this and other types is addressed in Chapter 4.

A final source of syntactic variation in historical texts that we should mention is style. This factor is difficult to operationalize for historical material, but one way it has been done is through levels of formality. Some of this is captured through the concept of register, as when a comparison is made between, say, everyday talk between friends (informal), a conversation with a stranger (moderately formal) and an address to a larger unknown audience (formal). But variation is certainly possible within these and other registers, and style may be a good term to refer to such variability. It could then include cases of jocularity, semi-serious use of old-fashioned or archaic grammar (as perhaps in (8) later in this chapter), and other special uses of language. Identifying such cases can be difficult and requires reading of paper texts – this is another area in which exclusive use of corpora is likely to lead to incomplete descriptions of the data. In spite of these difficulties, the effects achieved by speakers’ and writers’ stylistic choices are important. Recent work on syntactic change suggests that it is often very local in nature; hence, it is important to be aware of all factors that have an effect on the local selection of syntactic options.

2.5 Data Patterning

Although studies of different topics will naturally focus on different types of configurations in the data, there are some general patterns that many historical syntactic studies will try to identify. One such pattern is related to the conditioning of syntactic choices. In essence, this boils down to the question when certain options are used and when they are not. The conditions can of course relate to the dimensions of variation discussed in the previous section, but they can also be entirely linguistic in nature. A good example of the latter possibility is the word-order phenomenon in OE and ME known as verb-second, illustrated for OE in (4), where the initial element eall ðis ‘all this’ is immediately followed by the finite verb aredað ‘arranges’.

(4) eall ðis aredað se recere suiðe ryhte
all this arranges the ruler very rightly (CP.169.3)
‘all this the ruler arranges very rightly’

In the early 1980s, it was known that use of this word order is an option characteristic of main rather than subordinate clauses and that it is categorical in interrogatives, except when these are introduced by the word hwæðer ‘whether’ functioning as a kind of question particle (cf. Allen, 1980). Later, it was realized that verb-second tends not to occur in OE if the subject of the clause is a personal pronoun, except after a small set of specific clause-initial elements (Van Kemenade, 1987). Later still, the (non-)occurrence of verb-second with subjects that are not pronouns was shown not to be random but to depend on the nature of the clause-initial element (Pintzuk, 1991, Koopman, 1998). Furthermore, it was demonstrated that verb-second does sometimes occur in subordinate clauses but only when the verb is passive or unaccusative (Van Kemenade, 1997).

This example is in no way exceptional: usually more is going on in the data than one might initially be inclined to think. Failure to recognize the more fine-grained aspects of the conditioning of a syntactic option results in descriptions that are lacking in precision, and this in turn means that questions about the how and why of syntactic change become difficult or impossible to address successfully. It is true that progress in understanding is a cumulative and protracted process, but when reading historical syntactic work carried out a few or more decades ago, it can sometimes be difficult to avoid the feeling that the writers could have achieved more if they had inspected more carefully the way the data are patterned. The best precaution to ensure that one’s work will not provoke such a response at some point in the near future is to look at the data long and hard before drawing any conclusions from it. One can never be sure in advance what the relevant conditions are but it is always worthwhile to spend time on one’s data to determine the linguistic contexts which may cause variation in the use of a particular syntactic phenomenon (more on this in Chapter 3).

Another aspect of the patterning of historical data that is obviously important is the following: when is the phenomenon under investigation attested for the first or last time? In theory, these should be straightforward questions to answer, but in practice there can be various difficulties. An instructive example here is the search for the earliest cases of the progressive BE+being followed by an adjective or past participle. The latter sequence forms the passive progressive, a relatively recent innovation in the history of English. Nehls (1974: 158) gives (5) and (6) as the earliest cases.

(5) Sir Guy Carlton was four hours being examined
(1779, J. Harris, Letters, ibid.: 158)

(6) like a fellow whose uppermost upper grinder is being torn out by the roots by a mutton-fisted barber
(1795, Southey, Life & Correspondence, ibid.: 158)

However, Denison (1993: 431–32, 444) argues that (5) is not a secure example since being examined may be ‘a participial or gerundial phrase … used absolutely’ (ibid.: 431), leaving (6) as the earliest case. Denison (1998: 152), after having looked at more texts, is able to give an even earlier instance, from the year 1772. But this too has been antedated: Van Bergen (2013), searching in several lModE corpora that have become available since the late 1990s, has identified several instances that are even earlier, including examples from several London-based newspapers in 1761.

For cases of progressive BE+being followed by an adjective, Visser (1963–73: §§1834–35) gives examples like (7)–(9).

(7) With tendre youth was he hote being
‘He was hot (was being hot) with tender youth’
(a1500, Partenay)

(8) but this is being wicked, for wickedness sake
(1761, Johnston, Chrysal)

(9) You will be glad to hear […] how diligent I have been, and am being
(1819, Keats, Letters)

While data like this suggest that the construction has been in use since the start of the eModE period, Denison (1993: 396) points out that (8) and similar examples from the eighteenth century are actually not instances of progressive BE. The meaning of (8) is ‘this amounts to being wicked’, where being wicked is a gerund clause functioning as subject complement and is is the simple present tense of the copula. The result is that there are a few instances of this construction attested around the year 1500, then a long silence, followed by additional examples from the early nineteenth century onwards.

These examples with BE+being illustrate several of the problems that can make the search for earliest attestations difficult:

(i) there may be doubt (sometimes resolvable, sometimes not) over whether a concrete instance is actually an example of what one is looking for; because new constructions often seem to be based on ambiguous instances of constructions already in existence, this is a recurrent problem;
(ii) findings are inevitably affected by the (lack of) availability of texts, which in turn reflects the amount of scholarly work on text edition and corpus creation for the relevant period as well as the vagaries that determine the loss and survival of historical texts;
(iii) if one has to search for data in unparsed texts (as Visser, Denison and van Bergen did in trying to find passive progressives), the work can be very time-consuming;
(iv) even if one has all the texts and all the time that could exist in any possible world, there will probably be cases in which an early example of an innovative construction is found, followed by a period without any examples and then what looks like a restart, sometimes with several cases attested closely together.
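Problem (iv) can be made concrete with a small sketch. The Python fragment below is an illustration only, not a method used in the studies cited. It takes a list of attestation years that have been judged secure and reports the earliest one together with any silence longer than a chosen threshold. The function name `attestation_gaps` and the 100-year threshold are assumptions made for the example; in the sample record, 1500 and 1819 echo the dates of (7) and (9), while the other years are invented placeholders.

```python
from typing import List, Tuple


def attestation_gaps(years: List[int], min_gap: int = 100) -> Tuple[int, List[Tuple[int, int]]]:
    """Return the earliest attestation year and every silence of at least min_gap years."""
    if not years:
        raise ValueError("at least one attestation year is needed")
    # Sort and de-duplicate so that consecutive pairs are chronological neighbours.
    ordered = sorted(set(years))
    gaps = [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= min_gap
    ]
    return ordered[0], gaps


# Hypothetical record: 1500 and 1819 echo examples (7) and (9);
# the remaining years are invented placeholders.
record = [1819, 1500, 1834, 1850]
first, silences = attestation_gaps(record)
print(first, silences)
```

A sketch like this only summarizes the record as it stands. It cannot decide whether a doubtful instance belongs in the list in the first place (problem (i)), and every newly edited text or corpus may shift both the earliest date and the gaps (problem (ii)).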

Similar problems can arise when trying to determine the date when a construction passes out of use. In particular, problem (iv) – in this case taking the form of unexpectedly late instances of a construction – can be troublesome. The reason for such late survivals may be that a particular usage is not completely extinct but continues as a stylistically or lexically restricted option, for example as a marker of old-fashioned language or as a feature of certain more-or-less frozen expressions. An example is the use of negatives without do in PDE, as in (10).

(10) I kid you not.

Tieken-Boon van Ostade (1987) has traced the decline in frequency of do-less negatives in the eighteenth century and Varga (2005) shows they become very infrequent in the course of the nineteenth century (with verbs of cognition surviving longest and strongest). Yet the continued use of cases like (10) makes it impossible to assign a definite date of disappearance to this option. Instead, we have to recognize that the construction is undergoing obsolescence very gradually.

Part of every change, including the process of obsolescence, is quantitative and that is the final aspect of data patterning that we discuss here. Study of the frequency of diachronic data always involves comparison: of frequencies at different time periods and/or in different text-types, produced in different regions and/or by different types of authors (in terms of gender, social class or other characteristics). Moreover, it is often necessary to consider not just one syntactic phenomenon but to make a comparison between two phenomena, regarded as subtypes of a larger category. It is not always the case that there is patterning involving each of these factors, but if a phenomenon has undergone change over a longer period of time, all of them may well have played some role at some point. For example, in research on the history of possessive marking in English, Mustanoja (1960: 75) reports that there was a steep increase in the use of the of-phrase as compared with the genitive in the course of the ME period. If this shift was partly due to French influence (see Chapter 4), the social factor of prestige may have played a role in it. After 1400, the shift was to some extent reversed, with the genitive regaining some ground (Rosenbach et al., 2000). In terms of text-type, it has been shown that in the seventeenth century the genitive was frequent in poetry but also in texts reflecting language use at a personal and informal level (Altenberg, 1982). Work on more recent periods, such as Rosenbach (2003) and Mair and Leech (2006), has demonstrated that the genitive is still increasing in frequency as compared with the of-phrase, in particular in text-types belonging to more informational genres.
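The basic comparison described here, between two variants regarded as subtypes of a larger category, can be sketched in a few lines of Python. The fragment is an illustration under stated assumptions and not a reconstruction of any of the studies cited: the function name `variant_share`, the period labels and all token counts are invented. For each period it computes the share of one variant (here the genitive) among all tokens of the two competing variants.

```python
from typing import Dict, Tuple


def variant_share(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each period to the proportion of variant A among all A + B tokens."""
    shares = {}
    for period, (variant_a, variant_b) in counts.items():
        total = variant_a + variant_b
        if total == 0:
            continue  # no tokens of either variant: no proportion can be given
        shares[period] = variant_a / total
    return shares


# Invented token counts (genitive, of-phrase) per period, for illustration only.
toy_counts = {
    "period 1": (80, 20),
    "period 2": (30, 70),
    "period 3": (45, 55),
}
for period, share in variant_share(toy_counts).items():
    print(f"{period}: {share:.0%} genitive")
```

Proportions of this kind depend only on the tokens of the two variants themselves, so they remain comparable when the sub-corpora for different periods or text-types differ in size. The same calculation can be repeated for each text-type, region or author group.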

Quite complicated patterns can emerge once linguistic conditioning factors are also taken into account (e.g. the preference for genitives when the possessor NP is short, has human reference and encodes given information). The recognition that quantitative patterns in data can result from the interaction of many different factors has led to the use of increasingly sophisticated statistical methods of analysis. For example, Hinrichs and Szmrecsanyi (2007) examine possessive marking in the Brown family of corpora using multivariate analysis to establish the relative weight and interaction of no fewer than sixteen factors, including genre, dialect (UK or U.S. English), time, the nature of the final segment of the possessor NP (sibilant or not), length of possessor NP and possessum NP, etcetera. Like earlier work, they find an increase over the past few decades in the frequency of the genitive in press texts; their detailed findings on the factor weighting allow them to attribute this to a process of economization, where – all else being equal – shorter expression is favoured, rather than to a process of colloquialization, as had been suggested before.

2.6 Conclusions

In this chapter, we have covered some of the issues that arise in the collection and description of data for historical syntax. We have discussed the material basis of these data in handwriting or print (often coming to the researcher only after a certain amount of scholarly pre-processing, as in editions of earlier texts), the care that is needed in interpreting such materials with respect to date of creation and origins, the problems that arise when trying to extract precise and accountable information from them and the benefits that can come from reading entire texts on paper. We contrasted this with the use of digital corpora, available in increasing numbers and types and searchable with much greater speed and accuracy.⁶ Yet, it is important when using the latter to strike a balance between number-crunching and interpretation of the data in light of the (textual) world outside the specific pattern under examination. Aspects of this world that are likely to be relevant include the different dimensions of variation that have also been found to be relevant in non-historical work: dialectal or regional variation, social variation and variation in text-type. Usually, it will be found that syntactic data pattern along one or more of these dimensions, which may also play a role in the causation of change. In addition, there is nearly always language-internal patterning, in the form of linguistic conditioning of syntactic options, making for a total configuration that is potentially quite complex and may require the use of statistical analysis. At a simpler level, an obvious aim of practical data work on syntactic change is often to determine when a construction first appears or is lost – but this too can be more difficult than it sounds, particularly given the fact that constructions tend to fade away very slowly rather than disappear abruptly.

In studying data and all the types of patterns in which they can be involved, it is necessary to firmly keep in mind the limitations of the evidence that we have for most of the history of the language. Instead of a rich spectrum of vernacular regional dialects, what we have for OE are documents in mostly standardized versions of the dialect of South-Western England (with the early and later texts representing different localities in this area), with uncertain amounts of influence from Latin, from which many texts are translated, and with Celtic possibly still present as a substrate language (see Chapter 4). For ME, the evidential database is much richer, but there are huge gaps for early ME when writing in French occupied much of the space that writing in English might have done. Throughout ME there is also a dearth of materials that reflect more spontaneous types of language, and when these come in towards the end of the period (as in plays and personal correspondence) there is simultaneously an increasing effect of standardization. For eModE and lModE, existing materials are much richer but – certainly when it comes to texts that are easily available – they are heavily slanted towards the formal usage of middle-class males. And they are all written. While it is entirely possible for a change to find its origin in writing (see Biber and Gray, 2011, for an example), it is probably much more usual for innovations to start in speech. This means that historical material gives us only an imperfect data-set to work with. Because of the gaps in the written record and the complete absence of speech, including its patterns of intonation and its heavy use of fixed phrases and interactional routines, we cannot claim to have a good idea of what many regard as the normal locus of syntactic change. Some of the fluctuation and variability that is typically found in historical data may therefore well be noise, masking the regularities of an underlying system that we can access only very indirectly and incompletely.
