In this chapter, we provide an overview of what this book is about, how it connects with what you likely already know about English grammar, and how each chapter is structured to lead you from learning new concepts and skills to applying those concepts and ultimately to designing your own research projects.
1.1 What This Book Is About
“Discourse syntax” is the cover term for the study of all aspects of syntactic form that we can only explain with reference to the surrounding discourse. The objects of study are syntactic forms, constructions, or phenomena that can only be fully accounted for by taking the surrounding discourse (also referred to as “text,” regardless of written or spoken mode) into consideration. This is where this book is fundamentally different from a general introduction to English syntax, which first and foremost deals with the structure of English sentences without considering any surrounding discourse. For example, in an introductory English syntax class, one would discuss that in English the subject position cannot be left empty (hence we have “dummy subjects,” like there or it), that the object comes after the verb, and that questions require subject–auxiliary inversion. One might also discuss that under certain circumstances the object may come first in the sentence and one might introduce the name for such a construction (topicalization). Example (1a), taken from the freely available Corpus of Contemporary American English (COCA; Davies 2008), illustrates this construction; Example (1b) would be its canonical syntax counterpart.
a. The second batch I only microwaved for 4 minutes covered and that seemed to work. (COCA, Blog, 2012)
b. I only microwaved the second batch for 4 minutes covered and that seemed to work.
What one normally does not study in such a class is under which discourse conditions topicalization occurs, which can leave the impression that word-order variations like topicalization are simply a stylistic choice. But that is not the case! We cannot give a proper account of topicalization without specifying the discourse conditions that warrant it. The same holds for other constructions that you are probably already familiar with, like passivization, inversion, or left dislocation, which are all, in some way, non-canonical clauses, that is, clauses that are derived from more basic, or canonical, patterns through a change of word order or the addition of extra lexical material. Canonical and non-canonical patterns of a clause typically share the same propositional meaning (or proposition), which is to say that they are subject to the same truth conditions. There are no circumstances under which only sentence (1a) would be true, but not (1b), and vice versa. The second kind of phenomena considered in discourse syntax are constructions and elements that cannot be described as variations of another pattern but whose very function is to hold the discourse together and weave connections beyond the sentence, such as conjunctions and coordinators. We refer to the first set of phenomena as grammar in discourse and the second as grammar of discourse. Discourse syntax is therefore a field of linguistic study that describes and investigates patterns of syntax in language use, with the focus on illuminating the role of the surrounding text, rather than on the role of the speaker’s individual preferences or choices.
1.2 How This Book Connects with What You Already Know
Perhaps you are working with this book because it has been assigned as the textbook or supplementary reading for one of your classes. If that is the case, the chances are that you already have some knowledge of English syntax. You have probably studied parts of speech (nouns, verbs, adjectives, conjunctions, etc.), tests for the constituency of phrases (noun phrases, verb phrases, and so on) and functions they take in the clause (subject, object, adverbial/adjunct). You know that before in (2b) is a preposition, which makes the italicized sequence in (2b) a preposition phrase, while before in (2a) is a subordinator (conjunction), which makes the italicized sequence in (2a) no phrase at all.
a. We’ve got about 10 weeks before the trial comes up in January. (COCA, Movies, 1996)
b. By its very definition, any expertise a juror gained before the trial cannot be extrinsic. (COCA, Academic, 2017)
You are likely familiar with the distinction between simple and complex sentences and the functions that subordinate clauses can take. The that-clause in (3a) is a content clause (traditionally also known as an object clause), while the one in (3b) is a relative clause (modifying the noun plan).
a. He said that the progress of the action plan will be reviewed by him every month. (COCA, Web, 2012)
b. Choose the plan that’s right for you. (COCA, News, 2019)
You also probably know that English is a language that has the requirement that the subject position be filled (in finite, or tensed, clauses) and that this is the reason we have sentences like (4b), with a dummy subject. What you may not have studied are the circumstances under which speakers may choose to produce a sentence like (4a), with the what-clause serving as a subject, rather than the construction known as extraposition in (4b).
a. What had happened was clear. (COCA, Magazine, 1993)
b. It was perfectly clear what had happened. (COCA, Web, 2012)
You may also have heard that English is considered a strict word order language. The function of a phrase is indicated by its position in the clause rather than by a case-marking affix, as in other Germanic languages. In German, unlike in English, the function of a phrase can be expressed through morphology. For example, in (5), it doesn’t matter if we put the phrase den Jungen (the boy) in a preverbal or postverbal position. The accusative case marking -n indicates that the phrase functions as the direct object.
a. Den Jungen kenne ich nicht.
The[acc] boy[acc] know[1st-person-present] I not.
“I don’t know the boy”
b. Ich kenne den Jungen nicht.
I know[1st-person-present] the[acc] boy[acc] not.
“I don’t know the boy”
English used to be like that. Old English, the earliest recorded form of English (roughly from the fifth to the eleventh century), was a language in which grammatical relations could be expressed through inflection. However, by around the twelfth century, much of the case inflection in Old English had become optional. How and exactly why this happened is a complicated question, due to the limited availability of data. What we do know is that this “deflexion” (Allen 2016) happened most rapidly in the northern parts of England, where there was close contact with Scandinavians. It seems, therefore, that language contact is an accelerating factor, if not the initiating one. The further loss of inflectional morphology during the Middle English period led to a decrease in flexibility in word order for canonical sentences, which ultimately paved the way for non-canonical patterns to be associated with specific communicative purposes. One example would be the increase of constructions that allow for non-agents to fill the subject position and become the entity that the sentence is “about” (also known as the topic), such as passives, cleft sentences, and middle constructions (This jam spreads easily). In other words, in a language with a strong sense of canonical syntax, non-canonical patterns have to be motivated, and, more often than not, that motivation lies in the surrounding text. As unlikely as it sounds: A language with strict word order, like Modern English, is the perfect ground to study syntax beyond the sentence boundary.
1.3 How This Book Is Structured
This book is divided into three parts: Part I: Foundations (Chapters 1 and 2) introduces the concept of canonical and non-canonical syntax in English and situates the book in the context of variationist linguistics – the idea that linguistic variation does not occur at random, but is highly structured and influenced by linguistic and extralinguistic factors. Of all those factors, we will focus on the role of the surrounding discourse of different registers and on genre conventions. As a reference model for linguistic terminology, we will use the Cambridge Grammar of the English Language (Huddleston & Pullum 2002), which also comes in a less comprehensive student version (Huddleston & Pullum 2005). Another grammar we will often refer to is the Grammar of Spoken and Written English (Biber et al. 2021), which provides rich usage data on grammatical features in four register varieties (conversation, fiction, news, and academic prose). There is also a student version of the previous edition (Biber et al. 1999) of this grammar (Biber, Conrad & Leech 2002).
Reference Grammars of English
There is a myriad of grammars of the English language. While they all, in theory, describe the same facts, they do not all have the same goal. Some grammars intend to provide guidance to learners, some resemble style guides, and some use rather idiosyncratic terminology. We are choosing the Cambridge Grammar of the English Language as our point of reference because it is a comprehensive, synchronic account of English grammar, written with the goal of incorporating “as many as possible of the insights achieved in modern linguistics” (Huddleston & Pullum 2002: xv), without assuming great familiarity with theoretical linguistics. Using the Cambridge Grammar as our reference grammar means that we will adopt its terminology along with the premise that the role of the linguist is to “describe and not prescribe” (Huddleston & Pullum 2002: 2). By contrast, we will use the Grammar of Spoken and Written English (henceforth GSWE, Biber et al. 2021) mostly as a resource on usage data. The GSWE is based on data from a proprietary corpus (40 million words of text representing four main register categories) and, along with a description of the linguistic system, also provides information, including quantitative data, on how a grammatical feature is used in a particular situational variety of English. The GSWE itself closely follows Quirk et al.’s A Comprehensive Grammar of the English Language (1985) in its terminology. An earlier version of the GSWE was published as the Longman Grammar of Spoken and Written English (Biber et al. 1999). As you can see, a lot of thought goes into the writing of grammars, and the analysis of grammar writing has become a linguistic field in its own right, called “grammaticography.”
Part II: Grammar in Discourse (Chapters 3–5) focuses on grammatical phenomena in discourse and looks at how syntactic patterns that you are most likely already familiar with (like topicalization or particle shift) are realized in different discourse situations. The underlying assumption here is that the reason for choosing one syntactic option over another lies in the surrounding discourse. The three chapters in this part move from phenomena that affect the beginning of the sentence (such as any kind of fronting operation, Chapter 3), through phenomena in the core sentence (anything from subject to object position, Chapter 4), to complex sentence endings (any kind of construction that shifts material toward the end of the sentence, like extraposition, Chapter 5). If the names of these constructions don’t mean anything to you at this point, don’t worry: every construction that we highlight will be defined and illustrated with data from linguistic corpora (and we will also show you how to work with these corpora).
Part III: Grammar of Discourse (Chapters 6–9) looks at the grammar of discourse, in particular the way in which sentences are connected, and also discusses syntactic phenomena conditioned by or brought about by genre conventions. Chapter 6 deals with the most obvious way of connecting sentences, namely through extra linguistic material known as “connectives” (such as coordinators). Chapter 7 looks at how we can underspecify information in a given sentence (through the use of pronouns and elliptical constructions) because it refers to something already mentioned before. Chapter 8 introduces the concept of “discourse markers,” a series of elements (not necessarily unified in their appearance or syntactic behavior) that have the function of structuring the discourse. Lastly, Chapter 9 goes beyond the local discourse and discusses the role that register or genres as varieties of discourse have for shaping the syntactic characteristics of a text. You will see that certain constructions and phenomena may be used not so much because of any relationship with the surrounding discourse, but because the text is part of a whole discourse situation, and the relationship between discourse participants or the purpose of the text may determine the use of syntactic constructions. For example, in scientific abstracts we have an overrepresentation of passive-voice constructions because in many scientific fields there is still a convention to make the object of inquiry (the thing that is studied) the subject of a sentence, rather than the person who led the inquiry (the agent).
We like clarity in the classes that we teach and have aimed to achieve the same in this book. Each chapter will start with a definition of learning outcomes for the chapter. Each chapter will then introduce core concepts and questions necessary for the subsequent discussion, such as non-canonical word order or discourse grammar and cohesion. This introductory part is followed by a discussion of two or three selected syntactic phenomena or constructions that illustrate the concept under discussion, plus one subsection presenting exemplary empirical work (mostly based on corpus data, but occasionally also on experimental data). That way, you will gain insight into what is being studied in a field without being overburdened with all the complexities around the issue right away. References come both from recent work and from classics in the field. The chapters conclude with a summary, recommendations for further reading, and two kinds of exercises: level-one exercises will help you practice your analytic skills by applying the concepts introduced in the chapter, and level-two exercises will ask you to create or interpret data and guide you toward thinking about designing your own research. Throughout the text we will let you know when you should be ready to take on which exercise with this study icon:

For these exercises, we do not presuppose any prior experience with corpus-linguistic methodology or statistics and we include practical information on carrying out corpus-based research (how to do it, how to present and interpret data, but also which problems to watch out for) throughout the book. These tips will be marked with the toolbox icon.

The chapters are rounded out with squibs on interesting, but adjacent questions of language usage, clearly set apart from the main text in boxes labeled “Good to Know,” marked with the owl icon:

The index and glossary at the end will help you with studying specific concepts (terms that are printed in bold in the text when occurring for the first time have their own entry in the glossary). The index focuses on those pages where a concept is introduced or elaborated on. Overall, the chapter structure of the book makes it possible to cover the whole book in the course of one semester or to cover selected parts and use the book as a secondary textbook or resource book.
As the linguistic world we live in continues to develop, presenting us with new words, new modes of expression, and new communicative needs, we hope you will share our excitement about studying syntax beyond the sentence. It is our goal to empower you to pursue projects of your own that show that syntax is a lot more than a set of rules holding a sentence together.
Further Reading
The developing field of grammaticography is discussed in Ameka, Dench & Evans (2006). Articles in this edited volume deal with aspects like the role of linguistic theory in grammar writing and the distinction between native and non-native speakers.
If you are interested in learning more about the language-internal and external factors that made English the strict word order language it is today, you will find the articles in The Cambridge Handbook of English Historical Linguistics (Kytö & Pahta 2016) useful, especially Cynthia Allen’s article “Typological change: investigating loss of inflection in early English” (2016). The book also offers overviews on theoretical frameworks, such as historical pragmatics or generative grammar, and on methodologies, such as corpus linguistics and philological methods.
2.1 Introduction: Why Discourse Syntax?
In introductions to linguistics, we all learn that syntax is the study of rules that form sentences. If you are used to tree diagrams to visualize syntactic structures, you know that they never go beyond the boundary of a sentence. But we don’t speak in isolated sentences (actually, we often don’t speak in sentences at all) and syntactic phenomena do not stop at the sentence boundary. To demonstrate how strongly the structure of sentences is influenced by the surrounding discourse, let’s look at the sentences in Figure 2.1, taken from the beginning of an op-ed in The New York Times (August 2020) and numbered arbitrarily. Let’s put them in the order that seems most natural to you. Note that this text is an opinion piece – it does not recount a series of events. Therefore, in arranging the sentences, you cannot rely on your extra-linguistic knowledge of how certain types of events unfold (for example, we know that in a car accident the crash comes before anyone is taken to the hospital). Rather, you will apply your intuitive knowledge about discourse syntax. Keep track of the markers that you rely on for making your choice and write down the sequence of numbers that seems right to you.
Perhaps your thought process looks something like this: Sentence #6 does not make a good first sentence – a connective like and would typically not be the first word in a text. You also probably did not put sentence #4 first – a demonstrative pronoun like that needs to have an antecedent and it wouldn’t be clear what that refers to if sentence #4 came first. Similarly, you probably ruled out sentences #1, #3, and #5 as potential first sentences because they have personal pronouns (it, her, she) whose reference would be equally unclear. Sentence #8 is an elliptical structure (there is no verb in the main clause), which points to this sentence not being the first sentence in this text either. This leaves us with sentences #2 and #7. Sentence #2 has an indefinite noun phrase as its subject (Hundreds of thousands of undergraduates in America) and tells us something about them. This is achieved through the use of a passive construction (won’t be allowed). Sentence #7 uses a dummy pronoun (there) as a subject and thus does not really establish an entity the text could be about. That makes #2 the most natural first sentence. Once the topic of students and their return-to-college experience is established, sentence #4 is the one that is the best continuation of that topic. Sentence #4 is in question format, and sentence #8 is the best answer to that question because it picks up on the same topic, choosing the same noun, kind, but this time with a definite article, the. And so on.
Go to Exercise 1 to see if your sequence matches the one of the original op-ed. There will also be the opportunity to reconstruct another paragraph.
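The elimination steps in this thought process can be mimicked in a small program. The following Python sketch is only a toy illustration under loud assumptions: the cue lists are hand-picked, the function name `initial_position_cues` is hypothetical, and the sample sentences are invented (they are not the sentences from Figure 2.1). Matching bare word forms is also crude, since it cannot tell demonstrative that from the subordinator that, or a referential it from a dummy it.

```python
import re

# Hand-picked cue words (an illustrative assumption, not an exhaustive list).
CONNECTIVES = {"and", "but", "so", "or"}
DEMONSTRATIVES = {"that", "this", "those", "these"}
PERSONAL_PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def initial_position_cues(sentence):
    """Return the cues suggesting that a sentence is not text-initial."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    cues = []
    # A connective as the very first word points back to earlier text.
    if tokens and tokens[0] in CONNECTIVES:
        cues.append("initial connective: " + tokens[0])
    # A sentence-initial demonstrative needs an antecedent.
    if tokens and tokens[0] in DEMONSTRATIVES:
        cues.append("initial demonstrative: " + tokens[0])
    # Personal pronouns anywhere in the sentence need a referent.
    for token in tokens:
        if token in PERSONAL_PRONOUNS:
            cues.append("personal pronoun: " + token)
    return cues


# Invented sample sentences (hypothetical; not taken from Figure 2.1).
samples = [
    "And the rules keep changing.",
    "That is a hard question.",
    "Many students will return to campus in the fall.",
    "She packed her bags anyway.",
]

for sentence in samples:
    print(sentence, "->", initial_position_cues(sentence) or "no cues found")
```

A sentence for which no cues are found is merely a candidate for first position. Choosing among the remaining candidates, as with sentences #2 and #7 above, still requires looking at what the sentence establishes as its topic.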
It is this kind of syntax – constructions and grammatical devices that are chosen to integrate a sentence into the surrounding discourse – that we will focus on in this book. Studying discourse syntax means (a) making ourselves aware of options that speakers have to express an event or a statement, and (b) looking at the role the discourse has on choosing one of those options. For example, you probably know that almost all active voice sentences can be converted to passive voice sentences, but why is it that speakers choose one or the other option? You probably know that we can “front” almost any type of phrase in English, but why is it that speakers would choose a word order that is so different from what English normally looks like? On its own, a sentence like (2) sounds awkward, because it does not follow the canonical word order of English (Subject–Verb–Object), as exemplified in (1), but put into context, as in (3), the non-canonical word order is just right.
(1) You would never soak lentils.
(2) Lentils you would never soak.
(3)
a: When you make white bean soup, you might want to soak your beans first so that they don’t have to simmer so long.
b: What about lentils?
c: Lentils, you would never soak first, you just put them in the soup.
Studying discourse syntax, therefore, requires us to look at grammar beyond the question of what is canonical and what is not. What is non-canonical in an isolated sentence (e.g., the choice of passive voice) may be the preferred pattern in a specific discourse situation. “Discourse syntax” is thus the cover term for the study of all aspects of syntax that we can only explain with reference to the surrounding text, which itself is embedded in a situation. In this chapter you will learn how to investigate those patterns – the type of questions that are asked, the type of constructions that are looked at, the methodologies that are employed.
Good to Know: Discourse Syntax vs. Discourse Analysis
You may also have heard of the term discourse analysis, which applies to a different subfield of linguistics. In discourse analysis, the properties and patterns of different kinds of texts are investigated, the basic unit under investigation being the entire text or discourse itself. Typical research questions in this field include how texts perform specific acts of communication (e.g., an apology, an invitation) or reflect issues of society (e.g., discourses of migration or discourses of sustainability), and how they achieve coherence.
As you just saw above, one concept that is central to the study of discourse syntax is the concept of syntactic variation. This may at first glance seem like a contradiction to what we said in Chapter 1, where we discussed English as a language with rather strict word order. However, it is precisely this strict word order that makes syntactic variation such a powerful tool in English.
Let’s talk a little bit more about what we mean when we say “syntactic variation.” Most speakers are probably more familiar with the concept of language variation in the domains of lexical choices or pronunciation. One person’s soda is another person’s pop. Dialects can also have distinctive grammar features. For example, some Southern dialects in the United States consistently make use of double modals (might could), as exemplified in (4), while speakers from other regions may never use this feature, and African American English uses invariant be as a marker of habitual aspect, as in (5), a line from a song by American songwriter will.i.am.
(4) … we’ll never live up to the impossible standards so many in the world have for us. But, for the first time in my life, I think we might could do it. (COCA, Blog, 2012)
(5) All the time she be working working working (will.i.am, Mona Lisa Smile 2013)
We will not deal with dialects or other speaker-based variation in this book. Rather, we will look at syntactic choices speakers make based on the discourse situation. For example, the language used in spoken conversation among friends is quite different from the language used in newspaper articles. A grouping of language chosen in accordance with the discourse situation is referred to as a “register.” Registers can be defined at a very high level (for example, the GSWE is based on the four registers: news writing, academic writing, fiction, and conversation) or more locally, as sub-registers (for example, news writing includes the sub-registers international news and business news). Register analysis is concerned with syntactic patterns and choices motivated by the discourse situation. For example, since newspapers report past events, the language used in newspapers shows a high number of past tense verbs and time adverbials, as in the sentence in (6).
(6) The storm began Tuesday night, with snow starting first in most areas and changing to a wintry mix, then ice, then drizzle, starting south of Chicago. At 6 a.m., weather spotters throughout the area reported trace amounts of snow at O’Hare. (COCA, News, 2019)
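To make the observation about (6) concrete, one could count the relevant features in the excerpt. The sketch below is a minimal illustration, not a method used in this book: it assumes word lists compiled by hand for this one passage (a real study would rely on a part-of-speech-tagged corpus), and the names `PAST_TENSE_FORMS`, `TIME_ADVERBIALS`, and `count_features` are hypothetical.

```python
import re

# Example (6) from the text above.
TEXT = (
    "The storm began Tuesday night, with snow starting first in most areas "
    "and changing to a wintry mix, then ice, then drizzle, starting south "
    "of Chicago. At 6 a.m., weather spotters throughout the area reported "
    "trace amounts of snow at O'Hare."
)

# Hand-annotated for this one passage (an assumption for illustration).
PAST_TENSE_FORMS = {"began", "reported"}
TIME_ADVERBIALS = ["tuesday night", "at 6 a.m."]


def count_features(text):
    """Count the hand-listed past tense forms and time adverbials."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    past = sum(1 for token in tokens if token in PAST_TENSE_FORMS)
    # Multi-word adverbials are matched as substrings of the lowercased text.
    time = sum(lowered.count(phrase) for phrase in TIME_ADVERBIALS)
    return {"past_tense": past, "time_adverbials": time}


print(count_features(TEXT))
```

Counts from a single short excerpt say nothing about a register as a whole; they only become meaningful when the same procedure is applied to many texts from comparable discourse situations.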
Methodologically, it follows that register variation studies work with large corpora of data based on similar discourse situations. Register-based corpora typically consist of many excerpts from similar types of texts (including spoken texts), without much information about the speakers who produced them.
A note on our example sentences: Most sentences that we use for illustration in this book are taken from English language corpora, i.e., they are authentic attestations of the constructions we deal with, and not examples that we thought up. We do not want to imply that constructed sentences would not also work to illustrate a syntactic phenomenon, but we want you to get into the habit of working with corpus data, because for many research questions, quantitative data – how often a construction occurs – will be relevant. Section 2.4 will discuss in some detail how to access and gather data from a corpus. Individual search strategies for constructions, including some necessary shortcuts, will be presented in later chapters.
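Because corpora and their sections differ in size, raw counts are usually converted to a rate of occurrence before they are compared. The sketch below shows one common normalization, occurrences per million words. All counts and corpus sizes are invented for illustration, and the function name `per_million` is hypothetical.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Invented figures (hypothetical; not real corpus results).
sections = {
    "conversation": {"hits": 300, "words": 2_000_000},
    "news": {"hits": 450, "words": 5_000_000},
}

for name, data in sections.items():
    rate = per_million(data["hits"], data["words"])
    print(f"{name}: {data['hits']} hits, {rate:.1f} per million words")
```

With these invented numbers, the section with more raw hits turns out to have the lower rate of occurrence, which is why raw counts from sections of different sizes should not be compared directly.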
In this book, we will mostly be concerned with studying the role of the context in making linguistic decisions limited to the area of syntax, including, but not limited to, word order variation. As we have seen, one important component of the situation is the text itself, which for each and every sentence determines the precise contextual conditions at the moment of its use.
In the following, we will introduce core concepts and methodological considerations for the study of discourse-based syntactic variation. We will discuss several key contrasts that lead you step-by-step to an understanding of discourse syntax and the phenomena and research questions studied in this field: (a) sentences in isolation as opposed to discourse as a whole, (b) context vs. co-text, (c) discourse type and register, (d) given as opposed to new information, and (e) the difference between two perspectives on the relationship between syntax and discourse. We’ll refer to these two approaches as the variationist vs. the text-linguistic perspective, following influential work by corpus linguist Douglas Biber (e.g., Biber 2012). We will also introduce you to methods of data gathering and presentation common in this field.
After reading this chapter, you will be able to:
identify the different properties of a discourse and of the discourse situation (co-text and context);
describe how discourse syntax is shaped by the distribution of information within the sentence;
carry out first searches in a corpus for gaining data on natural language use and make decisions on how best to visualize your findings;
develop a basic research design for each of the two main perspectives employed in the study of discourse (variationist and text-linguistic perspective).
Concepts, Constructions, and Keywords
complement clauses, context, corpora, co-text, discourse type, functional linguistics, given/new information, predictor variable, rate of occurrence, register analysis, that-omission, utterance, variationist/text-linguistic approach
2.2 Sentences vs. Utterances
So far, we’ve focused on the “syntax” part of discourse syntax. Let us now turn to the “discourse” part. Discourse might be understood either formally, as all language elements larger than the sentence, or, functionally, as any language unit serving a communicative purpose. Let us illustrate those two different meanings through an example. Imagine that somebody wants you to go and clean your room. That person might try to bring this event about by uttering a single sentence such as (7). The communicative purpose of the sentence would be quite clear. However, more often than not, a slightly longer sequence of sentences (let’s assume for a moment that people speak in sentences) will be produced, perhaps something like the sequence in (8). By the formal definition, only (8) would constitute a discourse (because it is larger than a sentence), while from a functional, communicative point of view, both examples are units of language use with a communicative purpose and would therefore both constitute a discourse.
(7) Go and clean your room.
(8) Listen! I told you a hundred times. You need to clean your room. Please do it now!
For a clear definition of discourse and its relation to syntax, the problem therefore arises that, on the one hand, discourse can comprise just a single sentence, perhaps even less (for instance, a verbless phrase like Never!). On the other hand, when we see discourse as a communicative event, it comes to involve numerous non-linguistic aspects, such as the interlocutors, their relationship, the mode of communication, the social and cultural background, and so forth. From a functional perspective, then, discourse is a very complex phenomenon to analyze. In this book, following a prominent approach in the field (based on early work by discourse linguist Deborah Schiffrin, e.g., Schiffrin 1990), we adopt a third view of discourse, focusing on sentences or other syntactic units together with their context of occurrence. We define discourse through utterances, which means that the sentence in (7) also qualifies as a (short) discourse.
Sentences from that point of view are called utterances: units of language structure bound to a given context. One and the same formal sentence easily gives rise to different utterances, in the sense that it is possible that the same sentence can be spoken, or written, or repeated, or quoted under different conditions. For instance, the same sentence I told you a hundred times can be thought of and used in many different utterances. Its ultimate communicative meaning can vary from being aggressive, ironic, or a lie, to a warning, and so on.
By the same logic, any sentence that has been used, or is likely to occur, as an utterance is bound to its discourse. For example, the sentences I’ll give you some money and I’ll give it to you, with it referring to the money, are likely to be uttered under different discourse conditions. The latter would more likely be used when money has already been talked about and the focus is now on the transfer. This is what discourse syntax is all about, viewing syntax as being discourse-driven, and it rests upon defining discourse as consisting of utterances. Think of utterances as sentences anchored in specific discourse situations.
As we saw in Chapter 1, a sentence, whether simple or complex, is generally considered the largest unit of analysis in syntax. Within the sentence, we have a hierarchical structure of constituents (for example, NPs inside of PPs, which in turn can be inside VPs), but discourse is not made of elements that are part of such a systematic structural hierarchy. Instead, what connects pieces of discourse in the first place are semantic and pragmatic relations, which are expressed only partly via language. This is also why you have probably seen the structure of a sentence represented as some kind of diagram, but the structure of discourse is not represented in this way.
Within a discourse, every sentence is an utterance within the surrounding text, also called “co-text” (discussed further below), as well as the situational context. The text excerpt in (9) comes from an article in USA Today and compares the Japanese snowboarder Ayumu Hirano to Shaun White, an American snowboarder, who, like Hirano, began his career very young, at the age of 13, and became a professional at the age of 17.
(9) Air apparent: Ayumu Hirano might challenge Shaun White
Moments before he put down an X Games run that would make him the youngest medalist ever, Ayumu Hirano crashed hard enough to crack his helmet. From 13 feet up, he hit the deck of the halfpipe before falling another 22 feet to the flat bottom. He was sore, getting the wind knocked out of him, but not concussed. At his first X Games, Hirano, then 14, could have easily walked away. The snowboarder would have plenty more chances to compete in an event he’d watched on DVD at home in Japan as a kid, his coaches told him. […]. (USA Today, December 10, Axon 2013)
A longer piece of discourse such as this one is not free of what we might consider as structure. For example, the sequence of tenses and the interpretation of pronouns depend on each other, and information is built up gradually. In the last sentence the past perfect (he’d watched) is used, because the event that it expresses (Hirano watching a DVD when he was a kid) precedes the event that this excerpt is about (Hirano winning his first medal at an internationally broadcast extreme sports event). However, the interpretation of each sentence within the excerpt does not result from structural relationships based on a limited set of functions within a larger whole, like being a subject, object, or modifier in the sentence. Still, there are meaning relations across sentence boundaries, which often connect considerable stretches of discourse, that need to be interpreted. This is what readers or listeners can do by using information from the surrounding text or the discourse situation. For example, the first sentence in (9) refers to Hirano via the pronoun he, making use of the presence of a headline. The cues for the retrieval and connection of information within a discourse are therefore given within the sentence but their motivation often lies beyond the sentence. What kinds of cues are available for constructing discourse? And how do speakers interpret them with reference to the discourse? These are the kinds of questions that linguists seek to answer in the field of discourse syntax.
Before we move on to discuss in greater detail the properties of discourse and the discourse situation, you should go to Exercise 2 to apply the distinction between sentences and utterances.
While sentence structure follows rules that substantially limit the number of patterns that are grammatical, variation in the actual use of grammar in discourse is the rule rather than the exception. When speakers are asked to tell the same kind of story, for instance, a picture story, a cartoon, or a film, in all likelihood no two speakers choose exactly the same wording. Try it for yourself: How would you narrate what you see in the simple cartoon in Figure 2.2? Ask a friend or a classmate the same question. Do they pick the same words and sentences?
Figure 2.2 Bunny cartoon
Apart from many individual preferences that go into a person’s re-telling, there are also systematic reasons for presenting roughly the same content in one way or another. These principles are generally referred to as strategies for “packaging” information within the sentence. To explore this kind of variation in detail over the chapters to come, we need to keep the grammatical form of a sentence apart from its meaning. Strictly speaking, we only deal with a case of true syntactic variation when the basic meaning of two alternative grammatical realizations remains the same. As introduced in Chapter 1, this basic meaning of a sentence is called its proposition, referring to the truth value carried by the sentence. For example, the sentence I gave you the money is true under exactly the same conditions as those for the sentence I gave the money to you. The proposition mainly results from the relation of the verb to its core arguments (I, for example, being the agent of the act of giving, you being the recipient), which is here not affected by the change of sequence of the direct and the indirect object.
The separation of sentential meaning in the sense of the truth value and of the many other meanings that are also expressed by packaging information into grammar (attitudes, emotions, indirect intentions) is relevant for the study of discourse syntax because we are interested in options here that do not change the basic meaning of the sentence. Choosing one option is not motivated by a change in propositional meaning, but by the discourse situation. In that sense, looking at grammar in discourse means analyzing situated syntactic decisions. But what exactly determines the discourse situation of a sentence? There are several levels to consider here. First, there is the concrete surrounding, that is, preceding and subsequent, text, which is commonly referred to as the co-text. Second, there is also a wider kind of discourse context, namely the nature of the overall discourse (or, possibly, of just a discourse segment). This is the level of the type of discourse, which involves, for example, the choice of a certain rhetorical mode (e.g., narration or argumentation) or of a given genre (e.g., a blog entry, movie review, essay, and the like). More on the interaction of grammar and genres will be discussed in Chapter 9. In the broadest sense of the conditions that surround sentences as utterances, any discourse is placed within a given situation, commonly referred to as the proper context. This context is, for example, governed by the channel or medium, the relationship between the speaker and the hearer, a private or institutional setting, a specific topic or purpose of the discourse, the formality of the situation, and the like.
Situational varieties and their patterns of language use have their own area of research, which is the study of registers. A register analysis investigates linguistic features in functional varieties of language use. It takes place at different levels of specificity for classifying discourse, ranging from widely studied, rather general varieties, such as conversation, academic writing, or fiction, to specific sub-registers like article introductions or office hour consultations (e.g., Biber & Conrad 2019). Throughout this book, our discussion will highlight that register variation also plays a substantial role in the study of discourse syntax. Chapter 9 in particular examines the role of register and genre. However, you should note that, in this book, situational varieties are not the object of interest per se; instead, we are dealing with the use of grammar as governed by the discourse as a whole, that is, on all levels discussed above. This means that the focus of our analysis lies on the context-bound nature of grammatical constructions, not so much on specific situations, such as the situation you find yourself in right now, as a student who is reading this text (see Figure 2.3).
Figure 2.3 Three levels of the discourse situation for sentences as utterances
Throughout this book you will see how elements of grammar or certain patterns of sentences depend on these different levels of the discourse situation, that is, how sentences become utterances. Often, the starting point for this discussion is the observation that a sentence follows a non-canonical syntactic pattern. As introduced in Chapter 1, a non-canonical clause is a clause that can be derived from a more basic pattern, that is, from the sentence that we would expect based on our knowledge of the general rules of English grammar. A deviation from the canonical pattern is usually motivated by reasons that lie in the discourse. For example, in English the object follows the verb, it does not precede it, so that a transitive verb will be followed by an object. Now, a sentence in which the object comes before the subject (as is the case in (2), Lentils you would never soak) will have to be motivated. The same holds for the passive construction (Lentils are never soaked), which enables you to drop the agent of an event and turns the former object into the subject. Constructions like these result from a decision about which of the verb’s arguments should come first and in which order and how densely you want the informational content to be packaged. Looking at sentences as utterances therefore means looking at the reasons for their variant forms in the surrounding discourse, and not just in the rules of grammar.
2.3 Discourse Syntax and the Functional Tradition in Linguistics
The interest in syntactic usage dates back quite far in the history of linguistics. It came up with an early movement of what is nowadays called “functionalism” in linguistics, which is commonly seen in opposition to form-based approaches to syntax. The origins lie in the work of some linguists working in Prague in the 1920s, who referred to the packaging of information in the sentence as the “functional sentence perspective” (Newmeyer 2001). With that, the Prague School linguists were, in a way, syntacticians with an early interest in utterances. For example, they were already discussing that given and new material in the sentence is not distributed randomly, but that information which is familiar to the addressee (which they termed “thematic”) tends to come first, while new information (referred to as “rhematic”) is presented later in the sentence (a thorough discussion of the distribution of given and new information follows in Chapters 3 to 5). This characteristic sequence of information was found to apply to many languages, especially those with a flexible word order system, such as Russian or Czech.
The functional view of syntax emphasizes that syntactic patterns are not arbitrary because they are a means of communication. We saw above that syntax is not self-contained, but substantially influenced by how it is situated in a discourse situation. If syntax responds to and is shaped by the surrounding discourse, we will expect that it is not rigid, but will allow for some kind of variation, even in a language like English. If it were random, linguistic variation would indeed not be economical. Let’s turn to an example for illustration: Examples (10) and (11) have similar propositions (something is located between x and y), expressed by two different sequential arrangements of sentence elements. In (10), where speaker b’s statement is a canonical English sentence, the location follows the grammatical subject, while in (11) its position is in front of the subject:
(10)
a: Where is Sherman, Connecticut? b: Sherman’s about halfway between Long Island Sound and Massachusetts, right on the state line that we share with New York. (COCA, Spoken, 1999)
(11) In the heart of the South Pacific Ocean, just about halfway between South America and Australia, lies a very small island called Hikueru. (COCA, Fiction, 2010)
According to a functional perspective on syntax, the grammatical subject Sherman in (10) is positioned early in the sentence because it has been introduced in the co-text, that is, in the preceding interrogative (Where is Sherman, … ?) and is therefore given information. By contrast, in the sentence in (11), which is the very first sentence of a fictional narrative, there is no previous information on the story available. With this lack of co-text, the sentence sets the scene for the discourse: The reader gets to know that the story will take place on a very small island called Hikueru. In this position in the discourse, the grammatical subject (a very small island) carries new information, which is why the noun phrase is moved to the end of the sentence. We could also say that this sequence is functional because it corresponds to how readers will “see” the scene in their minds.
Good to Know: The Functional Tradition in Theories of Grammar
The distribution of information and the discourse function of sentences are the focus in various syntactic theories with a usage-related component, adding interesting aspects to the discussion of discourse syntax. For example, the theory of Systemic Functional Grammar (e.g., Fontaine 2013) emphasizes the importance of the social function of language, adding to the level of the proposition (called “experiential” meaning) and the co-text (“textual” meaning) a third level that highlights the so-called “interpersonal” meaning of an utterance, that is, the relation between the speaker and the addressee. The framework of Functional Discourse Grammar (e.g., Keizer 2015) also emphasizes the relevance of interpersonal relations but, in addition, contributes the idea that the social functions of discourse are expressed in specific components, in what is called “discourse acts” and “moves.” For example, adverbials in initial position regularly occur within the move of opening a discourse, like in (12), or they continue a move, like in (13). By contrast, the same adverb will simply relate to the semantic level, that is, be a part of the proposition, when in non-initial position, as in (14).
(12) When he was warm and twenty-something, the Grey Star had been a garden of delights. Then responsibility fell on his shoulders. (COCA, Fiction, 2003)
(13) At least Blake’s predecessors, the flamboyant Schnellenberger and the colorless Gary Gibbs, brought some solid credentials to the table. Then again, if high qualifications are any barometer of future success, it makes all the sense in the world that Oklahoma decided on Blake. (COCA, Magazine, 1996)
(14) […] he became fascinated with death. He then became a medical doctor. (COCA, Spoken, 2015)
Cognitive Grammar argues that syntactic choices are often grounded in the structure of human perception. A cognitive view highlights that different sentence patterns direct the attention to different elements within the sentence (technically, this process is called “profiling”). For example, in the sentence in (15) the event, a trolley colliding with a car, is expressed as a change that also affects the trolley. In one reading of the sentence, the event could be that Herman Strodmann is driving the BMW; in another, he could be the initiator, acting upon the trolley (he hit the brakes and smashed the trolley into the BMW):
(15) Herman Strodmann hit the brakes just as the trolley smashed into the BMW and rode up over it. (COCA, Fiction, 2015).
Despite this ambiguity, the sentence in (15) is made possible by the argument structure of the verb smash, which allows the profiling of the theme as an instrument.
Having discussed that a functionalist view on syntax has quite a long and varied tradition, we now turn to reporting actual research in this area, looking at two fundamentally different approaches that integrate discourse into the analysis, and to the basic procedures for gathering usage-based data from a corpus.
We saw above that the distinction between “given” and “new” information can lead to different syntactic patterns. We will talk more about this in Chapter 3. For now, you should be able to handle Exercises 4 and 5, which build on that contrast.
2.4 Gathering Data for the Study of Discourse Syntax
As the discussion so far has highlighted, evidence in the field of discourse syntax must be based on the actual use of sentences. One way of gathering usage-based data is to pull sentence attestations from linguistic corpora. Other methods of collecting data include conducting surveys or eliciting data experimentally, but we will focus on corpus data here. In principle, any collection of utterances from real language use is called a corpus (from the Latin word for “body”) in linguistics. However, these days the term usually refers to a searchable electronic database of texts. Corpora (the Latin plural of the word “corpus”) in this sense are collections of written texts and/or transcripts of spoken language in digital form, which can usually be searched by way of a specific interface. They can be collections of full texts, but more often they comprise text excerpts. Usually, corpora are pre-analyzed in some way. For example, in most linguistic corpora words are tagged as belonging to a specific part-of-speech category by running them through specific tagging software. A large and well-known group of corpora for English, which we also use for illustration throughout the book, is available at the online interface English-corpora.org (Davies 2004–). The website provides free access, for example, to the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), or the Corpus of Historical American English (COHA). In this book, we mainly use examples from COCA (and sometimes from other corpora from that interface), but note that new corpora are constantly becoming available (for example, in 2020, when the world started fighting the novel coronavirus, a new corpus consisting of texts about the virus and COVID-19 was added to the corpus site at English-corpora.org).
Using a corpus or corpus interface, you can type in an individual word or a sequence of words, such as the highlighted sequence think that, illustrated in the screenshot in Figure 2.4. You can also expand each attestation to see the surrounding co-text. The interface will also tell you which year and register the example comes from (in Figure 2.4, all attestations are from blogs). As you can see right away, it is much easier to search for something that is present (the verb think followed by a clause beginning with the complementizer that) than for something that is absent (the verb think followed by a clause in which that has been omitted).
Figure 2.4 COCA screenshot highlighting search item
Most corpora also allow you to search for syntactic categories (“parts of speech”, or POS, search), as well as specific lexical items. For example, you might be interested in checking the occurrence of that after any verb, not just after think. The POS search is also helpful in the case of homonyms, of which English has a lot. For example, you might wish to find out how often and in which registers impact is used as a verb (some people still argue that it cannot be used in that way – they would be surprised to see how often it is). A corpus like COCA allows you to add a part-of-speech tag to a search term, giving you the ability to look at occurrences of a word only when it is used as that part of speech. Instead of looking for impact, which is what led to the result list shown in Figure 2.5, you’d be searching for <impact_v*>, or, to include all forms of the verb impact (e.g., impacts, impacted, impacting) for <IMPACT_v*>. If this all seems technical and complicated, don’t worry. Web-based corpora usually provide documentation on how to use them and there are also videos available on YouTube. For working with part-of-speech tags, see the toolbox at the end of this section.
Figure 2.5 COCA screenshot for the word impact (noun or verb)
In addition to giving you attestations, corpora also often provide you with quantitative data and charts. The image in Figure 2.6, for example, includes frequency information – both in raw numbers (second column) and in occurrences per 1 million words (fourth column and bar chart). You can see that impact as a verb occurs most frequently in web-based registers and in academic writing and that it is becoming more and more frequent.
Figure 2.6 Annotated COCA screenshot for impact as a verb
However, you should always go through your attestations to see if there are any false positives. Usually some form of data cleaning is required. For example, some false hits found for the verb smash in COCA include a reference to the NBC series “Smash” or the use of the expression SMASH! in fiction. After deleting such cases from your set of attestations, the attested sentences from the corpus are ready for an analysis as attested utterances. For example, they could now be analyzed in terms of the distribution of information within them, something we did with the corpus attestations (10) and (11); as being specific discourse acts or moves, as we did with the sentences (12) to (14); or for the grammatical subject co-occurring with smash, which is what we looked at when discussing the use of smash in (15). Also problematic are mistakes in a corpus (which you will see in a few examples in later chapters). There may be spelling errors (for example, if someone misspells the word occasion as occassion, it won’t come up in a search for occasion) and disfluencies, which can often be observed in corpora from spoken language. For example, a speaker might self-correct the choice of verb, resulting in a sentence like He gave sent me the money, which, in isolation, everybody would rate as ungrammatical. (You may be familiar with marking an ungrammatical sentence with a preceding asterisk and a semantically not well-formed sentence with a preceding question mark.) More information on the searches for specific constructions in a corpus and on analyzing the corresponding results will be provided in Section 2.5 and also later in the book.
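The cleaning step described above can be partly supported by a simple filter. The following is only a minimal sketch in Python: the four hits are invented for illustration (they are not actual corpus attestations), and the two patterns are assumptions modeled on the false positives mentioned above. A filter of this kind does not replace reading through your attestations yourself.

```python
import re

# Hypothetical search hits for the verb "smash". These strings are invented
# for illustration and are not real corpus attestations.
hits = [
    "The trolley smashed into the parked car.",
    "She watched the new episode of \"Smash\" on NBC.",
    "SMASH! The window gave way.",
    "He smashes every record he goes after.",
]

# Assumed patterns for likely false positives: the series title in
# quotation marks and the all-caps sound effect.
FALSE_POSITIVE_PATTERNS = [
    re.compile(r'"Smash"'),
    re.compile(r"\bSMASH!"),
]

def is_false_positive(sentence):
    """Return True if the sentence matches any false-positive pattern."""
    return any(p.search(sentence) for p in FALSE_POSITIVE_PATTERNS)

# Keep only the hits that do not look like false positives.
cleaned = [s for s in hits if not is_false_positive(s)]
for s in cleaned:
    print(s)
```

The remaining sentences still need a manual check, since a pattern list like this one only catches the false positives you already know about.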
Part-of-speech Tagging in a Corpus
One major difference between data from linguistic corpora and data that you might compile yourself by just amassing text is that linguistic corpora are pre-analyzed. At a minimum, corpora are annotated for part-of-speech information by software referred to as a tagger. (The error rate is estimated at 1–3 percent.) There are different tagger applications and they all use different labels, but what they have in common is that they go well beyond the nine or twelve major part-of-speech classes you may be familiar with from an introductory syntax class (verbs, nouns, adjectives, modals, conjunctions, etc.). For example, the CLAWS tagger (“Constituent Likelihood Automatic Word-Tagging System”), versions of which were used to tag the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA), has a set of about 60 different tags, which can be quite specific. The tagset used for COCA includes, for example, tags for singular common nouns (NN1), plural common nouns (NN2), singular proper nouns (NP1), singular weekday nouns (NPD1), and singular locative nouns (NNL1). There are tags for subordinating conjunctions in general (CS), as well as for particular conjunctions (CST for that, CSW for whether), and tags for different morphological forms of verbs, including VVG for the -ing participle, VVD for past tense verbs and VVN for the past participle. The tag label is not always intuitive (for example, the tag for not and n’t is XX), which is why it is important that you look up the tag list for the corpus that you are working with. Some corpora, such as the current version of COCA or the BNC, will also allow you to select tags from a drop-down menu, which means that you don’t have to memorize any tags, as illustrated below:
Lancaster University, where the CLAWS tagger was developed, provides free access to a web-based version of the tagger, which you can use for small portions of text (https://ucrel.lancs.ac.uk). A sentence like [The bunny dreamt of eating the carrot] tagged by CLAWS comes out as [The_AT0 bunny_NN1 dreamt_VVD of_PRF eating_VVG the_AT0 carrot_NN1 ._PUN]. For larger portions of text, you would need a license – or you can rely on an existing corpus.
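Once a text has been tagged in this word_TAG layout, the tags can be used to pull out particular word classes. The short Python sketch below assumes a tagged string in that layout, with carrot labeled as a singular common noun (NN1); the function name parse_tagged is made up for this illustration and is not part of CLAWS or of any corpus interface.

```python
# A tagged sentence in word_TAG format (sample string for illustration).
tagged = "The_AT0 bunny_NN1 dreamt_VVD of_PRF eating_VVG the_AT0 carrot_NN1 ._PUN"

def parse_tagged(text):
    """Split a word_TAG string into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split at the last underscore so that the tag is always the final part.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)

# Lexical verb tags in the examples above (VVD, VVG, VVN) all begin with "VV".
verbs = [word for word, tag in pairs if tag.startswith("VV")]
print(verbs)
```

Filtering on the beginning of a tag is a rough counterpart to the wildcard in a search like IMPACT_v*: it retrieves every form whose tag starts with the given letters, whatever the specific verb form is.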
2.5 Two Approaches for Studying the Relationship of Syntax and Discourse
In the following section, we will explore two common approaches to examining the interaction of grammar and discourse. The main difference lies in the research goal and object of analysis. If one is mainly interested in syntax and in why one way of expressing things is chosen over another, the object of analysis will be a specific construction. For example, one might look at the discourse to figure out why speakers sometimes choose the get-passive over the be-passive in English. We will call this the variationist approach. Alternatively, one might be mostly interested in the syntactic properties of a specific discourse type or register. For example, one might ask how web-based news writing is different from traditional news writing. We will call this the text-linguistic approach. There are other labels one could use, but for the sake of consistency, we will stick with the terms variationist and text-linguistic in this book.
Depending on the research goal that is pursued, the research design and data analysis will be different. Let’s illustrate this with an example. For a construction as illustrated in (16), you will find that the frequency of the construction varies a lot by discourse type.
(16) “This guy he maybe come back and run this ranch?” said Madelaine. (COCA, Fiction, 2000)
Left-dislocation, the construction illustrated in (16), is a phenomenon that mostly occurs in spoken discourse. This observation is a result of looking at the occurrence of left-dislocation in different contexts – employing a text-linguistic perspective. To see this, we have to compare the use of the same construction in different contexts. By contrast, other phenomena of word order are better understood by comparing their occurrences to those of a competing variant. For instance, the placement of the indirect object directly after a ditransitive verb is more plausibly studied by collecting data that includes both variants of the placement of the object, that is, comparing, for example, give somebody money, as in (17a), with the alternative placement, giving money to somebody, as shown by (17b).
a. Why are you giving this guy money?
b. Why are you giving money to this guy who subsequently is arrested for killing your husband? (COCA, Spoken, 2002)
In all likelihood, the discourse will play a role in determining which placement is chosen. In (17b), the placement of this guy underlines the relevance of the indirect object for the following discourse. Indirect object movement is therefore likely to occur when the indirect object has greater importance in the discourse. To test this assumption in a project requires a research design in which the discourse is not the object of investigation, but potentially predicts a certain pattern of variation.
To reiterate: A project in which discourse has the role of predicting syntactic variation, but is not the proper object of the analysis itself, is a study of syntactic variation, also called a variationist research design. In research of this kind, the ultimate objects of investigation are two (or possibly more) syntactic variants: active and passive voice, get-passive and be-passive, subordinate clauses with and without that, the of-genitive and the ’s-genitive, because followed by a clause and because followed by just a noun (there is more on this particular construction in Chapter 9). On the other hand, if the usage of grammar is studied as a characteristic set of features of a particular type of discourse, the primary objects of investigation are texts and a text-linguistic research design is applied.
Let us turn to a simple example to see how the research procedure will be different. It is well known that speakers of English often omit the complementizer that in informal speech, but tend to maintain it in more formal writing, such as in academic writing. Examples (18) and (19) show the two variants:
(18) […] he argued Park was not strong enough. (COCA, Spoken, 2014)
(19) Ifenthaler, Eseryel, and Ge (2012) argued that learning in the twenty-first century must challenge students to become innovative […]. (COCA, Academic, 2015)
To test our assumption that there is an effect of the medium (written vs. spoken language) and the formality of the discourse on the omission of the complementizer, we need of course more usage-based data. In this case, a variationist study (since our object of investigation is a particular syntactic construction, along with its competitor), our data set should consist of sentences from both formal and more informal contexts. We want to figure out in which kind of discourse speakers or writers are more likely to omit the complementizer that and in which kinds they are likely to keep it. We thus need corpus-based findings from at least two registers that differ in formality. Data like this can come from a corpus that has data from different registers or, for example, from a usage-based reference grammar, like the GSWE (Biber et al. 2021). One such set of results, based on a sample of 3,000 that/Ø-clauses from the GSWE corpus, is shown in Table 2.1.
Table 2.1 Data on complementizer omission in three types of discourse
| | Conversation | News discourse | Academic discourse |
|---|---|---|---|
| verb + that | 141 (14 %) | 733 (73 %) | 940 (94 %) |
| verb + Ø [no complementizer] | 859 (86 %) | 267 (27 %) | 60 (6 %) |
| Total | 1,000 (100 %) | 1,000 (100 %) | 1,000 (100 %) |
The data in Table 2.1 documents that the formality of the register is indeed a reasonable predictor for the omission of that in different kinds of discourse: Only 14 percent of the complement clauses in conversation contain a complementizer, compared to 73 percent in news writing and 94 percent in academic discourse. Since we are interested in a syntactic choice, that is, the variation of that vs. that-omission, we should illustrate the outcome as proportional frequencies. Proportions are already indicated as percentages in the table; the corresponding chart should visualize the preference for one variant over the other, with the difference in register being an independent variable, or the predictor. In this kind of chart (also known as a stacked column chart), the proportions are best given within a single column for each register, which is what you see in Figure 2.7. In a proportional frequency chart, the numbers always add up to 100 percent. The same kind of information could also be presented in three pie charts (data in a pie chart always adds up to 100 percent, which is characteristic of visual representations of insights from a variationist study).
Figure 2.7 Proportions of complementizer omission in three registers of English
Figure 2.7, with the three registers as independent variable on the horizontal (or x-) axis and the proportions of the two variants shown on the vertical (or y-) axis, illustrates the preference for the omission of a complementizer in conversation, and the much higher likelihood of its retention in the context of academic and news discourse.
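The proportions behind a chart of this kind are easy to recompute from the raw counts. The following Python sketch uses the counts from Table 2.1; the dictionary layout, the label zero for the variant without a complementizer, and the function name proportions are choices made here purely for illustration.

```python
# Raw counts from Table 2.1: clauses with "that" vs. without a complementizer.
counts = {
    "Conversation": {"that": 141, "zero": 859},
    "News discourse": {"that": 733, "zero": 267},
    "Academic discourse": {"that": 940, "zero": 60},
}

def proportions(variant_counts):
    """Return each variant's share of the register total, in percent."""
    total = sum(variant_counts.values())
    return {variant: 100 * n / total for variant, n in variant_counts.items()}

for register, variant_counts in counts.items():
    shares = proportions(variant_counts)
    print(f"{register}: that = {shares['that']:.1f} %, zero = {shares['zero']:.1f} %")
```

Because each register is divided by its own total, the two shares in a column always sum to 100 percent, no matter how many clauses were sampled for that register.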
With the registers being the predictor, the data in Table 2.1 and the chart in Figure 2.7 constitute a case of syntactic variation. What this research design does not look at is the actual pattern of usage in discourse, that is, the frequency of the variants in each kind of context. The reason for this is that the data set that is used is not one that can tell us in which register that-clauses, or that-less clauses, are overall more frequent. For example, based on other sources, but counter to what Table 2.1 suggests at first sight, the mere occurrence of complement clauses retaining that is more than twice as high in newspapers as it is in academic prose: 3,440 compared to 1,260 occurrences per million words in the corpus on which the GSWE is based (Biber 2012: 15). This difference is due to the overall higher frequency of complement clauses in the news register. Table 2.1 and Figure 2.7, following a variationist research design, therefore only highlight that once a complement clause occurs, the preference for that-retention will be stronger in academic discourse than in news. It would be wrong to conclude from Table 2.1 that academic discourse has more complement clauses introduced by that than the other registers.
The other approach to studying an interaction of syntactic form and discourse is to treat it as a case of text-linguistic variation. This type of research on syntactic usage has different varieties of discourse as its primary focus of interest and looks at the density, that is, frequency, of occurrence of certain grammatical features. The frequencies must be calculated as rates of occurrence, whereby the frequency of a given grammatical feature is determined relative to the length of a text or the size of a corpus. This process, known as normalization (see the Toolbox below), allows us to compare frequency rates rather than raw or proportional frequencies.
Calculating Normalized Rates of Occurrence
Text samples or sub-corpora often vary considerably in length. COCA, for example, is quite balanced at the highest level of register and contains roughly the same number of words from spoken language, fiction, newspaper, magazine, and academic writing, but within these registers the composition of sub-corpora may vary considerably. For example, in 2020, the sub-corpus for local news was about three times as large as the one for editorial writing. In order to compare frequencies across different registers, as one would do in a text-linguistic study, we have to “normalize” them, which means that we compute them at a rate per a certain number of words. In a large corpus, we often note the occurrence rate per one million words.
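The general formula for this computation is the following:
normalized rate = (raw frequency ÷ number of words in the corpus) × base of normalization (for example, 1,000,000)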
Example: If we look at the frequency of the conjunctive adverb however in different sub-corpora of news writing based on COCA, we find that there are 2,169 tokens in a sub-corpus of editorial writing (4.8 million words) and 3,498 tokens in a sub-corpus of local news (13.8 million words). In order to compare those numbers, we have to normalize them. Using the formula above, we calculate the rates of occurrence per 1 million words for both sub-corpora:
Editorial writing: (2,169 ÷ 4,800,000) × 1,000,000 ≈ 452 occurrences per 1 million words
Local news: (3,498 ÷ 13,800,000) × 1,000,000 ≈ 253 occurrences per 1 million words
We can now easily see that, relatively speaking, however occurs more frequently in editorial writing than in local news (which will not come as a surprise, considering that however is used to express a writer’s stance).
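With a smaller corpus size, it is common to normalize to a rate per one hundred or per one thousand words, accordingly:
rate per 1,000 words = (raw frequency ÷ number of words in the corpus) × 1,000
rate per 100 words = (raw frequency ÷ number of words in the corpus) × 100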
Note that you should never inflate the size of your corpus in calculating the frequency rate. If your corpus only has 1,000 words and your target construction occurs twice, it would be appropriate to say that the frequency rate is 2 per 1,000, but not that it is 2,000 per 1 million words, even though mathematically both express the same ratio.
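The normalization steps described in this Toolbox can be sketched in a few lines of Python. This is only a minimal illustration and not part of COCA or any corpus tool: the function name `normalized_rate` and its parameters are hypothetical, and the token counts and sub-corpus sizes are the ones quoted above for however.

```python
def normalized_rate(raw_count, corpus_size, per=1_000_000):
    """Return occurrences per `per` words (hypothetical helper for illustration)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * per


# Token counts and sub-corpus sizes for "however" as quoted in the text.
editorial = normalized_rate(2_169, 4_800_000)    # editorial writing
local_news = normalized_rate(3_498, 13_800_000)  # local news

print(f"editorial:  {editorial:.1f} per million words")
print(f"local news: {local_news:.1f} per million words")

# For a small corpus, choose a base no larger than the corpus itself,
# e.g. 2 hits in 1,000 words are reported per 1,000 words.
small = normalized_rate(2, 1_000, per=1_000)
print(f"small corpus: {small:.0f} per 1,000 words")
```

The `per` argument makes the base of normalization an explicit choice, which is a reminder to pick a base that does not exceed the actual size of the corpus.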
Tip: If you use COCA, the “chart” view of your search results will compare the normalized rate (occurrences per 1 million words) for you, as you can see in the third column of the screenshot in Figure 2.8.
Figure 2.8 COCA screenshot of chart view
Text-linguistic research is equally interested in the connection of syntax and discourse, and rates of occurrence are good evidence of the close connection between discourse and syntax as well. However, the focus is not on variation within the grammar, but on variation among different types of texts. Text-linguistic research examines which constructions are typical of which type of discourse and thus has properties of discourse as its ultimate object of investigation. For example, comparing conversation and academic prose, the GSWE describes higher rates of occurrence for passives or some other non-canonical constructions in academic texts while, for example, noun phrase tags (They are alright, the kids?) or wh-clefts (What I need is some good news) are typical features of conversations (Biber et al. 2021: 948).
The text-linguistic approach is also generally applied in studies of register. It is based on comparing the relative frequency of a feature in different types of discourse, with the aim of finding out in which register a feature is more “pervasive,” in the sense that the feature under investigation occurs with some frequency throughout the text (following Biber & Conrad 2019). According to this approach, the texts are therefore the primary objects of our observations, which, with regard to most features, differ in typical distributions of an individual feature. Such an approach is necessary for anyone who works with corpora that are built from text excerpts. Only pervasive features are likely to be represented in text excerpts. For example, a salutation like Dear Ms. Franklin may be highly characteristic of letters, but it is not a pervasive feature of letters. In fact, it typically occurs only once per letter. A corpus constructed of letter excerpts is not necessarily likely to include many salutations. By contrast, second-person pronouns (you, your) will occur throughout a letter and may constitute a pervasive feature. However, while studying a register means looking at all features with a characteristic distribution, the focus of discourse syntax is somewhat more limited, being only about the occurrence of the syntactic phenomena under investigation. Nonetheless, findings about the textual variation of these phenomena contribute to the analysis of registers and genres (see Chapter 9 for more details on register-related effects).
Returning to the case of complementizer omission, there are also detailed results on the register distribution of that-clauses, based again on data from the GSWE corpus. The frequencies presented in Table 2.2 show, on the one hand, that complement clauses are overall more pervasive in news discourse and in conversation than in scholarly writing. On the other hand, you will note that there is a different picture for the respective occurrences of verb + that-clauses and verb + Ø-clauses in the three registers: for verb + that-clauses, conversation and academic discourse are closer, and for verb + Ø-clauses, news and academic texts are more similar. A corresponding chart must therefore highlight the difference among registers in total, but also separately for each construction, which is what you see in Figure 2.9.
Table 2.2 Rates of occurrence* of verb + that-clauses in three types of discourse
| | Conversation | News discourse | Academic discourse |
|---|---|---|---|
| verb + that | 890 | 3,440 | 1,260 |
| verb + Ø | 5,400 | 1,250 | 80 |
| Total | 6,290 | 4,690 | 1,340 |
* Per 1 million words. Data is from Biber 2012.
The chart in Figure 2.9 – a comparative column chart – visualizes a case of text-linguistic variation, regarding both the occurrence of the feature of that-clauses in general (the columns in dark grey), and of that-retention and that-omission, in particular (columns in black and light grey). Each register thus constitutes a context which we analyze for the occurrence of complement clauses. With regard to the differences among texts, a striking difference is that, while complement clauses are overall most pervasive in conversation, complement clauses with that-retention are most pervasive in news.
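The two perspectives can be derived from the same figures. The following Python sketch takes the rates from Table 2.2 (per 1 million words) and computes both the total rate of complement clauses per register (the text-linguistic view) and the share of that-retention within each register (the variationist view). The dictionary layout and the variable names are assumptions made for illustration, and the totals are recomputed from the two rows rather than copied from the table.

```python
# Rates per 1 million words, taken from Table 2.2.
rates = {
    "Conversation": {"that": 890, "zero": 5_400},
    "News discourse": {"that": 3_440, "zero": 1_250},
    "Academic discourse": {"that": 1_260, "zero": 80},
}

summary = {}
for register, r in rates.items():
    total = r["that"] + r["zero"]         # text-linguistic view: overall rate
    share_that = r["that"] / total * 100  # variationist view: % that-retention
    summary[register] = (total, share_that)
    print(f"{register:<20} total {total:>5} per million, "
          f"that retained in {share_that:.0f}%")

# Register in which each measure peaks.
most_clauses = max(summary, key=lambda reg: summary[reg][0])
most_that = max(rates, key=lambda reg: rates[reg]["that"])
highest_share = max(summary, key=lambda reg: summary[reg][1])
print(most_clauses, most_that, highest_share, sep=" | ")
```

The rounded shares reproduce the percentages cited for Table 2.1 (14, 73, and 94 percent), while the rates show that the highest share of that-retention, in academic discourse, does not coincide with the highest rate of verb + that-clauses, which is found in news discourse.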
Taking both approaches together, the patterns that we found tell us something both about the construction, as predicted by the discourse context, and about the three kinds of discourse as an object of investigation by itself. Now that we have isolated patterns, the next task will be to find explanations for the patterns that we observed. These will again refer to properties both of the construction and of the discourse. For example, a relevant property of conversation is that the sentences are generally shorter. Therefore, we may not expect many complement clauses in the first place. On the other hand, we know that in conversations people often talk about their and other people’s opinions, which means that we have a relatively high number of verbs like think, say, and believe, which are all typically followed by a complement clause. These verbs make the entire construction with a complement clause overall more predictable and that-omission in conversation therefore more likely. In news discourse and academic writing, complement clauses tend to occur together with longer sentences in an environment that does not favor shortness, a reasonable explanation for why the feature of complementizer retention is more pervasive than complementizer omission in most written discourse (for concrete numbers, see the GSWE, Biber et al. 2021: 673–4). Additionally, retaining that helps the reader parse the structure of long and complex sentences.
You may wonder if all discourse syntax phenomena can be studied from a variationist as well as a text-linguistic perspective. The answer is “sort of.” Of the phenomena of discourse syntax that you will learn more about in this book, some constructions, such as cleft sentences, or pronouns and ellipsis, are typically explored by a study of text-linguistic variation, while others, for instance particle placement or indirect object movement, lend themselves more readily to a variationist approach. Throughout this book, we will ask you repeatedly to be cognizant of both types of research design.
This is a good point to check out Exercise 6 as well as the Level 2 Exercises 7 and 8, which focus on research design and the difference between the two approaches we have just discussed.
2.6 Summary
In this chapter we have emphasized that sentences seen as utterances are discourse-driven in many ways. They are shaped by the discourse situation and package their information in accordance with what is given or new at the moment of the utterance. We have highlighted the difference between the surrounding co-text and the more comprehensive context of an utterance. Furthermore, we have been able to see that speakers and writers make their syntactic choices in accordance with the discourse type and the register.
With syntactic choices being rooted in the context of use, we took a brief look at the origins of the interest in usage-based, functional linguistics and at corpora as an important source for gathering usage-based data. (You will occasionally read about other sources of evidence, such as experiments, in other chapters in this book.) We then looked in some detail at two different research perspectives for the study of grammar and discourse. We first looked at the procedure for the study of syntactic variation, where the objects of analysis are variants of sentence structure, such as complement clauses with and without that. We then compared this research design to text-linguistic research, where texts and their grammatical properties are the primary object of investigation. We also dealt with the analysis and the presentation of data, which must be in accordance with the research question.
Together with Chapter 1, this chapter forms the foundation for much of what we present in the remainder of the book. Part II of this book (Chapters 3 to 5) deals with grammar in discourse, looking at how different sentence patterns are realized in specific discourse situations. The chapters in this part move from left to right in their organization: from non-canonical beginnings (such as topicalization) to variation in the core clause (for instance, passivization) to complex sentence endings (for example, a cleft construction). In Part III of this book (Chapters 6 to 8), you will learn about the elements of the grammar of discourse, which means grammatical devices that turn sentences into discourse, such as connectives or pronouns. The final chapter (Chapter 9) moves beyond the local discourse and discusses syntactic phenomena as conditioned by genre.
2.7 Exercises
Level 1: Classification and Application
1. At the beginning of this chapter, we asked you to rearrange sentences so that they are in a sequence informed by your intuitive knowledge about elements of discourse syntax. Below is the answer to that question.
Let’s do the same activity again, and this time, let’s pay even closer attention to the linguistic means that you rely on to make your decision, including the use of pronouns, determiners, and the flow from given to new information. The following sentences (again, arbitrarily numbered) are taken from the beginning of a Washington Post article on the funeral of civil rights leader John Lewis in July 2020.
1. In it, he challenged the next generation to lay “down the heavy burdens of hate at last.”
2. His words came as the country has been roiled by weeks of protests demanding a reckoning with institutionalized racism – and hours after President Donald Trump suggested delaying the November election, something he doesn’t have the authority to do.
3. Hailed as a “founding father” of a fairer, better United States, John Lewis was eulogized Thursday by three former presidents and others who urged Americans to continue the work of the civil rights icon in fighting injustice during a moment of racial reckoning.
4. The longtime member of Congress even issued his own call to action – in an essay written in his final days that he asked be published in The New York Times on the day of his funeral.
5. The nation’s first Black president used the moment to issue a stark warning that the voting rights and equal opportunity Lewis championed were threatened by those “doing their darnedest to discourage people from voting” and to call for a renewal of the Voting Rights Act.
6. Former President Barack Obama called Lewis “a man of pure joy and unbreakable perseverance” during a fiery eulogy that was both deeply personal and political.
7. After nearly a week of observances that took Lewis’ body from his birthplace in Alabama to the nation’s capital to his final resting place in Atlanta, mourners in face masks to guard against the coronavirus spread out across pews Thursday at the city’s landmark Ebenezer Baptist Church, once pastored by the Rev. Martin Luther King Jr.
2. In Section 2.2 we discussed the distinction between sentences and utterances. Consider the sentence I told you this before and imagine how it can be used as two (or more) different utterances. How do you imagine the discourse to which it belongs, and what might be the communicative purpose of the sentence in each case?
3. In Section 2.3 we highlighted that co-text, discourse type, and situational context are the three components of the discourse situation of an utterance. Discuss how the two utterances in italics shown below with some of their context are influenced by these factors: What is there in the co-text and the situational context that turns the same sentence into two very different utterances?
(20) Interviewer (CBS news): Would you say that the President is wrong to release that memo? Senator: I would definitely say Mister President, please don’t do that, that is wrong, it is absolutely wrong. (COCA, Spoken, 2018)
(21) When reading this article, some developers will think that it’s a good idea to register their app for every possible file extension, because it will keep their app and logo in front of their users as often as possible. PLEASE DON’T DO THAT. That rule applies to App Contracts as well. Don’t abuse them, especially if you aren’t actually providing the functionality required. (COCA, Web, 2012)
4. Describe the difference between (22a) and (22b), referring to the function of given and new information in a sentence, discussed in Section 2.3. Considering the role of the surrounding discourse, why do you think the sentence in (22a) was chosen as the original version?
(22a) This morning we’d dropped into a broad valley, then climbed steadily for 2,000 feet, until now we were traversing a cinder cone whose lower reaches must once have oozed pasty lava. To the left rose a steep slope of volcanic cinders; to the right the lava fell away in a jumble of jagged rock. (COCA, Fiction, 2004)
(22b) This morning we’d dropped into a broad valley, then climbed steadily for 2,000 feet, until now we were traversing a cinder cone whose lower reaches must once have oozed pasty lava. To the left a steep slope of volcanic cinders rose; to the right the lava fell away in a jumble of jagged rock.
5. From the point of view of the English grammar system, as discussed in Chapter 1, the clauses in italics in (23)–(25) are all non-canonical, i.e., there are more basic versions for all of them. What would these canonical versions look like, and what kind of co-texts would be in line with them?
(23) It is very true, we do owe our existence to LotRO [= Lord of the Rings Online] – I never would have, by chance (if chance you call it), join the same server […]. (COCA, Blog, 2012)
(24) Just outside the starting cave there’s a little campfire blazing beneath a rock outcropping – the same one I scaled. Sitting in front of the fire is an old man. (COCA, Magazine, 2017)
(25) […] it’s clear that Ohtani the hitter is as good as ever. That much the Angels’ designated hitter showed on Thursday night […]. (COCA, Magazine, 2019)
6. In Section 2.5, we discussed two kinds of research design for the study of discourse syntax. Below you find two excerpts from abstracts of articles published in a peer-reviewed linguistics journal. Which one represents a variationist research design, and which one uses a text-linguistic methodology?
A. In this paper we analyse variable presence of the complementizer that, i.e. I think that/Ø this is interesting, in a large archive of British dialects. Situating this feature within its historical development and synchronic patterning, we seek to understand the mechanism underlying the choice between that and zero. Our findings reveal that, in contrast to the diachronic record, the zero option is predominant – 91 percent overall. Statistical analyses of competing factors operating on this feature confirm that grammaticalization processes and grammatical complexity play a role. (Tagliamonte & Smith 2005)
B. The so-called invariant tags, such as eh, okay, right and yeah, are extremely frequent in general English speech and have been studied extensively in recent years, especially in the spoken expression of teenagers, where they are a very common feature. In this article I focus on innit, as in She love her chocolate innit? and It was good innit? For this purpose, I analyse and discuss data extracted from two comparable corpora of teen speech […]. Findings confirm that innit is typical of the language of London teenagers and has not gone out of use; on the contrary, its frequency has increased over the last few years. In contrast, the proportion of tokens found in the language of their adult counterparts is rather marginal. (Martínez 2015)
Level 2: Interpretation and Research Design
7. According to the influential Chicago Manual of Style, the adjective anxious means “worried” and should not be used as a synonym for eager. Consequently, it should not be followed by an infinitive clause (anxious to help, anxious to leave), but by a prepositional phrase introduced by about (anxious about a result). The following table is adapted from a study based on data from COCA (Dant 2012).
| | Spoken | Fiction | Magazine | Newspaper | Academic |
|---|---|---|---|---|---|
| anxious about | 13 % | 10 % | 25 % | 19 % | 26 % |
| anxious to | 87 % | 90 % | 75 % | 81 % | 74 % |
a. The study combines both syntactic and text-related categories. What type of approach does it exemplify?
b. How would you visualize the results? What type of chart(s) would you choose?
c. What does this table not tell you about the distribution of anxious in the five main registers in COCA?
d. Are the results from the study expected, given the Chicago Manual of Style’s guidance?
8. In English, the complementizer that can often be left out, especially if it comes after a verb (like say or believe).
a. Describe the design of both variationist and text-linguistic studies that look at the phenomenon of that omission from different angles.
b. For the variationist study, pick three different verbs and retrieve a set of fifty attestations for each verb from a corpus (for instance, from COCA). How do you ensure that your data includes different tenses? Are there any false positives you need to discard?
c. Based on this data set, carry out an analysis that treats complementizer omission as a case of syntactic variation and visualize your results. Produce a pie chart for each verb. What is the best way to present your results in a chart?
Further Reading
For an overview article that illuminates the distinction between a discourse analytic and discourse syntactic perspective on the relationship between grammar and discourse, see Bowie & Popova (2020). For a discussion of non-canonical syntax, see Ward & Birner (2006) or Lange & Rütten (2017).
For the origins of functionalism and its development in the history of European and North American linguistics, see Newmeyer (2001). Newmeyer (2003) provides a thorough and well-argued reconciliation of form-based and usage-based linguistics. For a combined look at discourse and cognitive linguistics, see Tenbrink (2020). An accessible introduction to corpus linguistics can be found in McEnery & Hardie (2012) or, with emphasis on English, Lindquist & Levin (2018). On corpus data as an empirical method, including some description of statistical testing, see Stefanowitsch (2020).
For a classic variationist approach that discusses a range of grammatical phenomena from this perspective, see the studies collected in Rohdenburg & Mondorf (2003); for studies of text-linguistic variation, see, for example, Schubert & Sanchez-Stockhammer (2016).