This book offers a journey in search of the theoretical foundations of corpus linguistics as an empirical science. It looks for the sources of knowledge about language and society that we can find in corpora (systematic samples of language) and asks how this knowledge can be evaluated and built upon. But why does this matter? Although we live in an era of unprecedented access to information via various online media (as of 2019, Wikipedia alone had over 50 million articlesFootnote 2), with scientists making use of ever larger data sets (big data), our knowledge base has often been shaken by false claims and systematic disinformation. In politics, terms such as ‘post-truth’ and ‘fake news’ have been widely circulated to further erode our trust in traditional epistemic authorities such as science (e.g. Ylä-Anttila 2018). The classical view that ‘science should deliver the facts, and just the facts, needed for political decision-making, whereas liberal democracy should make decisions on the basis of these’ (Kappel and Zahle 2017: 1) can no longer be seen as generally accepted. We therefore need to look at the role of science in the process of knowledge acquisition, critically reviewing the foundations of scientific practice and situating the endeavour of linguists and social scientists within this framework. In addition, this exploration is important because, owing to its relatively short history dating back to the 1960s,Footnote 3 corpus linguistics has never made explicit the essential premises it is based upon, often being caught up in a discussion of whether it is principally a method or a theory of language (McEnery and Hardie 2011: 210ff.).
So what sort of foundations are we looking for? The foundations proposed in this book are arrived at by confronting corpus linguistic practice and its aspirations, on the one hand, with theoretical thinking about science (the philosophy of science), on the other. We start with a simple sketch based on a common-sense understanding of science, reality and truth, which will be further refined in later chapters of this book. The guide in the process of refining our initial ideas will be Popper’s (e.g. 1972, 1975) work on the philosophy of science and its operations. We also draw on a number of examples from corpus linguistics and other disciplines (e.g. astronomy, biology, physics, psychology) and engage with comments and criticisms raised in relation to corpus linguistics by different scholars. Essentially, the foundations of corpus linguistics we are looking for are not completely solid, immovable rules, but a set of principles, considerations and guidelines that reflect the reality of corpus research. These foundations are similar, in a metaphorical sense, to the foundations of the Capitol Building in Utah in the United States of America (see Figure 1.1). These are foundations built not on rock but on shock-absorbing pillars able to withstand the impact of an earthquake. These foundations are strong enough to support an important building (corpus linguistics, in our metaphor), yet they are flexible enough to accommodate exogenous shocks and challenges from critique and research findings.

Figure 1.1 Seismic base isolation system under the State Capitol Building in Utah
The method used in this exercise is one of heuristic engagement with corpus linguistics, philosophy of science and reality. Our claims are not absolute – in most cases, absolute claims lead to self-refutation. Instead, we seek to define and subsequently refine our understanding of the role of corpus linguistics as a discipline by building the contextualised groundwork for the application of corpus linguistics in the analyses of language and society. In this chapter, we sketch with broad strokes our initial position on corpus linguistics as science. We thus present what we believe is a common-sense approach to science. Based on the writings of others on the topic, and our own experience of it, we outline, informally, a description of what science appears to be. With this introduction to the idea of science, and an in-depth exploration of how it may apply to corpus linguistics established, we will then proceed to what will take up the bulk of this book: an exploration of the relationship between corpus linguistics and the ideas of Karl Popper regarding the scientific method. In appealing to this one framework we are, as we will show, drawing on a framework that many corpus linguists have touched upon, but not discussed extensively. We will also use that framework to move critically through a view of what constitutes science towards a view of what constitutes social science. This view will also encompass a variety of possible uses of corpus linguistics in digital humanities (e.g. Mahlberg and Wiegand 2020). After considering the nature of the data we interact with when we examine a corpus, we will conclude with a corpus-based study showing how the approach to social science we propose in this book – critical realism – applies to corpus linguistics through a consideration of the nature of statistically driven investigations of corpus data.
While our focus will be on the framework established by Karl Popper (see the Key Thinkers box), we will also draw on the work of other philosophers, including Thomas Kuhn and Imre Lakatos, when discussing some of Popper’s ideas and positioning ourselves relative to those.
Austrian-born British philosopher of science and social and political thinker. Born and educated in Vienna, died in London. Popper held academic positions at the University of Canterbury in New Zealand (1937–1946), the London School of Economics (1946–1949) and the University of London (1949–1969). His book The Logic of Scientific Discovery, originally published in German in 1934 and translated into English in 1959, with the latest revision appearing in 1972, is considered one of the most prominent works in the philosophy of science.Footnote 4
1.1 Linguistics and Science
The question of the relationship of linguistics to science has been very influential in the development of corpus linguistics. The move away from using observed language usage as evidence in linguistics was based, in part, on the claim that to use such observational data was not scientific, a claim still made by the linguist most closely associated with it, Noam Chomsky:
Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. Well, you know, sciences don’t do this.
This quote gives a particular view of science. From it, we can gather that Chomsky believes that science should be based on experimental observation and that natural observations are not the stuff of science. We may also infer that while these claims are being made specifically with reference to physics and chemistry, what is said is meant to be true of the sciences in general as ‘sciences don’t do this’. As a result of views like these, corpus data, observed language use, was eschewed by mainstream linguistics for a long time and to an extent still is by some linguists. Hence, what constitutes science, and the permissibility or otherwise of corpus data, are issues which are very much linked, though rarely explored in depth. We should state clearly and without equivocation that we do not agree with Chomsky’s view of what corpus linguistics constitutes, nor do we accept his conception of what science is. That is not to say that we do not believe that there is space for experimentation and evidence in linguistics and in the sciences. There is, and this book will argue for the need to bring research methods and data together to focus on research questions in order to explore them from a range of perspectives, insofar as this is possible and productive. Yet we do not accept that there is no place for natural observation in science; indeed, the observational sciences such as astronomy, epidemiology, geology and palaeontology routinely make systematic observations in order to make an assessment of theoretical claims. While they may also draw on other methods, systematic and structured observation is indispensable in these sciences. Science thus does not use only experiments under strict lab conditions to systematically collect data. In certain situations, videotapes, which Chomsky condescendingly uses as a metaphor for measurements in real-world contexts, can provide indispensable and ecologically valid sources of data.Footnote 5 Linguistics, in our view, is the same.
Corpus linguists do not aimlessly collect data as Chomsky suggested; however, we always need to be careful to critically evaluate the quality of corpora used in research. This leads us to our first principle.
Principle 1: Either when building a corpus or when using one, corpus linguists should use research questions in order to engage with their data in a structured and controlled way.
So, for instance, if you wish to look at language in informal spoken settings, you go to those settings and, using parameters that you believe may be relevant (different contexts, speakers of different ages, social classes etc.), you gather data and ensure that it encodes all of the contextual information you thought may be of interest. Afterwards, you use the structuring of the data to extract only the data that is relevant to your questions. So both when building a corpus and using it we are aware of the need to allow the data to respond to linguistically motivated, not aimless, enquiry.
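The workflow just described can be sketched in code. The following is a minimal illustrative sketch, not an implementation from any particular corpus tool: the mini-corpus, the field names and the `select` helper are all invented for this example. It shows how recording contextual metadata at collection time lets a researcher later extract only the data relevant to a research question (Principle 1).

```python
# Each recording is stored with the contextual metadata gathered at
# collection time, so later queries can retrieve only what a given
# research question requires. (Toy data, invented for illustration.)
corpus = [
    {"text": "you alright mate", "setting": "pub", "speaker_age": 24},
    {"text": "shall we begin the meeting", "setting": "office", "speaker_age": 51},
    {"text": "that was well funny", "setting": "pub", "speaker_age": 19},
]

def select(corpus, **criteria):
    """Return only the texts whose metadata matches every criterion."""
    return [
        item["text"]
        for item in corpus
        if all(item.get(key) == value for key, value in criteria.items())
    ]

# Extract informal pub speech only, as a research question might require.
print(select(corpus, setting="pub"))
```

Real corpora encode such metadata in standardised headers (e.g. XML) rather than Python dictionaries, but the principle is the same: structured collection enables structured, question-driven retrieval.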
American linguist and social and political commentator. He has been a major influence on defining the programme of linguistics in non-empirical terms in the United States and internationally with his influential work Syntactic Structures (1957).
Chomsky is one of the major critics of corpus linguistics, famously claiming that corpus-based ‘description would be no more than a mere list’ (Chomsky 1957: 159).
By the end of the book, we will have modified our claim and our categorisation of linguistics from a science to a social science. This does not change, nor is it caused by, our commitment to the permissibility of controlled observation of language, gathered from natural contexts of occurrence, as an important method in linguistics, science and the social sciences. Yet we will begin our journey to our categorisation of linguistics as social science by considering corpus linguistics and science not only because of how Chomsky has characterised it – as not being science – but also because of how some corpus linguists (e.g. Leech 1992) have characterised it – as science.
One of the claims that has been made to support the corpus approach to the study of language is that corpus linguistics is, or at least contains elements of, science. Leech (1992: 112) argues that corpus linguistics conforms to ‘standards commonly applied to a theory in the scientific method’, while de Beaugrande (1996: 533) argued that the main reason for using corpora was that they offered ‘the first time in many years, a genuine opportunity to reorganise … doing language science’. McCarthy (2001: 125) makes the claim that corpus linguistics promises ‘cutting edge change in terms of scientific techniques and methods’, while Stubbs (2001c: 154) draws parallels between corpus linguistics and the observational science of geology. Similarly, McEnery and Hardie (2011) appeal to the scientific method when arguing for the corpus approach to language. Yet the idea of the scientific method, as appealing as it sounds, is somewhat nebulous. This may sound surprising – surely scientists have a well worked out, non-contentious way of going about their studies? The truth of the matter is that the scientific method has either been heavily contested, as this chapter will show, or is largely presented in the work of most scientists as a set of routinised procedures through which science can be conducted. A grand framework building up an overarching set of principles and practices by which all scientists work is a very nice idea but, as with many ideas, has proved hard to achieve. For this reason, we build on simple principles such as common sense (see the next section), laying out the presuppositions behind those principles in the open and considering the role of inference and logic in building an empirical paradigm.
Let us begin by trying to draw out the main elements of what constitutes a scientific method. To slice through some of the indeterminacy, we will, as stated, begin with a common-sense approach to the study of science. That will allow us to sidestep, for now, some of the more esoteric points, for example about the nature of ‘truth’, though we will return to that issue in this chapter and it will recur throughout the book. Having established a general framework for the scientific method, we will then explore the extent to which corpus linguistics actually matches it. The consideration of where the match may not be ideal then leads us on to our next chapter. To begin with, we will deal with a very simple conception of corpus linguistics, essentially based on looking up word forms and calculating frequency information. Later in the book, when we move to problematise the engagement of corpus linguistics with the scientific method, we will consider a wider range of searches and techniques in corpus linguistics.
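To make this simple conception concrete, here is a hedged sketch of what ‘looking up word forms and calculating frequency information’ amounts to computationally. The tokeniser and the sample sentence are invented for illustration; real corpus tools use far more careful tokenisation.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count word forms after a naive tokenisation on letters/apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "The cat sat on the mat. The mat was flat."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Even this trivial example embodies analytic decisions (what counts as a word form? is ‘The’ the same form as ‘the’?), a point that becomes important once corpus annotation and its presuppositions are discussed.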
1.2 Realism and Common Sense
Let us begin with a common-sense definition of what science is. Science is the search for rational statements about reality. We appeal to common sense as the grounding for the position of realism taken in this chapter – human thought and individual physical actions and objects exist. Note that in appealing to common sense here we are setting aside, quite firmly, a series of positions that seek to challenge this view of science: scepticism, idealism and relativism. While we reject these as extreme positions, we accept elements of methodological scepticism and relativism (perspectivism) as healthy checks of our own thinking about reality. To deal briefly with each position, radical sceptics may question the very nature of reality itself, claiming, for example, that we can know nothing as nothing may exist. However, our appeal to common sense here is designed to put what we view as flights of philosophical fancy to one side. We accept the existence of reality – as indeed radical sceptics tacitly do. We are reminded of a joke told to us by a colleague who studies Hinduism. A pupil is sitting in a forest clearing with his guru, who is explaining that all of reality is an illusion. At this point, an elephant emerges from the trees, charging at the pair. The guru stands and runs away, with his pupil following him. When the pair stop to catch their breath, the pupil asks, ‘Why did you run? The elephant was surely an illusion.’ The guru replies, ‘My running was an illusion too.’ This neatly shows that sceptics seem to act as though reality exists – and quite rightly so too in our view.
Sceptical positions can be easy to lampoon, as we, and others, have – Augustine of Hippo reports the difficulty of one prominent sceptic, Carneades, faced with the apparently difficult question of whether he was a man or a bug.Footnote 6 Yet while such positions can be viewed as risible, it took someone of Augustine’s stature to be able to mount such an obvious attack because sceptical ideas, especially when linked to elites such as distinguished philosophers or religious teachers, have great appeal. So-called privileged cynicismFootnote 7 is one of the standard fallacies of reasoning, where highly sceptical views are accepted simply by virtue of being associated with a perceived elite. While critical thinking – the application of methodological scepticism – is a useful guide to querying hypotheses and testing them, so-called ontological scepticism defies common sense all too often, hence we set it aside here.
We also distance ourselves from idealism – the position that only the mind exists and objects and actions are imagined by the mind. This was the proposition of thinkers such as George Berkeley who, when proposing the possibility that a stone existed in the mind only, was famously confronted by Samuel Johnson, who kicked the stone and said ‘I refute it thus’. While Johnson was actually missing Berkeley’s point, we would refute the argument along the lines of common sense – it beggars belief that this is the case. Our everyday practice also points to the reality of the world around us: it is a common-sense assumption that underlies our interaction with other people (who are outside our mind, as clearly shown by the possibility of misunderstanding them) and objects around us (which we treat as real).
Finally, we set aside relativism – the position that truth is for the individual only, hence no shared and objective truth is possible. Again, we appeal to common sense – in our everyday practice we assume convergence on some basic truths, which guide us as rational actors in the world. We assume, for instance, that if we drop an object, it will fall down to the ground. We also make a distinction between facts and opinions, the former being indisputable while the latter are open to debate and interpretation. Note that at this stage, we intentionally do not problematise this distinction, although examples of sensory deceptions and measurement errors can blur this line. Later (in Principle 14), we further develop this point within the framework of critical realism, which acknowledges that we have quasi-contact with reality. As with scepticism, we allow the possibility of methodological relativism (or perspectivism), which recognises the complexity of reality and our imperfect grasp of the truth. This allows for multiple perspectives, or interpretations, to compete in a rational debate when searching for the truth.
Dismissing such fine philosophical positions on the basis of common sense may seem to be rather high handed. However, a study of the scientific literature reveals that scientists are much more concerned with common-sense views of reality like ours than with any of the alternative positions outlined earlier. A cursory inspection of popular science journals will demonstrate to any inquisitive reader the utter absence of scepticism, idealism and relativism from the scientific literature. Indeed, such is the absence of these ideas that the overt identification of realism is also absent from the literature, though it may be implied. Take the following quote, for example:
On 20 March, the GBR Marine Park Authority in Townsville, Australia, reported that divers were finding extensive coral bleaching—the loss of symbiotic algae—in remote northern areas of the reef. Many sections were already dead. Subsequent flyover surveys have confirmed an unfolding disaster, with only four of 520 reefs appearing unscathed.Footnote 8
There is no scepticism in this text. There is an engagement with probability, but no denial of reality. There is no suggestion that coral reefs are the projection of the mind. There is no attempt to say that while the author believes certain coral reefs have died, the reader may choose to believe that this is not so. The position of realism is accepted in science to the extent that the other positions are bracketed away. We can see similar processes at work in corpus linguistics too – Leech (2011) titles his article arguing for the decline of modal verbs in English ‘The modal verbs ARE declining’. His certainty, which will be explored in Chapters 6 and 7 of this book, brackets away positions just as neatly as the example given previously. Common sense, the tool we use to do this here, is widely appealed to in science, as shown in this quote from physicist Max Born:
The simple and unscientific man’s belief in reality is fundamentally the same as that of the scientist.
For Einstein, a correspondence to what was observed in nature was the only thing that he felt could be seen as proving a statement ‘true’.Footnote 9 Scientists routinely rely on realism and so should linguists.Footnote 10
Note, however, that common sense, while useful in supporting a realist position, must itself be held accountable to correspondence with reality. Here, we make a realist assumption about the nature of truth as correspondence to reality. Our statements are true if they correspond to, or are in agreement with, the facts. The problem often lies in the fact that while we have direct access to our statements and theories, we only have indirect quasi-contact with reality. This point is further developed in Section 2.4. For a discussion of the correspondence theory of truth and its competitors, see, for example, Mosteller (2014).
The most important implication of this position, however, is the fact that common sense itself is not to be blindly believed but must be open to new evidence. What is observed may, at times, confound common sense and must have primacy over common sense. There are plenty of good examples of this in corpus linguistics. We have asked audiences, many times over, to give a quick definition of the verb cause in English using nothing but their powers of introspection. People invariably say ‘make something happen’ or words to that effect. It is good common sense that the word has that meaning. Yet such audiences invariably, we have found, fail to note the negative discourse prosody of the word, even though that is well attested in corpus linguistics; that is, the word has a typical meaning of ‘make something bad happen’. Their world is not, however, undermined by their failing, because of the primacy of the correspondence to reality, the key feature of realism. They can accept that their common-sense view of the world is wrong because it can be demonstrated that, on the basis of an appeal to observed evidence, it is so. A common-sense view of linguistics is that what we say or write matters – this in turn allows a clear correspondence to reality to be observed. If I say that a construction is impossible in a language, but I observe it in use and it is clearly acceptable to members of the relevant speech community, then my claim has failed to correspond to linguistic reality. We will return to consider this view of language in more depth in Chapter 5.
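The negative discourse prosody of cause is established by inspecting its collocates in large corpora. As a hedged illustration only, the sketch below counts the words co-occurring within a fixed window of a node word; the toy sentences are invented, not drawn from any real corpus, and a serious study would rank collocates with statistical association measures rather than raw counts.

```python
import re
from collections import Counter

def collocates(tokens, node, window=3):
    """Count tokens occurring within `window` positions of each node token."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented example sentences in which 'cause' keeps bad company.
text = ("smoking can cause serious damage and stress can cause "
        "illness while accidents cause injury and delays")
tokens = re.findall(r"[a-z]+", text.lower())
print(collocates(tokens, "cause").most_common(5))
```

Note that raw counts are dominated by function words such as and; this is precisely why corpus linguists apply association measures (e.g. mutual information or log-likelihood) before interpreting the company a word keeps.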
So, we settle here on a common-sense view of the world in our explanation of the scientific method. In doing so, we align ourselves with what is common scientific practice. We also accept the primacy of the correspondence with reality in the deployment of common sense as that is, in and of itself, a common-sense position. Corpus linguistics sits very well with this approach to science. With that established, we can now move on to consider what a rational (i.e. scientific) approach to the study of language would entail because, as you will remember, we had initially defined science as a search for rational statementsFootnote 11 about reality.
1.3 The Rational Approach to Language
A rational approach to the study of language relies first and foremost upon the resources that can be marshalled to investigate a hypothesis. This includes the equipment to collect data, the means of analysing the data, the necessary infrastructure that a scholar requires to undertake research and the training in the specialism necessary to carry out the research. Applied to corpus linguistics, we assume that researchers:
need the means to collect/select appropriate data – perhaps they select an off-the-shelf corpus or they build one, for example, by using a news aggregator or by scraping material from the web;
have access to a computer and a concordancing package (or programming language) suitable for the investigation in hand;
are able to discuss their work with other scholars and to consult relevant studies in a library;
have been trained to use the concordance package (or programming language) and understand the principles behind the tools available in that package (or affordances of the programming language).
The work itself, of course, is rooted in realism – corpus linguists accept that language data exists and can be used to test hypotheses for their correspondence to linguistic reality – how language operates in the real world. Note that if any of these preconditions fail, we may conclude that the research is not well founded. For example, a lack of access to relevant data and appropriate software precludes the work. A lack of training in how to use and interpret corpus linguistics methods raises the spectre of the results being misinterpreted. In short, these simple preconditions for successful research within the scientific method are drawn from the best practice in the field; the failure of any of these preconditions can be entirely debilitating.
The key point here is that corpus research can and often does meet all these preconditions. Note that, in doing so, working with linguists who use other methods (e.g. eye-tracking, linguistic experiments) is not precluded – they simply fulfil these preconditions in different ways. However, when looking across the matrix built from how different researchers working together fulfil these preconditions, we need to ask why and how these approaches work together, of course (Baker and Egbert 2016). This allows us to consider also how one researcher may deploy many methods. So, for example, let us consider how McEnery (2005) worked when looking at swearing in English. His initial approach applied existing analytical frameworks to the words he was investigating and found those to be wanting when faced with corpus data. He then used intuition and the exploration of small samples of swearing to build up a hypothesis about how bad language could be analysed. Having developed the analytical scheme, he then coded up a large corpus with it and studied the subject afresh using a corpus approach.
In short, the same researcher went through the process of fulfilling the preconditions for research three times. On the first occasion, library research helped to reveal existing schemes of analysis that were tested on examples from the corpus, retrieved through concordancing, and found wanting. This led to the second approach, building on that experience and adding data that was introspective in part, but also based on small-scale qualitative studies of textual data and the experience of the failure of the existing schemes. The equipment used in this stage was largely that of the qualitative researcher, which we might characterise as hand, eye and a pencil. That was supplemented by a testing of the work of other scholars to see whether it was able to account for the data observed. The techniques were those of the linguist working in close contact with the data and bearing in mind the endeavours of other linguists who had attempted the task. This resulted in a new categorisation scheme that allowed the different factors influencing a word’s use to be identified and applied in context. For the third pass through the research, that analytical scheme was applied to corpus data – it was a process of data categorisation and collection. Corpus analysis software was then applied to the data in order to see large-scale quantitative patterns in the data set. What does this discussion show us? It not only shows us very clearly the necessary preconditions that have to be fulfilled before a study can be said to be well founded, but also points towards the reality that researchers often analyse and re-analyse data using different approaches in an appropriate sequence, and that scientific investigation is structured in such a way as to facilitate different perspectives furnished by different methodological approaches. This is a theme that will be returned to in Chapter 3.
The list of preconditions of a rational (scientific) approach to linguistic reality helped us with the first delimitation of the scientific method. However, this sketch is fairly general. While it would be immensely useful to be able to set out a fixed menu of the things that scientific studies do, in the order in which they should be done, this is not possible. Consider the example given regarding swearing. There are many ways to approach the topic. We cannot provide a simple menu of things to be done to explore swearing in a scientific way – different researchers, with different interests, will approach the question in different ways and that is quite valid (see, e.g., Jay 2000, Ljung 2011 and McEnery 2005). So, for example, McEnery (2005) was interested in characterising swearing in a general population and could not rely on previous coding schemes as being reliable. On the other hand, if he had chosen to look at parts-of-speech across speech and writing, the existence of reliable part-of-speech taggers applying generally agreed upon analytical schemes to corpus data would have meant that the initial qualitative investigation would have been arguably unnecessary (though we will cast some doubt on this claim in Chapter 4). So rather than a simple-minded, recipe-like approach to the scientific method, we focus instead upon the general principles that embody the scientific method. Based on the discussion so far, we argue that the scientific method, beyond the question of resources, is formed of three basic components of the scientific approach: presupposition, logic and empiricism. The following sections outline and explore each of these items in turn.
1.4 Presuppositions
We need to be very clear that scientific investigation entails presuppositions, sometimes called a priori assumptions. These are understood as a starting point for an argument, something we can (hopefully) agree on before we proceed with further investigation. To better understand this, we need to split the presuppositions we take into the study of language between the ontological and the epistemological; the former are related to the nature of reality, while the latter are related to how we acquire knowledge about reality. The ontological presuppositions include the existence of reality and our ability to interact with it – this is the realist position that we established earlier (see Section 1.2). Similarly, we presuppose the existence of an individual mind (brain) and the social reality of human interactions, with language being an important part of both. We thus assume that language has some form of existence in the brain of an individual (e.g. in the form of a mental lexicon) as well as a social existence as a shared entity in a society, in that people interact using language and transmit it to their children.
As an example of an epistemological presupposition, which we have discussed earlier (see Section 1.1), we can state a new principle of corpus linguistics.
Principle 2: The study of observed language is a means by which language can be investigated and explained.
Sometimes, epistemological assumptions directly follow ontological ones. For instance, we assume that language is an ontologically complex system. Recognising this complexity, linguistics as a science of categorisation offers different ways of dealing with linguistic reality. When studying human communication, we may thus posit the phoneme as a unit of sound and produce different categorisations of sounds into phonemes. We may decide that we need to study the different categories of sign in British Sign Language. We may argue about where word boundaries are and what constitutes a word. Parts of speech, syntactic structures, collocations, semantics, pragmatics, sociolinguistics and so on – all of these areas, rooted in our epistemological approach to language, demand rich ontological knowledge. This means that there are ontological presuppositions underlying linguistics and corpus linguistics – and this knowledge is important to our understanding of human communication. This leads to another principle.
Principle 3: Corpus linguistics, especially in the form of corpus annotation, is an area where the ontological presuppositions of linguistics become clear.
To the extent that corpus linguists have been very clear about the epistemological presuppositions behind their choice to use corpus methods to approach the study of language, corpus linguistics has abided by this principle of the scientific method. Similarly, by developing annotation schemes and showing how those schemes apply to observations in nature, that is, attested language use, many corpus linguists also clearly acknowledge the ontological presuppositions underlying their work. However, whether those ontological presuppositions are openly acknowledged or not, there is no escaping the importance of ontological presuppositions in all linguistic enquiry.
1.5 Logic
Science applies logic to real-world observations in order to develop rational models of the world. In doing so, it may work from the observed to the non-observed through a process of reasoning. So, Newton could make a series of observations of the operation of gravity on and from Earth and through that infer the operation of gravity both on Earth and beyond it. Corpus linguistics works on similar grounds using similar logical procedures.
Broadly, logic falls into two categories: deductive and inductive. In deductive logic, the premises of a well-formed (valid) argument entail the conclusion; if the premises are true, the conclusion must be true. In inductive logic, on the other hand, the premises, which are individual statements, provide some degree of support to the conclusion.Footnote 12 The distinction between the two is more complicated than this, as will be shown later (Chapter 3).Footnote 13 In deductive logic, which we may characterise as top-down, we typically start with a general statement, which we believe to be true, and proceed to an individual conclusion. The following is an example of deductive reasoning:
Premise 1: All birds are animals.
Premise 2: A robin is a bird.
Conclusion: A robin is an animal.
Induction, which we can characterise as ‘bottom-up’ reasoning, on the other hand, starts with individual premises, from which it derives a universal conclusion.
Imagine that someone travels to a new land and sees an animal (we will call it A), which they believe to be a bird, eating something that they do not recognise (we will call it B), but they believe it to be a fruit. They may make the following induction:
Premise 1: Creature A is a bird.
Premise 2: Object B is a fruit.
Premise 3: Creature A likes to eat B.
Conclusion: All birds like to eat this fruit.
There are premises here – that A is a bird, B is a fruit and that A enjoys eating B. Then there is an induction, a generalisation, that all birds enjoy eating this fruit. However, we continue with the observation and later that day we see another bird that does not like to eat B. This shows our conclusion to be false. The problem of induction is that it is generally impossible to exhaustively analyse something and arrive at a general conclusion (e.g. a scientific law) based on individual observations. This point is further discussed in Section 2.2.
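The logic of the bird example can be sketched in a few lines of code. The observations below are, of course, invented for illustration; the point is simply that a universal conclusion induced from the evidence to hand is overturned by a single later counterexample.

```python
# The traveller's first observation: creature A (a bird) eating fruit B.
observations = [
    {"creature": "bird_1", "eats_fruit_B": True},
]

def induce_all_birds_eat_B(obs):
    """Induction: generalise 'all birds like to eat B' from every
    observation made so far."""
    return all(o["eats_fruit_B"] for o in obs)

conclusion = induce_all_birds_eat_B(observations)
print(conclusion)  # True: the generalisation holds on the evidence so far

# Later that day, a second bird is seen refusing to eat B.
observations.append({"creature": "bird_2", "eats_fruit_B": False})

conclusion = induce_all_birds_eat_B(observations)
print(conclusion)  # False: one counterexample falsifies the universal claim
```

No amount of confirming observations could ever have established the conclusion with certainty; one disconfirming observation suffices to refute it.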
In language, even if we wish to proceed by induction, we often face issues of data scarcity. For example, you need a very great deal of naturally occurring data before you begin to see enough examples of what we might term ‘moderately frequent’ words – words that would strike no native speaker of that language as unusually infrequent, but that have a low enough frequency that you need a great deal of data to observe some examples of the word.
To give an example, let us draw on one of the more familiar thought experiments and invoke a Martian scientist (we will call him Ssorg) who has a copy of the 1,000,000-word BE06 corpus (Reference BakerBaker 2009). Ssorg decides to use induction to explore British English by using the corpus and starting to look at words in it – however, the first two words Ssorg decides to look at, raspberry and nanomachines, each occur only once. Given the complexity of the language he wants to learn (as demonstrated by variation in the use of more frequent words in the corpus), Ssorg concludes that for these strange words there is nowhere near enough data to be able to look at the behaviour of either word through induction. He acknowledges that any inferences he makes about those two words based on such little data will be very weak – he may generalise about the meanings and behaviours of these words based on the data, but it is far too little data to say anything with confidence. Accordingly, he decides to proceed using deductive inference instead. He feeds the words into a mysterious model of English, based on deduction, which exists in the Martian laboratory. It takes the words and infers the following:
The part of speech of raspberry: singular noun
The existence of the word rasp from raspberry
The part of speech of rasp: verb
The existence of the word berry from raspberry
The part of speech of berry: singular noun
The part of speech of nanomachines: plural noun
The existence of the word nanomachine from nanomachines
The part of speech of nanomachine: singular noun
The existence of the word machine from nanomachine
The part of speech of machine: singular noun
The existence of the word machines from machine
The part of speech of the word machines: plural noun
Ssorg looks in the corpus and finds the word rasp and it indeed appears to be a verb. He finds no examples of nanomachine, but does find machine and machines. He is very interested in the words rasp and berry and uses the corpus to form a hypothesis about their meaning. He hypothesises rasp to be a noise and berry to be some type of edible object. He starts to wonder about the meaning of raspberry. From a previous study of how words are composed in the language, he decides that a raspberry is a noisy vegetable in the subcategory of fruit. He also wonders about what nano might mean. The model did not identify it as a word, but, reasoning by deduction from the case of raspberry, he assumes it must be a type of machine. So, while his model provided him with some answers that seemed to offer insights, it also left some questions for him to consider. However, all told, he is pleased with the results provided by the deduction machine; they seem plausible to him.
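We are not told how Ssorg's deduction machine works, but the flavour of its top-down reasoning can be sketched with a toy rule system of our own invention: general morphological rules, assumed true in advance, are applied to individual unseen words.

```python
# A toy sketch (our invention, not the Martian model itself) of deduction
# over word forms: each rule is a general premise; applying it to an
# individual word yields an individual conclusion.

RULES = [
    # (condition on the word, inference drawn from it)
    (lambda w: w.endswith("s"),     lambda w: (w[:-1], "plural noun -> singular noun")),
    (lambda w: w.endswith("berry"), lambda w: (w[:-5], "compound: X + berry")),
]

def deduce(word):
    """Apply every general rule whose condition holds to this word."""
    inferences = []
    for condition, infer in RULES:
        if condition(word):
            inferences.append(infer(word))
    return inferences

print(deduce("nanomachines"))  # [('nanomachine', 'plural noun -> singular noun')]
print(deduce("raspberry"))     # [('rasp', 'compound: X + berry')]
```

If the general rules are true, the individual conclusions follow necessarily – but, as Ssorg's 'noisy vegetable' shows, conclusions deduced from imperfect premises can still be wrong about reality, which is why he later welcomes an empirical check.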
Another Martian scientist (we will call her Izlyr) in the next room has access to one billion words of present-day English drawn from the web in a similar time period to her colleague. Izlyr finds 334 examples of raspberry and 314 examples of nanomachines. She knows that she is almost certainly not looking at all examples of the words – she is aware that, next time she is in range of communication satellites orbiting the Earth, she will be able to get more data and, given the frequency of the words in this batch of data, she may well expect to see more examples of them then. She concludes, however, that her current batch of data would give her enough examples to proceed with the use of inductive reasoning to explore the words. However, Izlyr decides to put the task aside for the day – she wants to go for a drink with her friend Ssorg. Ssorg and Izlyr meet for a drink and Ssorg reveals his deductions about the words raspberry and nanomachines – Izlyr realises that Ssorg’s beliefs about the use of these words can now be subject to a larger-scale test of correspondence to reality using the data set she had summarily inspected earlier – do the words act in the way Ssorg expects them to in the internet data? Do they have the meanings Ssorg thinks they do? Izlyr’s resulting study of these words by inductive reasoning in this empirical framework is objective – back on Earth you too could find those examples of the words in that data set and check any claims Izlyr makes about those words based on that data. Yet we must always acknowledge, as Izlyr does, that our view of data is partial and this can have consequences for claims about the nature of the inferences we draw, as we will explore in Chapter 3. For now, note that the way in which the Martian scientists worked together parallels, to an extent, some of the established paradigms of science in which deduction and induction are both used.
Of course, much depends on whether we start with one or the other, as will be seen in Chapter 4. However, for now it is important to bear in mind that while we may characterise inference as being inductive or deductive, in science the two are often used in concert. The scientific process can thus be metaphorically likened to the meeting Ssorg and Izlyr had in the Martian coffee house.
A further point that some readers will already have arrived at represents another way, perhaps the principal way, in which corpus linguistics uses inductive reasoning. A form of inductive reasoning, ampliative reasoning, allows us to accept that the inferences we are making are bound to likelihood rather than certainty. Hence, one might wish to place some degree of confidence or strength on the inferences we draw, giving rise to probabilistic statements. Ampliative reasoning does not necessarily give rise to certainty. Hence the appropriate use of statistics in science is a major way in which ampliative reasoning is supported, and some degree of probability may be expressed in the results of that reasoning. Exactly the same is true of corpus linguistics and, to that extent, corpus linguistics once again aligns itself well with the sciences and the scientific method, as summarised in the following principle.
Principle 4: Because we have a presupposition of reality, through which we can measure hypotheses against observable, relatively objective data, we are able to use the apparatus that has been developed over centuries to support inductive reasoning, including through statistical analysis.
Again, a rationalist, exclusively intuition-led approach to the study of language cannot exercise this advantage because its presuppositions and sources of evidence essentially preclude it from doing so. We may be able to express an opinion as to the probability of an introspective judgement being true, perhaps by asking lots of people to answer the same question, but, as we saw with the example of cause earlier in the chapter, we cannot conclude that the introspections or expressions of likelihood based on these introspections bear a direct relation to linguistic reality; in short, our intuitions about language are unreliable.
Of course, while we would argue that corpus linguistics uses logical argument and relies mainly on inductive logic to support its observations, that does not mean that researchers in corpus linguistics necessarily do this consciously and explicitly. Logical formalisms are rarely, if ever, seen in corpus linguistics research. Rather the logic is implied and embedded in the narrative of the arguments presented. For example, consider the following statement from the conclusion of Reference Baker, Gabrielatos and McEneryBaker, Gabrielatos and McEnery (2013: 275):
the quantitative analysis found that Muslims were frequently constructed in terms of homogeneity and connected to conflict.
This is a statement arising from inductive logic. The researchers collected a very large volume of newspaper material from the UK press that mentioned Muslims and Islam. On the basis of the analysis of that database, they found that there are many occasions in which the mentions of Muslims define them as an undifferentiated group and also associate them with conflict. The study is well founded in terms of the preconditions for scientific study outlined in this chapter and it shares the presuppositions of corpus linguistics. Inductive reasoning, working on a very large subset of British newspaper writing about Muslims, is used to produce two conclusions couched in probabilistic terms. What the conclusion does not look like, or contain, are logical statements such as this:
(∀x) (A(x) & B(x))

Where x is Muslim in the press, A means homogenised and B means engaged in conflict
We might produce this if we were trying to capture elements of this argument in first-order predicate logic. For simplicity, we replace the informal quantifier ‘frequently’ from the conclusion by Reference Baker, Gabrielatos and McEneryBaker et al. (2013: 275) with a universal quantifier ∀ meaning all; this could be further refined by treating the statement as a probabilistic one. What is important, however, is the fact that the absence of such a formal expression of the results makes the conclusions drawn no less scientific. The underlying process is still one of logical induction, though informally expressed – the common approach to expressing logical inferences in the sciences.
∀ is a universal quantifier meaning all; the expression (∀x) A(x) can be translated as: for all x A applies.
∃ is an existential quantifier meaning some; the expression (∃x) A(x) can be translated as: for at least one x A applies.
¬ is a negation operator meaning ‘not’; ¬∃ means ‘there does not exist’.
& is a logical connective meaning ‘and’.
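The quantifiers just defined can be evaluated mechanically against data, which also shows why ‘frequently’ is better rendered probabilistically than with ∀. The ‘mentions’ below are invented for illustration, not drawn from Baker et al.’s data.

```python
# Hypothetical press mentions, each coded for the two properties:
# A(x) = homogenised, B(x) = engaged in conflict.
mentions = [
    {"homogenised": True,  "conflict": True},
    {"homogenised": True,  "conflict": False},
    {"homogenised": False, "conflict": True},
]

# (forall x)(A(x) & B(x)): every mention has both properties.
universal = all(m["homogenised"] and m["conflict"] for m in mentions)

# (exists x)(A(x) & B(x)): at least one mention has both properties.
existential = any(m["homogenised"] and m["conflict"] for m in mentions)

# 'Frequently' as a probabilistic statement: the proportion of mentions
# with both properties.
proportion = sum(m["homogenised"] and m["conflict"] for m in mentions) / len(mentions)

print(universal, existential, round(proportion, 2))  # False True 0.33
```

A single mention lacking either property falsifies the universal reading, while the proportion captures the hedged, inductive claim the researchers actually make.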
In corpus linguistics, we build a small-scale model of language – a corpus – on which inductive inference can be based, and the inferences made are then projected onto the larger population, language itself. These inferences are probabilistic. A key presupposition of corpus linguistics is enshrined in the following principle.
Principle 5: By studying corpora, that is, finite samples of language, we can make general claims about language itself; these claims are probabilistic in nature.
Given that probability and inductive inference are commonplace features of science, asserting that corpus linguistics is in line with the scientific method in doing the same seems non-problematic. Problems do lurk in this approach, however, and we will return to them in Chapters 3 and 4. For now, we shall note that the problems to come are not peculiar to corpus linguistics and that they in no way mean that we should not proceed to analyse language in this way. As will become apparent, it does mean, however, that we need to be cautious and critical of our findings when proceeding by this route, whether we are studying language, geology, astronomy or a host of other subjects that carry out research on this basis.
1.6 Empiricism
Here we touch upon an area where the presuppositions of corpus linguistics become key to understanding the corpus approach to language. What constitutes the data that we use to explore language? The choice of data to be used to study language has important implications for the reliability of the theory based on the data. The great advances in scientific method in the mediaeval period introduced innovations such as controlled experimental methods, most importantly through the work of Robert Grosseteste; in addition, crucial distinctions were made such as William of Ockham’s scientia realis and scientia rationalis. The former is an area of investigation ‘concerned with what was known by experience to exist and in which names stood for things existing in nature’ (Reference AbercrombieAbercrombie 1971: 172), a science of real entities, where an appeal to correspondence with nature is clearly possible. The latter is ‘the science of logical entities’ (Reference AbercrombieAbercrombie 1971: 172), what we may think of as the study of concepts in the mind.Footnote 14
Ockham was one of the most prominent figures of mediaeval philosophy and theology. The English Franciscan friar’s work surpassed the conventional teaching and wisdom of his time, so much so, we may add with the benefit of modern hindsight, that he was charged with heresy. He died in exile in Munich. In his work, Ockham makes a number of important epistemological distinctions – distinctions related to the theory of knowledge. He is best remembered for the principle of parsimony now widely known as Occam’s razor: ‘Entities should not be multiplied without necessity’; in other words, a simple solution to a problem is better than a more complex one.
While scientia realis gives rise to modern empiricism, scientia rationalis is a predecessor of rationalism. The former relies on sense data and experience as the primary source of knowledge; the latter, on the other hand, foregrounds reason, logic and rational constructs. This broad split is present in many disciplines, including linguistics. However, the next principle presents the views of corpus linguists.
Principle 6: Corpus linguistics inclines to scientia realis – it is the study of observable language.
This principle is clearly both important and bound to provoke challenge – it is a short statement, but points to quite fundamental differences in outlook in research. As a consequence, we will return to this concept and explore it afresh in Chapters 2 and 4, and consider revising it. Indeed, for any principle presented in this book, we may, as the argument proceeds, revise it. However, for the present we will leave Principle 6 as a clear statement of what we believe corpus linguistics to be.
Through Principle 6, inferences about mental states may be made, but the evidence for those inferences is externalised and observable. Contrast this position with that of linguists such as Chomsky who argue against the use of observational data – they are focused on a scientia rationalis, we would argue, using concepts in the mind to study concepts in the mind. What is studied and the data used to study it are both internal and conceptual.
Such distinctions have provided broad divisions in the intellectual landscape ever since. Descartes carried on the rationalist tradition of science focused upon philosophical reasoning, while early scientists such as Francis Bacon, an empiricist, emphasised instead the need for empirical data. Principle 6 locates corpus linguistics clearly in the empiricist tradition of science – as with the work of the early empiricist John Locke and later scholars such as David Hume, Principle 6 seeks to ‘relate the contents of our minds, our knowledge and beliefs, and their acquisition, to sense-based experience and observation’ (Reference WoolhouseWoolhouse 1988: 2). It is an approach in which external evidence is used to regulate reasoning about internal mental states. This is very important, as the use of such data provides grounds on which theory choice can be conducted scientifically – the choice of data, hardwired into the presuppositions of corpus linguistics, externalises and renders objective the grounds on which theories can be refuted. A theory of language is a good theory if it fits the data. It is a poor theory if it does not. If data are internalised and subjectivised, as happens with exclusively intuition-based approaches to the study of language, then while one might argue that theories that fail to fit the data should still be rejected, the data have become more difficult to assess – people’s intuitions about language are unreliable and diverge. As Reference SeurenSeuren (2004: 110) notes, introspections are ‘necessarily fluid because they involve (a) reporting one’s own behaviour, a well-known source of error, and (b) membership of a speech community, well-known for its variation and vague boundaries’.
In this context, the path opens to a number of fallacies that can be used to confound arguments, most notably the personal exemption fallacy: if somebody produces a sentence that your theory cannot account for, to save the theory you may simply declare ‘that sentence is ungrammatical’. Whether that statement is true or not, its effect is to dismiss evidence purely on the grounds of your personal, subjective assessment of it. In scientia realis, such a fallacy is much more difficult to sustain – while we might assert that such a sentence could never be spoken, for example, if it is spoken, and if there are no clear signs that hearers of the sentence have the least bit of difficulty in accepting it, then the personal exemption fallacy cannot, at the very least, be easily deployed. In the face of a large body of evidence of felicitous usage of the sentence, the fallacy should be dismissed and the theory challenged. We will return to a further consideration of these issues in Chapters 2 and 5.
In sum, the point we have made here relates to the empirical grounding of corpus linguistics. In the first sketch of it, empiricism was identified as the source of knowledge in corpus linguistics, in contrast to some other influential approaches such as those based on the assumptions about language underlying the work of Chomsky and his followers. In our empirical approach, when we have a question about language, instead of introspection we look for patterns of language use in our data (corpora). So far, following the metaphor of laying foundations for corpus linguistics, we have carried out a geological survey identifying basic layers on which to lay foundations. Yet that focus on empirical data ushers in a discussion of another important feature of the scientific method as discussed by corpus linguists: replication – a repetition of an empirical study with the same research question but a different data set or tool.
1.7 The Notion of Replication
Corpus linguistics provides data that allow for repeatability – results from a study can be repeated if the data and the tools used to derive the results are available to the researcher. Repeatable results are an essential aspect of the scientific method – they provide researchers with some confidence, especially in the context of inductive reasoning, that the results observed and the consequent inferences based upon those results are not simply random chance events.
Replicability is an extension of repeatability. While the two concepts are often bundled together for ease of reference (e.g. Reference Freese and PetersonFreese and Peterson 2017), the distinction between repeatability and replicability is important. Repeatability is reproducing the same results as another researcher given the same data and tools. Replicability is the process of showing that the results are derivable given a different data set and/or different tools that are comparable to those used in another study. This distinction needs to be teased out a little. If we look at the one million-word BE06 corpus and explore the relative frequencies of could, should and must, we find that they occur 1,793, 891 and 538 times, respectively. To derive this result, we used a corpus (BE06) and a corpus search engine (#LancsBox). Using that data and that corpus search engine, we repeated the search twice and got the same results. This set of results is thus entirely repeatable. From it we may, by inductive reasoning, assume that in British English, could is more frequent than should, which in turn is more frequent than must. Yet is the result an artefact only of the data set we looked at? By looking at comparable data sets, other researchers can test whether these results are reproducible – if they are, then this will strengthen our belief in this conclusion. If we replicate this study in the written data in the British National Corpus 1994, we find could, should and must occur 139,674, 96,889 and 63,890 times, respectively. Our hypothesis about the rank ordering of these three words (hypothesis 1) has survived replication. A data set that is similar to the one we used in the first place revealed the same result. Note that this example also shows that we have choice in making more or less specific inferences. Our inference was quite general, only looking at rank ordering. 
Had we instead decided to say that the proportions of mentions would be stable, which would also have entailed a rank ordering, our hypothesis (hypothesis 2) would have encountered difficulties. In the BE06, for example, there are roughly three mentions of could for every one of must. In the BNC1994, the proportion is more like two mentions of could for every mention of must. So, when we make inferences, we have choices and when we test them, we should be mindful that we are testing the inferences made, not necessarily all the inferences that were in principle possible. We have thus come full circle to our initial definition of replication characterised by the same research question and a different data set or tool.
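The fates of the two hypotheses can be checked directly from the counts reported above (BE06: 1,793, 891 and 538 occurrences in one million words; BNC1994 written data: 139,674, 96,889 and 63,890 occurrences).

```python
# Frequencies of could, should and must as reported in the text.
be06 = {"could": 1793, "should": 891, "must": 538}
bnc  = {"could": 139674, "should": 96889, "must": 63890}

def rank(freqs):
    """Words ordered from most to least frequent."""
    return sorted(freqs, key=freqs.get, reverse=True)

# Hypothesis 1 (rank ordering) survives replication: same order in both corpora.
assert rank(be06) == rank(bnc) == ["could", "should", "must"]

# Hypothesis 2 (stable proportions) does not: the could:must ratio differs.
print(round(be06["could"] / be06["must"], 2))  # 3.33
print(round(bnc["could"] / bnc["must"], 2))    # 2.19
```

The coarser inference is robust under replication; the more specific one is not – illustrating that what survives a replication depends on exactly which inference was drawn.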
Popper, the philosopher of science who is our guide on our journey to explore the foundations of corpus linguistics and its relation to science, recognises the crucial importance of replication (he uses the term ‘reproducibility’). In the following extract from The Logic of Scientific Discovery, Popper contrasts the idea of ‘isolated coincidence’ with ‘regularity and reproducibility’ of events, which we can test and include in the domain of ‘scientific observations’. Only those events that are ‘inter-subjectively testable’ can, according to Popper, be taken seriously:
Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested – in principle – by anyone. We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.
This leads to our last principle for this chapter.
Principle 7: Corpus linguistics promotes and is based upon an intersubjectively observable approach to language in which results are repeatable and replicable.
So how does replication work in practice? There are, in principle, a number of ways to achieve replicability in corpus linguistics. Let us consider this point by returning to the example of could, should and must in the BE06. Our replication of that study relied on a different corpus – it represents British English from ten years or so before that in the BE06 and the balance of different types of writing within it differs from that in the BE06. The replication is imprecise and that may explain the variation we have seen in terms of proportions of mention. A more precise way of seeking to replicate the results would be for other researchers to build their own version of the BE06. It would be of the same size, collect texts from the same time period and follow the same sampling frame as the BE06. If they then find the same results, our findings are replicable. However, even in the natural sciences some variation in findings is permissible – we are, after all, sampling and engaging in ampliative reasoning with induction in this case. Once again statistics can be used to test whether a difference in the observations is significant or not. We will return to consider replication in more depth in Chapters 6 and 7.
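One widely used statistic of the kind just alluded to is Dunning’s log-likelihood (G2), a standard corpus-linguistic test of whether a frequency difference between two corpora is significant. The sketch below applies it to must; the BE06 size is one million words, while the size used for the BNC1994 written component (roughly 88 million words) is our assumption for illustration.

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Dunning's log-likelihood (G2) for one word's frequency in two corpora."""
    # Expected frequencies if the word were equally common in both corpora.
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:
            g2 += 2 * observed * math.log(observed / expected)
    return g2

# must: 538 occurrences in the 1m-word BE06 vs 63,890 in ~88m words of
# BNC1994 writing (corpus size assumed for illustration).
g2 = log_likelihood(538, 1_000_000, 63_890, 88_000_000)

# G2 above 3.84 indicates a difference significant at p < 0.05
# (chi-squared distribution, one degree of freedom).
print(g2 > 3.84)
```

On these figures the difference in the relative frequency of must between the two corpora is significant, so the variation between the samples is unlikely to be mere sampling noise.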
Through reproducibility we gain faith in the measurements we take and through replicability we gain confidence in the models, proceeding from (typically inductive) logic, that we build based upon observations. So sharing data and tools, while it may not be the first thing that occurs to corpus researchers, is one of the cornerstones of both corpus linguistics and the scientific method.
1.8 Conclusion
In this chapter, we have established that alignment to the scientific method is important for corpus linguistics. Corpus linguists, however, have often taken this as a given, or approached the alignment with a magpie eye for specific, attractive points to make, rather than with a view to encompassing a whole philosophy of science. We began the process of trying to formalise what a scientific framework would look like and considered how corpus linguistics might fit within it. In doing so, we began the main task of this book, to outline what we believe to be foundational principles that underpin and define work in the field. While we introduced these ideas, we also noted, at times, that our presentation of them was preliminary and that, as a consequence, we may need to revisit some of these principles as we both provide more nuance and depth to the ideas outlined earlier and introduce new ideas from the philosophy of science. At the end of the chapter, we touched upon replication, something that we will return to again and again. It also forms a main part of the empirical work presented in Chapter 7. To move on with the process of outlining the scientific framework within which we are considering corpus linguistics, in the next two chapters we will pick up the thinking of Karl Popper and explore his works on science. By the end of Chapter 3 we will have outlined further useful principles of corpus linguistics, but we will have also shifted the discussion of the relationship of corpus linguistics to science in a decidedly social direction. So, the next step in our discussion, to follow our metaphor, is to build pillars, which will extend downwards into the ground to offer stability to our building. However, as identified in the geological survey undertaken so far, the ground is not homogeneous; on the contrary, it contains variations in the structure, crystalline imperfections and impurities, if you will, as was apparent in the work of our Martian scientists. 
Imperfection and variation are thus at the heart of empirical evidence. Variation is present in both the natural and the social world and any scientific epistemology needs to account for it. So far, we have touched briefly upon extreme rationalist approaches to language that downplay the crucial role of variation, viewing it as an unimportant artefact of the empirical method. Leaving aside these views that deal with variation by bracketing it away, we will focus in this book on approaches that do not ignore but engage with variation.