Statistics is a powerful analytical tool. This book has demonstrated a number of different ways in which statistical techniques can be used to explore corpora. The robust evidence found in these electronic collections of language offers countless possibilities for both linguistic and social research, providing a unique insight into patterns of language use. Statistics, if applied appropriately, can facilitate the process of analysis by serving as a zoom lens through which we can observe linguistic reality: the details of individual examples of language use as well as the larger picture of grammar, vocabulary and discourse. We need to remember, however, that the lens should always be a transparent one: what we want to observe is not the tools themselves – a showcase full of sophisticated statistical techniques – but the linguistic data. Our analysis should thus always focus primarily on the data and should take the data seriously; if our beliefs and theories are contradicted by the data, we shouldn’t simply dismiss the data as ‘inconvenient’ evidence (or hide it behind complex statistical jargon) but, on the contrary, we should engage with it, seeking to genuinely understand and explain the findings. Only in this way can our investigation be meaningful and truly scientific.
Mastery of statistics is empowering. However, as statistical tools and analyses become more complex and sophisticated, they can also become rather daunting for users. This is because statistical analysis involves many choices. Among other things, we need to select a suitable corpus, an effective analytical technique and an appropriate interpretation of the results. These decisions can often feel challenging, especially for novice researchers. The growing demand in corpus linguistics for statistical sophistication and the lack of appropriate resources for beginner and intermediate users of statistics can thus easily lead to frustration. This book (together with Lancaster Stats Tools online) hopes to be a resource addressing this issue by offering readers a guide for making informed choices about statistics in language analysis. The main message of the volume is twofold. First, statistics is not about number crunching or remembering equations (computers are much better at these tasks than humans) but about understanding the core, underlying principles of quantitative analysis. Second, I would like to encourage readers not to let themselves be overwhelmed by complex statistical techniques or the newest fads on the statistical marketplace. Every summer, a large number of students from different parts of the world come to Lancaster to learn about corpora and statistics during a week of Lancaster summer schools in corpus linguistics. These students often ask me what the best statistical test is to use with corpora, what the best collocation measure is, and so on. I usually respond: in many cases, the most powerful statistical technique is common sense.