Data and Methods in Corpus Linguistics

Corpus linguistics continues to be a vibrant methodology applied across highly diverse fields of research in the language sciences. With the current steep rise in corpus sizes, computational power, statistical literacy and multipurpose software tools, and inspired by neighbouring disciplines, approaches have diversified to an extent that calls for an intensification of the accompanying critical debate. Bringing together a team of leading experts, this book follows a unique design, comparing advanced methods and approaches current in corpus linguistics, to stimulate reflective evaluation and discussion. Each chapter explores the strengths and weaknesses of different datasets and techniques, presenting a case study and allowing readers to gauge methodological options in practice. Contributions also provide suggestions for further reading, and data and analysis scripts are included in an online appendix. This is an important and timely volume, and will be essential reading for any linguist interested in corpus-linguistic approaches to variation and change.


data, it can collect more tokens for each lexeme (p. 39) and can make the results of sample analysis more stable. Second, its broad range of data types makes it more suitable for studying differences between languages, which cannot be done with small data sets. The different information obtained by using the BNC and COCA is highlighted. The authors also point out that combining these two databases can help researchers understand their data more comprehensively (p. 4).
The second section centers on a specific case: the "Principle of Rhythmic Alternation" (PRA). It points out that the PRA is often considered a driver of different types of phenomena, whose properties require different corpus approaches and involve different types of data. The impact of the interaction between data and method is presented at the end of the section. For example, among studies based on corpus data, written or orthographically transcribed spoken data are used more widely than spoken data with access to audio recordings (p. 65). The results also show that the 'idiom principle' or the concept of 'lexicalized sentence stems' may be relevant at the phonological level (p. 67). For this reason, combining different data sources and methods is a good choice, which echoes the conclusion of the section on using both large and small databases.
Chapter 2 puts forward two questions: 'what goes into a corpus' and 'what goes into an analysis'. A corpus is a huge collection of texts, and it is also a sample (Biber, 2011). However, a corpus is not equal to the language, no matter how many sentences it contains (Jones & Waller, 2015), so the hierarchical structure of the corpus needs to be taken into account (p. 4). The main issue discussed in this chapter is how corpus data are prepared for research and how they are analyzed. The chapter is divided into three parts. The first, by Fabian Vetter, explains the design of experiments to explore differences between corpus registers. The analysis also shows that differences between them may be due to the different sampling strategies adopted by the corpus compilers. For the compilation of future corpora, Vetter recommends annotating texts with situational characteristics (p. 98). The second section is about the passive voice. Selecting different baselines in turn, Sean Wallis and Seth Mehl stress that normalized frequencies often fail to yield meaningful measures (p. 5). A baseline indicating opportunities of use is therefore vital to make the data reliable. At the end of the section, the advantages and disadvantages of three different baselines are set out. The last section is by Lukas Sönning and Manfred Krug, who call for richer metadata and elaborate on the benefits of linking corpus data to speakers.
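The contrast between a normalized frequency and a baseline of opportunities of use can be sketched with a short calculation. This is only a minimal illustration: the counts are invented and are not taken from the book, and the function names and the notion of "opportunities" as contexts in which a passive could have been chosen are assumptions made for the example.

```python
def per_million_words(hits: int, corpus_words: int) -> float:
    """Normalized frequency: hits per million words of running text."""
    return hits / corpus_words * 1_000_000


def share_of_opportunities(hits: int, opportunities: int) -> float:
    """Proportion of hits relative to a baseline of opportunities of use."""
    return hits / opportunities


# Invented counts for two subcorpora (illustration only, not real data).
# "opportunities" is a hypothetical count of contexts in which a passive
# could have been chosen instead of an active.
subcorpora = {
    "A": {"words": 1_000_000, "passives": 2_000, "opportunities": 20_000},
    "B": {"words": 1_000_000, "passives": 1_000, "opportunities": 10_000},
}

for name, counts in subcorpora.items():
    pmw = per_million_words(counts["passives"], counts["words"])
    share = share_of_opportunities(counts["passives"], counts["opportunities"])
    print(f"{name}: {pmw:.0f} passives per million words, "
          f"{share:.1%} of opportunities")
```

With these invented counts, the per-million-word rate of subcorpus A is twice that of B, yet the share of opportunities is the same in both. The per-word difference would then reflect how often the relevant context occurs, not a stronger preference for the passive.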
Chapter 3 mainly concerns how various researchers use statistical methods to evaluate the influence of specific factors in context, and it assesses the advantages, disadvantages, and limitations of these methods. In addition, the chapter discusses regression analysis and distance-based visualization. For example, in the first section, Tobias Bernaisch compares, among others, Generalised Linear Mixed-Effects Models, and concludes that even for the same set of data, different models will produce different results. As another example, in the third section, Natalia Levshina compares the standard frequentist approach with the more recent Bayesian approach. She focuses on multiple logistic regression with mixed effects and argues that Bayesian statistics could effectively stabilize data and reduce the amount of data preparation. This could provide a powerful tool to overcome limitations and promote collaboration (p. 8).
The last chapter concerns the combination of corpus linguistics with computational methods and machine learning. Because the content of a corpus can be processed using different types of annotation, this combination can produce innovative methods that better serve the tasks required by corpus linguistics. For example, in the first section, Gerold Schneider uses a corpus-driven approach to study changes in English grammar, identifying words that combine in a specific form or grammatical trends (Biber et al., 2010). It is clear that corpus linguistics is not an independent discipline: it can draw inspiration from other fields to generate new research hypotheses.
Overall, this book is a useful contribution to the development of corpus linguistics. Through different case studies, it applies different types of methods and discusses the principles behind them in depth, so that anyone who wants to take part in corpus-linguistic research can weigh the pros and cons of different corpora and data processing methods and make better choices. Readers can gain an in-depth understanding of essential issues and the latest research methods in related fields. One positive aspect is that corpus linguistics and computer science are combined in the second half of the book, providing researchers in linguistics and other disciplines with rich research ideas and methods. The book is an indispensable reference in the field of corpus linguistics and has significant value for scholars, teachers, and students engaged in language research.