
7 - Why Use Statistics?

Published online by Cambridge University Press:  19 June 2025

Sali A. Tagliamonte
Affiliation:
University of Toronto

Summary

This chapter will provide a step-by-step procedure for setting up an analysis of a linguistic variable. It will detail the procedures for coding, how to illustrate the linguistic variable, and how to test claims about one variant over another.


7 Why Use Statistics?

Box 7.01

This chapter focuses on statistical modelling and first asks why we should use statistical modelling of linguistic variables. It will cover the terms ‘factor weight’, ‘p-value’, ‘coefficient’, ‘sum coding’, and ‘treatment coding’, and describe what they mean.

This chapter will also be an initiation to doing variation analysis with R. I will take the reader through the basics, including installing packages, loading libraries and data files, and adjusting factor groups.

‘The fact that grammatical structures incorporate choice as a basic building block means that they accept probabilisation in a very natural way, mathematically speaking’ (Sankoff, D., 1978: 235). This chapter is designed to introduce you to statistical modelling and the study of linguistic variables. Importantly, I address the issue of why statistical modelling should be used at all.

In earlier times, studies in variationist sociolinguistics used an in-house program called the variable rule program. The program was written in the 1970s (Cedergren & Sankoff, D., 1974), was later recast for personal computers in the 1990s (Rand & Sankoff, D., 1990; Robinson et al., 2001; Sankoff, D. et al., 2005), and was updated to be compatible with the R environment as RBRUL (Johnson, 2009). For an inside story see chapter 6 in Tagliamonte (2016a). It can still be found on some websites, including my own, but be warned: GOLDVARB may not be compatible with the latest operating systems and the software is no longer being updated.


The discussions of GOLDVARB in the earlier edition of this book (Tagliamonte, 2006a) remain explanatory and valid for use of that toolkit; however, in the interests of advancing the field and adding updated tools to the LVC toolkit, I will use the R software and demonstrate the utility of the R environment for the study of language variation and change. R is a free software environment for statistical computing and graphics that has become the new gold-standard platform for modelling variation since the late 2000s (R Core Team, 2020). Construct your own ‘Analysing Variation Cookbook’ by copying and pasting the code used in the book into an RMD file. The procedures I will use will be fundamental, not comprehensive. They are intended as a clear and effective guide to doing a thorough variation analysis but should not be taken as exhaustive. If you wish to learn more and advance your knowledge, Google it!

Theory

Much of what is challenging about quantitative methods comes from the highly technical descriptions of statistical practice. Like many things that involve numbers, reading about statistics often incites a negative response. However, statistics is an incredible tool for conducting sophisticated analyses and for exploring complex data, helping you to make sense of it, and even for simply organising it.

My advice in the first edition of this book was the same as it is now: once you get the hang of how a statistical platform works, you will find that there is nothing better than having every minute detail of your data at your fingertips and assembled in a way that makes it maximally accessible and analysable. After the challenges of fieldwork, and months and often years of transcribing, extracting, and coding data, the joy of running your first counts of tokens, rates, and proportions to find out what is going on in your data (Chapters 9 and 10) cannot be overstated.

Statistical Modelling

The use of logistic regression in variationist sociolinguistics began with the variable rule program. It was developed in-house by the combined efforts of multiple researchers in the field through several rounds of technical improvements.

A great deal has been written about the value of the variable rule program and further explanation about its use can be found in papers from the 1970s and 1980s (Cedergren & Sankoff, D., 1974; Sankoff, D. & Labov, 1979; Sankoff, D. & Rousseau, 1979; Sankoff, D., 1988c). This historical background is useful in elucidating the reasoning and justification for statistical modelling (see also Tagliamonte, 2016a) and much of it is as valid in the 2020s as it was then. The main change is that statistical modelling has advanced considerably and, at the time of writing, new modelling techniques and methods in statistics have provided variation analysts with a rich toolkit for analysing linguistic data.

For background, you can find many early discussions about using statistics for variation in the literature, in particular Sankoff, D. (1988c). Subsequent digests of variable rule methodology can be found in Guy (1988, 1993), whose aim was explicitly to demystify what had at earlier times been passed on by ‘word of mouth’. The same can be said of Young and Bayley (1996) and Bayley (2002). Paolillo (2002) is perhaps the most detailed, with a focus on statistical terms and explanations. It is important to remember that the variable rule program was developed during a particular intellectual climate in the study of language variation and change in the 1960s, when ‘rules’ were the prevailing conception of linguistic structure and few, if any, social science or humanities disciplines were using quantitative techniques.

From the 1960s to the early 2000s the variable rule program was considered the most appropriate method available for conducting statistical analysis on natural speech (Sankoff, D., 1988c: 987). However, the choice of optimal statistical tool for analysing linguistic variation has a long history of controversy. Early research questioned whether statistics was a valuable asset for the study of variation at all (Bickerton, 1971, 1973; Kay, 1978; Kay & McDaniel, 1979; Downes, 1984). The use of statistics was questioned again when researchers started studying morphosyntactic variables (Rickford, 1975; Lavandera, 1978) and again with the advent of discourse-pragmatic studies (Cheshire, 2005).

Beginning about 2007, the use of the variable rule program itself was called into question (e.g. Johnson, 2009) and the debate over which statistical method is most appropriate raged. GOLDVARB (in all its guises) is simply an implementation of a generalised linear model for the analysis of data with two discrete variants (a binary variable). It can model the combined effect of multiple independent, orthogonal factors. Other statistical packages (e.g. SAS, SPSS, R) offered comparable models.

Box 7.02 NOTE

A variable that has two or more categories is a ‘factorial variable’ or ‘categorical variable’. The levels of the variable are referred to as ‘factors’ in variationist research. Factorial variables are categories or groups. They have no inherent order.

Since the 2000s generalised linear mixed effects models have become increasingly popular due to their flexibility in modelling complex data and their ability to handle subtle differences among internal and external factors (e.g. Bates, 2005a; Baayen, 2008; Baayen et al., 2008; Jaeger, 2008; Johnson, 2009). Such models entered variationist studies through statistical packages such as RVARB (Paolillo, 2002), RBRUL (Johnson, 2010), and R (R Core Team, 2007). These software packages are readily available and personal computers have developed sufficient power to run the new models effectively; however, many researchers do not have the background to make informed decisions about how to use the new techniques most effectively. Indeed, the ‘tool’, that is, a generalised linear model versus a generalised linear mixed model, is often confused with the ‘toolkit’, namely GOLDVARB versus SPSS, SAS, or R.

Box 7.03 NOTE

Variationist practice uses the terms ‘factor group’ and ‘factor’. Other disciplines use the term ‘predictor’ to cover both factorial and numerical (continuous) predictors. A factorial predictor is one that is unordered, such as noun vs verb; a numerical predictor is one that is ordered, such as year of birth. The values of a factorial predictor are also referred to as ‘levels’.

A History of Statistics in Language Variation and Change

What were referred to as ‘variable rules’ in the early days of variation analysis were founded in the notion of ‘orderly heterogeneity’ (Weinreich et al., 1968: 100), the idea that variation in language is not random or free, but systematic and rule governed.

The analysis of speech behaviour has repeatedly revealed that the possibilities represented by abstract optional rules are distributed in a reproducible and well-patterned way in each individual and in each speech community (Cedergren & Sankoff, D., 1974: 333).

Variable rules were first introduced by Labov (1969a), arising from his fundamental observation that individuals make choices when they use language and, further, that this choice is orderly. Owing to the systematicity of the process, the relative frequency of use can be predicted. The variable rule provided an accountable, empirical model for this phenomenon, thus introducing a probabilistic component into the model of language.

To some researchers, the introduction of statistical concepts was a natural and logical addition to the study of inter-individual, dialectal, and historical variation in language. However, the idea of ‘probability’ in language was met with intense criticism: ‘though “probability of a sentence (type)” is clear and well-defined, it is an utterly useless notion’ (Chomsky, 1957: 195).

There are fundamental epistemological questions involved. Does choice exist in linguistic competence? Some people argue yes; some argue no. This book is not the place for such debates (for further discussion, consult, e.g., Sankoff, D., 1988b). Instead, I will focus on the development of probabilistic theory in sociolinguistics and the mathematical issues that led to the original formulation of variable rules and, as the field developed, to linguistic variables and statistical modelling.

In his study of contraction and deletion of the copula, Labov (1969a) made an interesting discovery – the choice process operates regularly across a wide range of contexts, both external and internal: ‘we are dealing with a set of quantitative relations which are the form of the grammar itself’ (p. 759). Cedergren and Sankoff, D. (1974: 336) elaborated on the mathematical significance of this discovery, showing that ‘the presence of a given feature or subcategory tends to affect rule frequency in a probabilistically uniform way in all the environments containing it’. Thus, a broader (statistical) generalisation was made. If a given feature tends to have a fixed effect independent of the other aspects of the environment, then this can be formulated mathematically.

However, statistical procedures such as analysis of variance, or ANOVA, were unsuitable for language data. It was necessary to construct ‘probabilistic extensions of the extant algebraic linguistic models’ (Sankoff, D., 1978: 219). To model a grammar that has heterogeneity with contextually conditioned ‘order’ to it as well as innumerable blank regions, a mathematical construct had to be devised that would suitably mirror it.

Where Did Variable Rules Come From?

The application of statistical methods to linguistic variables was first developed as a quantitative extension of generative phonological analysis and notation (Labov, 1969a, 1969b, 1972b; Cedergren & Sankoff, D., 1974). In the early descriptions, variable rules were presented as formal expressions compatible with the apparatus of formal language theory of the time (Chomsky, 1957), that is, ‘rules’. However, the reference to variation as ‘rule’ has more to do with variation being systematic (i.e. rule-governed) than with any specific formalism. Indeed, variable ‘rules’ do not necessarily involve rules at all (Sankoff, D., 1988b: 984). This terminology is an inheritance of its early contextualisation within formal theories of language, which at the time involved rules. In fact, variable rule is a misnomer; variable rules are actually ‘the probabilistic modelling and the statistical treatment of discrete choices and their conditioning’ (Sankoff, D., 1988b: 984).

The prerequisites for logistic regression, using Sankoff, D.’s (1988b: 984) terms, are (1) choice, (2) unpredictability, and (3) recurrence. First, the analyst must perceive that there is ‘a choice between two or more specified sounds, words or structures during performance’. Second, the choice must seem haphazard based on known parameters, that is, not entirely predictable; not deterministic. Third, the choice must occur repeatedly in discourse. Given these conditions, statistical inference can be invoked.

The apparent randomness of the choice process makes it appear that the variation has no structure and has many more exceptions than it really does. Statistical inference by its very nature extracts regularities and tendencies from data presumed to have a random component. To accomplish this, the inference procedures must be applied to some sample containing the outcomes of the choice repeated many times (your token file/data) and usually in a variety of contexts, each context being defined as a specific configuration of conditioning factors (your coding schema). In variationist terminology, the choice of one variant over the other is the ‘dependent variable’. The independent variables, features of the linguistic or extralinguistic context which impinge on the choice of one variant over the other, are the ‘factors’ or ‘factor groups’. In other disciplines, all of them are variables or predictors.

The Null Hypothesis

How can we distinguish those factors which have a genuine effect from those whose apparent contribution is an artefact of the data sample? The starting point is the null hypothesis, the idea that no genuine effects exist in the data. Statistical methods are used to distinguish bona fide contrasts and trends from accidental data patterns due to statistical fluctuation, often referred to as random error, or ‘noise’. To establish that a real effect exists, the null hypothesis must be falsified. In the variationist approach to language, the type of data which interests us is ‘choice frequencies in contexts made up of cross-cutting factors’ (Sankoff, D., 1988b: 987). The null hypothesis is that none of the factors has any systematic effect on the choice process and that any differences in the choice outcome among the various contexts are to be attributed to statistical fluctuation. As Sankoff, D. (1988c: 987) says, ‘if we can prove that random processes alone are unlikely to have resulted in the pattern of usage rates observed, we may be able to attribute this pattern to the effect of one or more of the factors’. How can we identify systematic deviations from randomness? This is where logistic regression analysis is key.

Sankoff, D. (1988c: 987–992) elaborates on the use of logistic regression for the analysis of linguistic variables. I simplify greatly in my overview of this process here and draw heavily on Sankoff’s description. More recent explanations can be found in Paolillo (2017) or from outside variationist linguistics (Hayes, 2022). Readers who have no need to understand the logic behind the mathematical algorithm may skip to the section on practice.

Models and Link Functions

Separating the effects of different contextual factors requires knowing how they jointly influence the choice process in each context (Sankoff, D., Reference Sankoff, Ammon, Dittmar and Mattheier1988c:987). The simplest way of combining effects is additive, and this is the model that Labov had originally used. However, the additive model is not useful for situations in which ‘the application frequencies are very different in different environments, or when there are many different environments’ (Cedergren & Sankoff, D., Reference Cedergren and Sankoff1974:337) – exactly how language always is! Therefore, the simple additive model that is often used for statistical procedures (e.g. the analysis of variance) is not appropriate for sociolinguistic data analysis.

In typical natural language performance, there are many cross-cutting factors. An additive model applied to such data – say, the combined effect of preceding phonological segment, grammatical category, and following phonological segment – may well produce percentages of more than 100 and below 0. As Sankoff, D. (Reference Sankoff, Ammon, Dittmar and Mattheier1988c:988) points out, ‘such “impossible” predictions’ are a major problem. ‘The solution is to use a model where the sum of the factor effects is not the predicted percentage of a given choice, but some quantity related to this percentage’ (p. 988) – the link function. This function is such that it can take on any value without the risk that the corresponding percentage will be less than 0 or more than 100. In the formulation of the variable rule program, the link function was the logit of the percentage (Sankoff, D., Reference Sankoff, Ammon, Dittmar and Mattheier1988c:988). The logit has two properties which make it superior to other link functions: (1) the predicted percentage always lies between 0 and 100 – this condition does not automatically hold for other link functions; (2) it is symmetrical with respect to binary choices. It doesn’t matter which value is the application value; the model has the same form. This logit link function (i.e. logit-additive model) underlies the variable rule program. For more information, read the section ‘Models and Link Functions’ in Sankoff, D. (Reference Sankoff, Ammon, Dittmar and Mattheier1988c:987–989).
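To make the effect of the link function concrete, here is a minimal sketch using the base R functions qlogis() and plogis(), which compute the logit and its inverse. The proportions are illustrative; 0.355 echoes the overall rate of the aspirated variant reported later in (3b).

    # logit: map a proportion (0-1 scale) onto an unbounded log-odds scale
    p <- c(0.10, 0.355, 0.90)
    qlogis(p)                   # log(p / (1 - p))
    # add an illustrative factor effect on the logit scale, then map back:
    plogis(qlogis(p) + 1.5)     # the result always stays between 0 and 1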

The Likelihood Criterion

How do we find a set of values which best accounts for the observed data? In statistics, how well a model with given factor effects fits a data set can be measured by several criteria. Logistic regression uses the likelihood criterion because it can account for the extreme distributional imbalances, including contrasting full versus near-empty cells, in corpus-based data. In the second half of this chapter, you will see some practical examples of what such ‘lumpy’ data look like.

The likelihood measure indicates how likely it is that a particular set of data has been generated by the model which has the given values for the factor effects. Different sets of factor effects will have different likelihood measures for the same set of data. The principle of maximum likelihood provides a means to choose the set of values which is most likely to have generated the data (Sankoff, D., Reference Sankoff, Ammon, Dittmar and Mattheier1988c:990). As we shall see in the second half of the chapter, the likelihood criterion is critical for establishing which combination of factors is the best ‘fit’ of the model to the data.
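As a toy illustration of the likelihood criterion, consider the overall (hwat) distribution reported later in this chapter: 703 aspirated tokens out of 1,980. Under a simple binomial model, the log-likelihood of those data is highest when the model’s probability matches the observed rate.

    # log-likelihood of 703 'successes' in 1,980 tokens under three candidate rates
    candidates <- c(0.25, 0.355, 0.50)
    dbinom(703, size = 1980, prob = candidates, log = TRUE)
    # the observed rate (35.5%) gives the largest value: it is the maximum-likelihood estimate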

The estimation of maximum likelihood is carried out by logistic regression. This type of analysis is not unique to linguistics. In fact, many statistical packages can do it. However, the type of data used in variationist research on language is very different from the data in any other field of study: (1) it is based on language in use, often in highly vernacular registers; and (2) it is badly distributed by nature. The logistic regression embodied in the variable rule program was developed within the field during a time when it was the ideal option for analysis of this type of ‘messy’ data and for calculating results ‘in a form most useful in these studies’ (Sankoff, D., Reference Sankoff, Ammon, Dittmar and Mattheier1988c:990). The groundswell of new developments in statistics, data science, and computational linguistics in the early twenty-first century has had a profound effect on linguistics generally and on analyses of variation across subdisciplines (see e.g. Hayes, Reference Hayes2022).

Although the variable rule program was an ideal tool for its time, using R for statistical modelling offers the analyst many more ways to explore and analyse variation. Of course, statistical analysis does not explain the variability in the data nor its origins. Moreover, there is also no assurance that statistical significance is linguistically meaningful (see e.g. Dion, 2023: 250); indeed, ‘the relevant threshold for a linguistically significant difference – even if statistically significant – is unknown, given fluctuations in overall rate due to situational considerations’ (Torres Cacoullos & Travis, 2021: 291). Statistical modelling, whichever way we employ it, ‘only performs mathematical manipulations on a set of data. It does not tell us what the numbers mean, let alone do linguistics for us’ (Guy, 1988: 133).

The choice mechanism for analysis could originate in the grammatical generation of sentences, in the processes of production and performance, in the physiology of articulation, in the conscious stylistic decision of individuals, or even in an analytical construct on the part of the linguist. On the other hand, the linguistic significance of the analysis does, of course, depend on the nature of the choice process. This is where the important interpretative component of variation analysis comes in. The question of the linguistic (structural) consequences of the choice process must be addressed prior to the formal, algorithmic, statistical procedures. This is done in the data collection (Chapter 2), the decision about which choice is to be studied (Chapter 5), the definition of what is to be considered the context (defining the variable context), and the coding of the data (Chapter 6). In the end, it is the relevance of the choice process to linguistic and social structures that must inform the discussion, interpretation, and explanation of the results (see Chapters 11 and 12).

The Choice Process in Linguistic Data

To understand the need for linguistic variables, it is necessary to return to the nature of linguistic data. Language gives us options (see Chapter 1). This is the fundamental starting point for employing statistical analysis on language data.

When is it appropriate to invoke statistical notions and methods? Whenever a choice among two (or more) discrete alternatives can be perceived as having been made during linguistic performance and this choice may have been influenced by linguistic or social factors. These factors can be features in the phonological environment, the syntactic context, the discourse function of the utterance, topic, style, interactional situation, cognitive or socio-demographic characteristics of the individual, or other influences. Labov’s contention was that this variation is part of an individual’s linguistic competence. But until we can view these choices statistically, natural speech data often look like a big mess, that is, ‘apparent randomness’. This is the key criterion for a probabilistic model (‘apparent’ being the operative word).

Examining the Choice Process

What does this choice process look like in practical terms? Let us consider the ‘marginal results’ for variable (hwat) and variable (adj_pos). Marginal results, aka ‘comparison of marginals analysis’ (Rand & Sankoff, D., Reference Rand and Sankoff1990:4) or ‘empirical results’ in some circles, refers to the relative rates (percentages) of the variant forms of the dependent variable according to the independent variables that have been coded into the token file. The marginals expose the overall rate of variants, their rate in each context and in cross-tabulation with other factor groups. You can also use the marginals to assess the productivity of each variant overall and in each context. Depending on the nature of the variable context, it may be important to probe the dispersion or diffusion of variants across individuals, linguistic factor groups, points in time, and so on. Factor-by-factor distributions will be covered in Chapter 8.

As I move into the steps for studying the linguistic variable, keep in mind that my goal is to elucidate how an analyst can do variation analysis using the R toolkit, not to conduct a bona fide study of the linguistic variables. There will be repetitions and redundancies for clarity and alternative ways of looking at data. If you wish to conduct any of the analytic steps in an alternative way, you can adjust and revise the code as you see fit. Follow your inclinations. Further information about my observations, interpretations, and explanations of the exemplar variables can be found by consulting the refereed articles (Tagliamonte & Pabst, 2020; Needle & Tagliamonte, 2022). I now turn to what have become my tried-and-true methods for doing variation analysis with R. For practical steps describing how to load the data files into R, see the section ‘Beginning an R Session’.

The first step in understanding the linguistic variable is to count the total number of tokens in your data file. This calculation provides the denominator for the overall distribution of variants. The count function in (1) provides the total number of rows (i.e. tokens) in the data files that have been loaded into R as ‘hwat’ and ‘adj’.

(1)

    hwat %>% count()
    adj %>% count()

The code returns two simple tables, called ‘tibbles’, which are an enhanced type of data file that make data manipulation and visualisation more straightforward than tables produced with other packages. Each tibble has one row with an ‘n’ column which shows the count of all tokens (2). Note that data files are also referred to as ‘data frames’ in R parlance. A data frame simply refers to a file that is organised in rows and columns.

(2)

    # A tibble: 1 × 1
          n
      <int>
    1  1980
    # A tibble: 1 × 1
          n
      <int>
    1 3378

Box 7.03 NOTE

A ‘function’ is a set of instructions that performs a specific task. Functions have the syntax ‘function_name( )’, with any arguments placed inside the parentheses, for example ‘count(dep_var)’. A ‘command’ is a broader term that refers to any line of code or expression that issues an instruction in R.

Second, determine the ‘overall distribution’ of the dependent variable. This is the relative frequency of each variant of the dependent variable without consideration of any other factor groups. A variable treated as binary is straightforward. Given a data file named ‘hwat’, with the dependent variable labelled ‘dep_var’, we can calculate the overall frequency of the variants by using ‘mutate’. The percentages are calculated with a mathematical formula, ‘(n/sum(n))’, multiplied by 100, which provides the percentage for each variant out of the total number of tokens, (3a), producing the output in (3b).

(3a)

    hwat %>%
    count(dep_var) %>%
    mutate(hw_pct = (n / sum(n)) * 100)

(3b)

    # A tibble: 2 × 3
    dep_var n hw_pct
    <fct> <int> <dbl>
    1 w 1277 64.5
    2 hw 703 35.5

The dependent variable of the hwat data has been given mnemonic labels for the two variants, ‘hw’ and ‘w’. At the top of the table, you see ‘hw_pct’, the rate of each variant and ‘n’ the total number of tokens of each variant. Where are the total N’s? We output that calculation with count in (1) and determined that there are 1,980 tokens of the variable overall; now we know that 703 are hw and 1,277 are w. Thus, the overall distribution of the aspirated variant is 35.5% compared to the glide variant at 64.5%.

A linguistic variable that has multiple variants, such as the adjectives of positive evaluation, variable (adj_pos), requires additional consideration. The variable is defined as all adjectives of positive evaluation in the data, and there are many forms. Let us begin by finding out how many adjectives of this type there are, and which ones are used. Given a data file named ‘adj’, with the dependent variable labelled ‘dep_var1’, which includes all the adjectives included in the study, we can assess the variants and decide how to proceed with analysis. Use count to list all the variants (4a). In (4b) several different ways of visualising the count are shown with the useful function ‘arrange’, which can reorder the rows in the data file by column names. When ‘-n’ is specified, the order will be in descending frequency. Example (4b) has three sections. Each section begins ‘adj %>%’. Run each section separately.

(4a)

    adj %>% count(dep_var1)

(4b)

    adj %>% count(dep_var1) %>%
    arrange(-n)
    adj %>% count(dep_var1) %>%
    arrange(-n) %>%
    print(n=35)
    adj %>% count(dep_var1) %>% arrange(-n) %>%
    write_tsv("adj_count_words_5-5-23.tsv")

The result of counting the contents of dep_var1 in (4b) will produce a tibble with ten rows, the default output, showing the name of each adjective and the number of tokens of that adjective in the data file, (4c). With arrange and the argument (-n), the adjectives are output in descending order of frequency. Recall that labels in the data file and the code must match exactly.

(4c)

    # A tibble: 34 × 2
    dep_var1 n
    <fct> <int>
    1 cool 905
    2 great 884
    3 intensifier+good 687
    4 amazing 340
    5 wonderful 230
    6 awesome 214
    7 lovely 80
    8 exciting 68
    9 incredible 63
    10 excellent 42

Once you know how many adjectives there are, you can add ‘print(n = xx)’ so that all the adjectives will be listed in the tibble (not shown). Adding the ‘write_tsv’ command enables you to save this file. Give the file a mnemonic label so you can find it easily, in this case, an informative label with the date (i.e. ‘adj_count_words_5-5-23.tsv’). What is the least frequent adjective of positive evaluation? Find out.

The results from printing out all the adjectives will reveal that very few of them occur with robust frequency; most forms are rare. This is a typical distribution of types and is known as a Zipfian distribution. Since statistical modelling using logistic regression requires a binary response variable, the adjectives must eventually be collapsed for modelling. At first glance the ideal way to do that is not entirely clear. The decision about how to adjust the data for statistical modelling comes from data exploration (see also Chapter 9).

To probe the patterns of adjectival use further, you could decide to ‘lump’ the infrequent forms and focus on the main variants. Notice that there seems to be a natural cut-off after the top six: cool, great, intensifier + good, amazing, wonderful, awesome. Let’s create a new factor group, ‘dep_var6’, in which we will treat the six most frequent forms independently and lump all the remainder together as ‘other’, (5a). The function ‘fct_lump_n’ lumps factors in a factor group based on frequency of occurrence. You could just as easily keep the seven most frequent forms, (5b). Notice that if you give the new factor group and the ‘other’ category a distinct label, you will be able to explore both configurations.

(5a)

    adj <- adj %>% mutate(dep_var6 = dep_var1 %>%
    fct_lump_n(6, other_level = 'Other6'))

(5b)

    adj <- adj %>% mutate(dep_var7 = dep_var1 %>%
    fct_lump_n(7, other_level = 'Other7'))

The question is how far do we go? How do you know what will be the best configuration? My advice is to keep investigating the data. Which adjectival choices show interesting patterns by independent variables? In the case of adj_pos, we discovered that the variants differed by community. In two communities (Toronto, Canada and York, England), the main variants were the same; however, minor variants differed dramatically. In York lovely stood out due to its relatively high frequency and correlation with women with less education. In Toronto, awesome and cool stood out due to their emergence in the later decades of the twentieth century among youth (Tagliamonte & Pabst, Reference Tagliamonte and Pabst2020).

Other simple explorations of the tokens of factors in factor groups are easily done with count. The code in (6) is also versatile across factor groups (i.e. x, y, z). Here the ‘x, y, z’ can be replaced by whatever factor group you wish, for example adj %>% count(gender), adj %>% count(type), and so on.

(6)

    adj %>% count(word, x, y, z …)

Issues with the Variable Rule Program

The GOLDVARB series of applications offered two ways of conducting analyses of variable data. The main tool was a step-up/step-down logistic model. This method supplied analysts with ‘three lines of evidence’: statistical significance, relative strength, and constraint ranking of factors, all of which were instrumental for interpreting the data. However, as statistical tools advanced, the shortcomings of the variable rule program became problematic. First, it could not account for the behaviour of individuals, a critical (and erstwhile hidden) aspect of variationist sociolinguistic analysis (Paolillo, 2013; Labov, 2019). Second, it could not handle continuous variables. Third, it could only tell the significance of the factor group, not the individual factors. Fourth, it could not easily incorporate interactions into the model (but see e.g. Sigley, 2003; Paolillo, 2011). Moreover, the three lines of evidence, so critical for interpreting results, had critical flaws: the threshold for statistical significance was fixed at the <.05 level, the measure of relative strength was not statistically valid, and the use of factor weights to measure probability was unknown outside of sociolinguistics.

Factor Weights vs Coefficients

The variable rule program used sum coding with factor weight values from 0 to 1. Factor weights are differences from the grand mean, repositioned around 50 per cent. Factor weights closer to 1 were interpreted as ‘favouring’ the application value, whereas those closer to zero were interpreted as ‘disfavouring’ the application value. The relative hierarchy of factor weights, vis-à-vis each other, was the relevant criterion for interpreting the factor weights. However, the variable rule program did not report the coefficients of the underlying generalised linear model, which are on the logit (log odds) scale. The use of factor weights, not coefficients, is one of the reasons why the variable rule program and its analyses were considered obscure within the sciences and social sciences. Statisticians argue that coefficients offer the analyst further possibilities for analysis. For example, the coefficients estimated for ordered factors can be used to evaluate whether trends across factor levels are linear or curvilinear (Tagliamonte & Baayen, Reference Tagliamonte and Baayen2012:172, fn. 5).
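The relationship between the two scales is easy to see in R. In this sketch the coefficient value is hypothetical; the inverse logit function plogis() converts a sum-coded coefficient (a deviation from the grand mean in log odds) into a GOLDVARB-style factor weight centred on 0.5.

    coef_logodds <- 0.62        # hypothetical sum-coded coefficient (log odds)
    plogis(coef_logodds)        # ~0.65, a weight that 'favours' the application value
    plogis(-coef_logodds)       # ~0.35, the complementary 'disfavouring' weight
    plogis(0)                   # exactly 0.5: no deviation from the grand mean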

Significance within Factor Groups

In the GOLDVARB series of programs, there was no automatic procedure for testing for significance within a factor group. Using lme4 in R makes this an easy process as the values are expressed as coefficients under treatment coding, a method also known as ‘dummy coding’ (one type of contrast coding). In this method, a set of binary variables is constructed to represent the categories of a factorial factor group that has more than two factors. The advantages of treatment coding are (1) the coefficients can be interpreted for unbalanced data sets and (2) the coefficients remain transparent when in interactions with other factors and with other factor groups in the model.
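A quick way to see what treatment coding does is to inspect R’s contrast matrices. In this sketch the three-level factor ‘word_class’ and its levels are purely illustrative; R’s default is treatment coding, with each non-reference level contrasted against the reference level (the first level alphabetically, unless you relevel).

    word_class <- factor(c("adjective", "noun", "verb"))
    contrasts(word_class)    # default treatment (dummy) coding: reference = 'adjective'
    contr.sum(3)             # sum coding, the GOLDVARB-style alternative, for comparison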

These updates to modelling contribute to the process of finding the ‘best’ analysis for your data. On the one hand, you must be driven to find the best fit of the model to the data. This means, in part, combining factors that do not differ significantly from each other. On the other hand, you also want to explain (and demonstrate) how the variation is embedded in the subsystem of grammar as well as in the community. Sometimes this is more effectively accomplished with a more ‘fleshed-out’ model (also known as a ‘maximal model’). It may be more explanatory to show that certain linguistic or social categories pattern similarly (i.e. are not statistically different from one another). The process of finding the best analysis for the data is multi-dimensional and not entirely statistical. It requires the judicious insight of a linguist. Statistical modelling procedures are covered in Chapter 10.
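One practical way to check whether a collapsed grouping fits the data as well as a fuller one is a likelihood-ratio comparison of nested models. The sketch below is purely illustrative: it assumes that decade of birth, ‘dec’, is stored as a number in the hwat file, and the 1920 cut-off is invented for the example; adapt the formula to your own factor groups.

    # fuller model: each decade of birth as a separate factor level
    m_full <- glm(dep_var ~ comm1 + gender + factor(dec),
                  data = hwat, family = binomial)
    # simpler model: an illustrative two-way split of decade of birth
    m_simple <- glm(dep_var ~ comm1 + gender + I(dec >= 1920),
                    data = hwat, family = binomial)
    anova(m_simple, m_full, test = "Chisq")   # a non-significant difference favours the simpler model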

Interaction

The variable rule program did not check overtly for non-independence between factor groups. However, this is something that mixed effects modelling can do easily. Nevertheless, checking for overlaps should begin well before statistical modelling takes place. A simple way to check for overlap between factor groups is to use cross-tabulation, examining the distribution of contexts between them (Tagliamonte & Poplack, 1988: fn. 22). Another way is to scrutinise the values of each factor in one statistical model compared to another. If there are notable shifts in values when one factor group (variable) or another is added or subtracted, pay attention. When these changes are small (e.g. if they do not affect the way in which factor effects are ordered by size), then we may generally attribute them to sampling fluctuation. In another case a particular factor group may have minimal effect. Keep an eye on the values across models. If one or more of the changes is large when a factor group is added or removed, then you may suspect that this factor group and the one(s) subject to this change are not independent, that is, they interact.
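Such a cross-tabulation can be produced with the same count and pivot_wider pattern used for the data summaries later in this chapter. The sketch below pairs community with grammatical category in the hwat file; substitute whichever factor groups you suspect of overlapping.

    # distribution of contexts across two factor groups
    hwat %>%
      count(comm1, gram_cat) %>%
      pivot_wider(names_from = gram_cat, values_from = n, values_fill = 0)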

Using a generalised linear mixed effects model offers a simple and straightforward way to test for interactions because they can be implemented directly in the formulation of the model itself. This gives the analyst several new perspectives. Of course, it can simply test whether an effect is significant as a combinatory factor group and compare this to its significance when treated as a main effect. However, importantly, it can establish whether a contrast within a factor group is stable or changing over time (see e.g. Denis & Tagliamonte, 2017: 20–21). When you search for and test interaction in your model, you may uncover some of the most important findings. In Chapters 8–10, I will demonstrate interaction factor groups and how to spot interaction.
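In lme4 syntax an interaction is specified directly in the model formula with ‘*’ (or ‘:’ for the interaction term alone). The sketch below is illustrative: ‘indiv’ stands in for whatever column identifies the individual speaker in your token file, and the choice of predictors is arbitrary.

    library(lme4)
    # main effects only
    m_main <- glmer(dep_var ~ gender + dec + (1 | indiv),
                    data = hwat, family = binomial)
    # main effects plus the gender-by-decade interaction
    m_int  <- glmer(dep_var ~ gender * dec + (1 | indiv),
                    data = hwat, family = binomial)
    anova(m_main, m_int)    # does the interaction improve the fit of the model?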

This discussion also raises a related question: is it better to have one big factor group containing many contrastive factors, or many factor groups, each having a binary contrast? A binary factor group makes a stronger linguistic hypothesis. However, if it is only partially right, the fit of the model to the data will not be as good. If you throw everything into one factor group, it can be termed the ‘kitchen sink effect’. While such a model might fit the data better, it will not tell you as much if it misses linguistically valid generalisations elsewhere. The great thing is that it is very simple to run models; the bad thing is that the analyst must be wary, clever, and skilled to figure out how the variable system operates and then formulate a principled interpretation.

Using R in Variationist Practice

Open R Studio. Your screen will look something like the one in Figure 7.1.

Figure 7.1 Typical R Studio opening screen.

Used with permission from Apple Inc.

There are four main windows: (1) the source editor, also called the script editor; (2) the workspace and history pane; (3) the R console, and (4) the files, plots, packages, and help pane. Much of what you see here can be modified to your own preference. Take some time to explore the many options under the drop-down menus.

The Source Window

The source is the place where you type in the code. You can set up this area with various colours and other features in ‘global options’ under the ‘tools’ menu. In the view in Figure 7.1, I am using ‘iPlastic’ because it highlights the brackets, which must always be matched. A common coding error is that a bracket has been left out.

The Workspace and History Panes

The workspace and history panes are in the top right corner. The workspace, here ‘environment’, displays all the objects that are currently loaded into your R session. In this pane you can view and manage the objects, import other objects, save and load workspaces and remove all objects from the current session and start afresh. The history pane keeps a record of all the commands you have executed in the R console. In this pane you can review past commands, reuse them, save the history of your steps, or find commands that you have used before. These panes facilitate your ability to manage your projects.

The Console Window

The console window, bottom left, is the place where the program outputs the results of the code. You will also find error messages here, which will help you figure out what went wrong with your code so you can revise it.

The Environment Window

The environment window contains a lot of information, including all the files you have created in the current environment, data files that you have produced, and a listing of the variables and functions present in the current R session.

The Data File

To work with your data file, you must import it into R. Once imported to R, it exists as a data frame that can be viewed, revised, and saved under a new name if desired. As discussed earlier, the elaborated version of your extraction and coding of the data can remain as an Excel (or other spreadsheet) file, complete with preceding and following contexts and other annotations. I use the designation ‘ROOT’ in the file label to keep this enriched data file for reference on my computer. I save a second file in ‘tsv’ format for importing to R. In each of these formats – xlsx and tsv – the data file comprises individual instances of a linguistic variable in rows. This is the listing of each context in the original data where the individual had a choice (the dependent variable) and the choice that was actually made in that context (the variant). Each associated independent variable is recorded in columns. These individual occurrences are often referred to as tokens, and this is why researchers sometimes refer to the data file as the ‘token’ file.

I will use ‘tidyverse’, an R programming package, which is a collection of R packages designed for data science and statistical modelling. Note that in R there will be many ways to arrive at the same information. What you will find in this book is the best practice for arriving at the answers from my experience. Feel free to revise, elaborate, and refine as you develop your skills and gain more experience in analysing variation. If you have questions about a specific package, refer to its documentation either inside R or on the internet.

What’s in the Data File?

As discussed earlier, in Chapter 2, sample design is a critical baseline for any variationist study. Sample design refers to the systematic process of selecting the data that will go into the analysis. In every case, a sample is a subset of data points from a larger population or body of materials. It is important for a scholar to divulge in detail what that sample design is and how it came about: what is the population and the description of the individuals or elements that make up the population? What was the sampling method and sampling size? What type of sampling was involved? How representative is the sample? These questions are the basis of understanding how generalisable the findings are to the broader population from which the data have come. Variation analysis is typically based on populations or groups of speakers in a speech community, defined by geographic or social criteria. When someone comes to me to discuss studying a linguistic phenomenon, I get out a pen and paper and draw a grid. What is the data going to be? What are the most important dimensions: place, age, social network, type of work? Deciding on these questions requires systemic consideration of research objectives, characteristics of the data required to achieve the research objectives, and sheer practicality, that is, what is possible in the available time and affordable.

When a study begins you construct the ideal design, but when you get into the ‘mud’ of extracting and coding, the sample you end up with may not be exactly as planned: individuals, groups, or texts may be difficult to find, particularly in some of the categories you defined at the outset. Tokens of variants may, for example, be unexpectedly limited or badly distributed. Before the analysis begins, it is a good idea to remind yourself what you have in the data file. The code in (7a) probes the contents of the hwat data file.

(7a)

    hwat %>%
    count(comm1,gender,dec) %>%
    complete(comm1,gender,dec, fill = list(n=0)) %>%
    pivot_wider(names_from = dec,values_from = n)

It is time to introduce the functionality of tidyverse. One of its tools is the pipe operator, ‘%>%’. This linking device enables the analyst to string together a series of functions. Think of it as ‘and’. Notice how the tidyverse pipe ‘%>%’ connects the lines of code in (7a) (shaded).

The coding string begins with the name of the data file, hwat. The first function is ‘count’. Count asks for the counts in each of the factor groups listed: community ‘comm1’, ‘gender’, and decade of birth, ‘dec’. Note that the labels of each of the factor groups in the data file you are using must match the labels in the code. The next line applies the function ‘complete’, which fills in any missing combinations of factors with zero, thereby ensuring there is a row for each possible combination. ‘Pivot_wider’ displays the counts with the factors of decade across the top of the table and the counts in the columns, producing a table that is more in line with variationist practice than the default that R would produce. The code outputs the results in (7b).

(7b)

    # A tibble: 4 × 10
      comm1         gender `1880` `1890` `1900` `1910` `1920` `1930` `1940` `1950`
      <fct>         <fct>   <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
    1 Parry Sound   woman       0    116     68     73    110      0     54     33
    2 Parry Sound   man        65     69     69     58     60     43    118      0
    3 Ottawa Valley woman      59     35     72     55    133     23     56     59
    4 Ottawa Valley man        68     52    121      0     56    147    108      0

This view shows the total counts of variable (hwat), revealing that despite our best efforts to fill the key community and generational cells, there are some with no tokens (shaded). Our original plan was to find as many tokens of words which could alternate with [w] and [ʍ] as possible. We focused on two communities where the aspirated variant occurred relatively frequently, Parry Sound and several small towns in the Ottawa Valley. However, the variants were unevenly distributed. In Parry Sound, there were none from women in the 1880s and the 1930s nor from men in the 1950s. Similarly, there are empty cells in the Ottawa Valley for men in the 1910s and 1950s. The number of existing tokens per cell varies from 23 to 147. This is the inevitable outcome when studying language. It is inherently lumpy, both for system-internal as well as external, real-life, reasons. In this case, words with these phonemes are relatively rare in English, and for whatever reason some of the individuals in the sample happened not to use them. To some extent these issues can be resolved by adding more individuals (if you have them), but in the end you have to make the best of the data you have (see Labov, Reference Labov1994).

Box 7.04 NOTE

Labelling practice in this chapter will often use ‘hw’ for [ʍ] and ‘w’ for [w]. When referring to the variables themselves, I will use abbreviations in parentheses, for example variable (hwat) and variable (adj_pos).

Next, using variable (adj_pos), let’s do the same computation, (8a–b). Notice how the code stays the same, but the labels change. In this case, the data come from a single community, Toronto, so the code queries the number of tokens by decade of birth and gender, the two main determinants of variability that have been reported in the literature on this variable. We employ the same coding sequence but change the labels to be compatible with the adj_pos data file, (8a), which produces the table in (8b).

(8a)

    adj %>%
    count(dec2,gender) %>%
    complete(dec2,gender, fill = list(n=0)) %>%
    pivot_wider(names_from = dec2,values_from = n)

(8b)

    # A tibble: 2 × 10
    gender `1910_s` `1920_s` `1930_s` `1940_s` `1950_s` `1960_s` `1970_s` `1980_s` `1990_s`
    <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
    1 man 37 60 72 86 221 169 341 523 93
    2 woman 49 112 30 115 319 177 324 831 142

The output in (8b) shows some of the same issues that were evident in the hwat data file: the oldest generation has less data (undoubtedly because there is less data overall) and the decade/gender cells range from a count (i.e. tally) of 37 to 831. From this view alone, you might make some preliminary observations. It appears that individuals born in the 1950s through the 1980s use more adjectives of positive evaluation than those in earlier decades. However, this view does not consider the total number of words from which the adjectives were drawn. The same is true for the observation that in most decades women use more of these adjectives, sometimes substantially more. However, because women tend to talk more than men on average in sociolinguistic interviews, their interviews are longer, and this can further confound any interpretation. Both hypotheses are compelling but require further information to be substantiated. In this study, we were interested in the choice of adjective by social and linguistic factors, so we did not pursue the question of the total number of words contained in the full interview, that is, the total number of words produced by the speakers in each cell.

Practice

Today, statistical tools are more readily available than ever before. Many of these tools are available for free download, along with documentation, from the internet. All you have to do is double-click and get going.

In the next section, I review the foundational aspects of quantitative analysis using variationist methods and show how these operate in practice in the R environment. For illustration, I will use two types of variables. The first one, (9), is formulated as a binary variable: the choice between [w], a voiced labial-velar glide, as in witch, and [ʍ], a voiceless labial-velar fricative that is heard in the word which for some speakers (Needle & Tagliamonte, 2022).

(9) Variable (hwat) There was a mill in Carp too, where ([wɛɚ]) they made – you know the farmers used to bring their oats and wheat ([ʍiːt]) and that you know, put it through the mill. (ODP, eashford, woman, 90, b. 1887, recorded 1977)

The second one, (10), is formulated as a multi-variant variable: the choice of adjective in the semantic field of positive evaluation, including great, amazing, wonderful, very good, and so on, focusing on a large archive from Toronto, Canada (Tagliamonte, 2016b; Tagliamonte & Pabst, 2020).

(10) Variable (adj_pos) And um there’s this woman down there who sells amazing pakoras. They’re awesome! (ODP, nwold, woman, 28, b. 1976, recorded 2003)

These data files (hwat_2-13-24.tsv and adj_1-2-24.tsv) are available for download at www.cambridge.org/tagliamonte for the purpose of practising the methods in this book. They must not be used for any other purpose without permission.

Using Excel for Data Extraction and Coding

An ideal way to extract and code data for variation analysis is to use Excel or a similar spreadsheet editor. Excel offers flexibility for coding and documenting the contexts of variation, and its many features enable you to sort and filter the data, add notes, and much more. Extract and code into an ‘xlsx’ file, including the dependent variable and all the independent variables as well as the preceding and following context. A snippet view of the Excel data file for variable (hwat) is shown in Figure 7.2.

Figure 7.2 Excel data file containing the initial extraction of variable (hwat). Source: Excel software.

Used with permission from Microsoft.

Figure 7.2 shows eleven columns, which have been filtered for tokens of variable (hwat). Each column contains key information about the variable. The first column records the individual, here represented by the first initial and last name of the pseudonym (i.e. ‘astarz’ is ‘Alfred Starz’). Columns B and C code the individual’s gender (‘gender’) and year of birth (‘yob’). Columns D and I show the context of the extracted example, with the preceding context, ‘prec’, in Column D and the following context in Column I. Column E is the word containing the target phoneme. The dependent variable, dep_var, in Column F is coded as ‘0’ and ‘1’, a numeric factor (as we will discover, this can be recoded to be a factorial dependent variable later). Columns G and H record the phonological form of the preceding and following context. Column J records the community, and Column K records whether the word was a content or a grammatical word, ‘gram_cat’. In my lab, we use conventional labelling for factor groups. This procedure is both practical and efficient. It uses labels that are short and ideal for use in R, and they serve a mnemonic purpose as well. If year of birth is always yob, it is easy to remember and easy to reuse in coding sequences in R.

Box 7.05 TIP

My advice is to keep your labels short and lower case, and to use underscores, not spaces or dashes, because these do not translate well into R. This practice will help ensure your code works under most circumstances. While there are ways to avoid constraining your practice like this, taking these precautions makes it much more likely that your code will work.

Using the standard Excel format for data extraction and coding (i.e. ‘xlsx’) is best due to its functionality; however, when you save the file for import to R, save it as a ‘csv’ or ‘tsv’ file, which turns the data file into a form readable by R. I usually delete the context columns and any that contain extraneous information, notes, and so on, so that these do not clutter up the R window. Note that when the xlsx file is saved as .tsv or .csv, any additional sheets are automatically removed, so make sure you keep your xlsx file safely saved and stored.
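As a sketch of the import step (the file and object names follow the practice files introduced in this chapter; adjust the path to wherever you saved the tsv), read_tsv() from the tidyverse reads the file into a tibble, and character columns can then be converted to factors.

    library(tidyverse)
    hwat <- read_tsv("hwat_2-13-24.tsv") %>%
      mutate(across(where(is.character), as.factor))   # treat coded columns as factors
    glimpse(hwat)    # one row per token, one column per factor group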

Beginning an R Session

I will illustrate the procedure using an R markdown file. New users of R should download the latest version of the software and then add R Studio, ‘an integrated development environment (IDE) for R’.Footnote 2 At the time of writing I am working with R-4.3.1-x86_64.pkg, a version for older Intel Macs, and R Studio Version 2023.06.2+561. Further discussion of the R Studio set-up and functions will be covered in Chapter 8.

Open R Studio and create an R markdown file, aka an ‘RMD’ file using the following path (File/New File/R Markdown file) and visualised in Figure 7.3. You can select various formats. Choose the one you are most comfortable with.

Figure 7.3 Filepath to create an RMD file in R Studio.

Used with permission from Apple Inc.

An initial step in the R environment is to create a ‘chunk’ in your RMD file for the set-up of your session (wait until you have set your working directory before saving the file). Inside the set-up chunk, call in the packages that are relevant to the types of analyses you will conduct. These packages are held in different ‘libraries’, which must be loaded each time you intend to work with your RMD file.

Next, install the packages by going to the Packages menu in the lower right-hand window of the R Studio screen. Scroll through the packages and click the box for those indicated below. Load the libraries by placing your cursor on the relevant line in the set-up chunk and pressing ‘command enter’. The libraries can also be loaded by typing the commands into the chunk and then running the chunk, as visualised in Figure 7.4.

Figure 7.4 Set-up chunk for an RMD file in R Studio

A chunk begins with three back ticks, followed by an opening curly bracket, ‘r’, a space, the label for the chunk, and a closing curly bracket, that is, ‘```{r label}’. Each chunk must begin and end with three back ticks, ‘```’, and they must appear on the far left. The labels must not have spaces and should not be too long, and every chunk in the RMD file must have a unique label. The chunk labels are intended for the human analyst to organise their RMD files. They are very useful for navigating. I even think they help organise one’s argumentation.

‘Knitr’ provides extra features for ‘knitting’ RMD documents to an output file. The ‘knitr::opts_chunk$set(echo = TRUE)’ line in Figure 7.4 sets the options for how the RMD file will be displayed after knitting. When ‘echo = TRUE’ is set, the code chunks in your document will display the code you used as well as its output. This is useful for learning purposes because you will be able to see the code and the results of the code in the output file.

Box 7.06NOTE

Every analyst will have their own predilections about how to go about running R, how to set up and run chunks, and so on. Take the directions in this book as a model, but then adjust to your own practice.

I will use many different R packages as we step through the various stages of analysis. For now, these are brief descriptions that will become clear in practice. The above-mentioned ‘tidyverse’ includes a bundle of packages, ‘tidyr’, ‘dplyr’, ‘readr’, ‘forcats’, and so on. The ‘janitor’ package is used for several functions, including ‘tabyl’, ‘adorn_’, and ‘clean_names’ (the ‘glimpse’ function used throughout comes from ‘dplyr’).Footnote 3 ‘Partykit’ is used for conditional inference trees and random forest models. ‘Lme4’ is used for linear mixed-effects regression. ‘Ggeffects’ is used for plotting regression results. ‘Broom.mixed’ is for outputting regression model fits. ‘Cowplot’ combines plots together. There are many R packages, and you may find others that work well for your purposes. Also, be warned that R packages are updated all the time. They can be updated by searching under Tools/Check for Package Updates.
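
For reference, a minimal sketch of a set-up chunk along the lines of Figure 7.4 is given below. The exact selection of packages follows the descriptions above; treat it as an assumption and adjust it to the analyses you actually plan to run.

    ```{r setup}
    knitr::opts_chunk$set(echo = TRUE)
    library(tidyverse)    # tidyr, dplyr, readr, forcats and friends
    library(janitor)      # tabyl, adorn_ functions, clean_names
    library(partykit)     # conditional inference trees, random forests
    library(lme4)         # mixed-effects regression
    library(ggeffects)    # plotting regression results
    library(broom.mixed)  # tidying regression model output
    library(cowplot)      # combining plots
    ```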

Box 7.07NOTE

What is the difference between an R ‘package’ and an R ‘library’? A package is a collection of R functions, data, and compiled code with related functions. A library refers to the directory on your computer where R packages are installed. When you install a package, it is downloaded to your R library. To use the package, load it into your R session with the ‘library(package)’ command, as in the set-up chunk.
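
As a concrete illustration (using ‘partykit’ only as an example), installation is a one-time step, whereas loading happens in every session:

    install.packages("partykit")   # downloads the package into your R library (once)
    library(partykit)              # loads the package into the current session (every time)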

Loading Helper Functions

The next step is to load the helper functions that we will use over the course of analysis. The code lines that must be run are raw text that can be copied from the source links and pasted into the appropriate spot in a chunk in an RMD file. They can be copied into your working environment from the external sources on the internet or run locally. Run the lines of code in Figure 7.5 by placing your cursor on them and pressing command-enter, or by running the whole chunk.

Figure 7.5 Coding string to add in useful helper functions

Note that you can ‘comment out’ certain lines using ‘#’, preventing them from being executed. To activate them, all you need to do is remove the ‘#’. In this case, the lines that begin with ‘#’ are notes meant to support readers in understanding the code.
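
Because the helper code itself appears only in Figure 7.5, the lines below are a purely hypothetical illustration of the pattern, with an invented file name: a ‘#’ note that R ignores, a live ‘source()’ call, and a second call commented out until it is needed.

    # helper functions for summary tables (a note for the analyst; R skips this line)
    source("my_helper_functions.R")    # hypothetical file name; point this at the real source
    # source("another_helper.R")       # commented out; delete the '#' to run this line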

Next, load in the ‘varimp_helper’ (Figure 7.6), a support for visualising random forests (see Chapter 9).

Figure 7.6 Coding string to add in varimp_helper

Set Working Directory

Set your working directory so that the commands you issue will find and access the appropriate files on your computer. When you run this line of code, R will change the working directory to the one you specify. To get the right ‘path’, go to the folder where you are intending to store all your files. On a Macintosh computer, right click on the data file at the same time as you hold the shift and option key. ‘Copy filename as Pathname’ will appear under the options. On a Windows computer, hold down the shift key and then right click on the file. ‘Copy as path’ will appear under the options. Copy and paste it into your RMD file. Use it to formulate the ‘setwd’ command. Once you have set the working directory, all subsequent file operations or data loading will come from the specified directory. My data files are stored in my Dropbox folder, in a folder labelled ‘JEREMY’, in a folder labelled ‘VSLX_COOKBOOK_ASV’, (5a). Inside that folder are my data files ‘hwat_2-13-24.tsv’, (5b) and ‘adj_1-2-24.tsv’. These are the most recent versions of my data files. I typically end up with many data files and you will too.

Loading your data for use in the R environment can be accomplished in various ways, all of which require several steps. Label a chunk with the designation ‘PREPARE_AND_LOAD_DATA_SETS’ in your RMD file. In this chunk, change the path so that it corresponds to the path on your own computer using the set working directory function, ‘setwd’. Once this is set, one way to prepare to load your data file is to store its name in a ‘file_path’ object. If you have set the correct coordinates on your computer, you should be able to copy and paste the file_path sequences in (11) into your RMD file.

To be clear, you must change the working directory in your RMD to correspond to your own working directory on your own computer.

  1. (11)

    setwd("~/Dropbox/JEREMY/VSLX_COOKBOOK_ASV")
    file_path_adj <- "adj_1-2-24.tsv"
    file_path_hwat <- "hwat_2-13-24.tsv"

Once you have changed the working directory for use on your own computer, you will be able to save your RMD file to this folder. The files will now appear in the Environment pane in the top right R Studio window.

Load Data Files – Basic

At this point, you are ready to load data files and specify how the factor groups should be treated. R reads in data files according to what it thinks the types of the factor groups are. Some factor groups, that is, the columns in your Excel file, will be factorial variables, such as grammatical category (‘gram_cat’) for variable (hwat), with the levels content and function. Some factor groups will be continuous variables, such as year of birth (yob). However, when you import a data file to R, the factor groups may all be designated as ‘character’ variables, ‘<chr>’. The types must be changed so that all factor groups are ‘factors’ or ‘doubles’, depending on the type of factor group and how you want to analyse your data.

To read in the hwat data file and to view the factor group specifications, use the code in the first line of (12a). Follow it with ‘spec’, which lists the factor groups (independent variables) and how each has been read in, as shown in (12b). The factor groups are columns in Excel; the factors are the categories in each column. Note that the ‘spec’ command must be run on a data frame before that data frame is modified in any way.

  1. (12a)

    hwat <- read_tsv(file_path_hwat, col_names=TRUE)
    spec(hwat)

  1. (12b)

    cols(
    indiv = col_character(),
    word = col_character(),
    dep_var = col_character(),
    dep_var1 = col_double(),
    word1 = col_character(),
    pre_seg = col_character(),
    fol_seg = col_character(),
    gram_cat = col_character(),
    comm = col_character(),
    comm1 = col_character(),
    comm2 = col_character(),
    gender = col_character(),
    yob = col_double(),
    age = col_double(),
    dec = col_double(),
    edu = col_character(),
    edu1 = col_character(),
    occ = col_character(),
    occ1 = col_character(),
    pre_vc = col_character(),
    yoi_estimate = col_double(),
    dep_var2 = col_character()
    )

In (12b), notice that the factor groups have been loaded as characters and doubles. The dependent variable, labelled ‘dep_var1’, is the numeric version of the dependent variable, so it is a ‘double’. You can also see that community has three configurations: ‘comm’, ‘comm1’, ‘comm2’. The individual is ‘indiv’; perceived gender is ‘gender’; and the other version of the dependent variable dep_var is ‘dep_var2’. All of these come through as ‘col_character’ variables. Year of birth (‘yob’), ‘age’, ‘dec’, and ‘yoi_estimate’ (an estimate of the year of interview) are doubles, that is, continuous factor groups. Once you examine the specifications of factor groups, you can modify them accordingly, that is, change ‘character’ to ‘factor’, by reissuing the code as modified, with a snippet view in (12c). Continue to replace ‘character’ with ‘factor’ for the full list of factor groups (not shown in this snippet view).

  1. (12c)

    cols(
    indiv = col_factor(),
    word = col_factor(),
    dep_var = col_factor(),
    dep_var1 = col_double(),
    word1 = col_factor(),
    SNIP!
    )

After all that is done, rerun the code, repeated in (12d), again in snippet view here. If you are copying and pasting this code into your RMD file, remove ‘SNIP!’ and include the entirety of the output. Then, use the ‘summary’ command to view the contents of the data file (not shown).

  1. (12d)

    hwat2 <- read_tsv(file_path_hwat, col_names=TRUE,
    cols(
    indiv = col_factor(),
    word = col_factor(),
    dep_var = col_factor(),
    dep_var1 = col_double(),
    word1 = col_factor(),
    pre_seg = col_factor(),
    fol_seg = col_factor(),
    SNIP!
    ))
    summary(hwat2)

The output of the data file that is produced will now have the changed factor groups. Check to be sure. Notice that I have given this second data file a different name, ‘hwat2’. Creating unique labels for different versions of your data file and recodes of factor groups ensures that you can use one or the other configuration at any time. The different versions will appear in the Environment pane. They should be identical except for the formulations of the factor groups in each one. To be very specific, if you get an error message that reads something like this: ‘Error in “count()”:! Must group by variables found in “.data” X Column “GENDER” is not found’, you know you have to change the label ‘GENDER’ in your code to match the label that is in your data file.
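
Assuming the data file really does use the lower-case label ‘gender’, a minimal sketch of the two obvious remedies is:

    names(hwat2)                                  # list the column labels actually in the data
    # remedy 1: edit your code to use the existing label, e.g. hwat2 %>% count(gender)
    # remedy 2: rename the column so it matches your code (new label = old label)
    # hwat2 <- hwat2 %>% rename(GENDER = gender)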

Box 7.08Important Note

Should you discover that the name of the data file (e.g. hwat, hwat2), or the factor groups in the data file do not correspond to those in the book or what you see on your screen (e.g. ‘comm1’ not ‘comm’), simply adjust your code so that the labels match. R commands require precision, and despite my best efforts there may still be discrepancies in the code.

Load Data Files – Guess Max

An alternative procedure is to read in the data file using the ‘guess_max’ argument, which controls how R guesses the type of each factor group. In this procedure, after the file_path, you would issue the code in (13a) for the adj_pos data file. View the data by issuing summary or glimpse. In this case, ‘guess_max’ is set to 10,000, so that up to 10,000 rows are inspected when determining the type of each factor group.

  1. (13a)

    adj <- read_tsv(file_path_adj, col_names=TRUE, guess_max=10000)
    summary(adj)
    glimpse(adj)

To ensure that the factor groups in the data are read into the R environment as factorial variables, use the code in (13b), for each of adj and hwat.

  1. (13b)

    adj <- adj %>% mutate(across(where(is.character), as_factor))
    hwat <- hwat %>% mutate(across(where(is.character), as_factor))

Check your data using summary and glimpse to be sure it is the way you want it to be.

It is time to examine what is in your data file. Using the summary command (14), inspect it with care. In the examples that follow, the commands are presented one after the other; however, you can also put them into the same chunk. Then you can run them all at once. How you run your code will depend on your own predilections.

  1. (14)

    summary(hwat)
    summary(adj)

The summary command will produce an output of all the headers of each column and the factors in each column. A snippet of the variable (hwat) data is shown in Figure 7.7.

Figure 7.7 Screenshot of a summary of the data file for variable (hwat)

Thoughtful examination of the results displayed in Figure 7.7 will enable you to determine how to reconfigure factor groups and factors. You may want to collapse factors in one factor group or another, relabel columns, or derive alternative categorisations of the factors in each group. Before embarking on your analysis, it is useful to handle all the reformatting you think is necessary. For example, in this view the factor groups ‘pre_seg’ and ‘fol_seg’ appear with their most elaborated coding and in due course will need to be recoded for further analysis. Methods for recoding factor groups will be presented in Chapter 8.

In this data file, yob is continuous (see output in Figure 7.7), with a minimum year of birth of 1884 and a maximum of 1958. Using mutate, you can change the year of birth factor group into increments, such as ten-year increments (15a) or twenty-year increments (15b). The function ‘fct_relabel’ relabels the factors with an added ‘_s’ for comprehensibility.

  1. (15a)

    # 10-year increments:
    hwat <- hwat %>% mutate(
    dec1 = yob - (yob %% 10),
    dec1 = as.factor(dec1),
    dec1 = fct_relabel(dec1, ~ paste0(., "_s"))
    )
    hwat %>% count(dec1)

  1. (15b)

    # 20-year increments:
    hwat <- hwat %>% mutate(
    dec2 = yob - (yob %% 20),
    dec2 = as.factor(dec2),
    dec2 = fct_relabel(dec2, ~ paste0(., "_s"))
    )
    hwat %>% count(dec2)

The utility of one or the other categorisation schema for yob will depend on your data set, that is, the number of tokens per cell in each time period. With fewer data points, larger intervals are better. Be careful to use new labels (e.g. ‘dec1’ and ‘dec2’) so that the original factor group is not replaced. In each case the count command, shown in (15a) and (15b), requests the number of tokens per increment; the output for dec1 is the tibble in (15c). Try dec2 on your own.

  1. (15c) Tibble for counts of dec1

    
    # A tibble: 8 × 2
    dec1 n
    <fct> <int>
    1 1880_s 192
    2 1890_s 272
    3 1900_s 330
    4 1910_s 186
    5 1920_s 370
    6 1930_s 213
    7 1940_s 336
    8 1950_s 81

Box 7.09NOTE

When recoding or relabelling, the new label comes first in the coding string and the old label comes second. In the recode in (15a), ‘yob’ is the original label and ‘dec1’ is the new one; in (15b), ‘dec2’ is the new label, and so on. In order to keep the labels consistent, I maintained the ‘dec’ stem even though twenty-year increments represent a generation, not a decade. You can do it whatever way you wish.

Recoding labels of factor groups and factors for clarity will produce more comprehensible tables and figures for audiences unfamiliar with your unique labelling system. For example, the factor group ‘dec’ could be relabelled (e.g. ‘decade’). Then, you can save the recodes to a new data file with the ‘write_tsv’ command and this version will be saved for future use. Alternatively, save a table with the counts for the new factor group ‘decade’ by running the code in (16a), which produces the tibble in (16b). Save the tibble with the ‘write_tsv’ command and give it a label, for example ‘decade_table_10-year-increments.tsv’. Retain the output for future reference.

  1. (16a)

    adj <- adj %>% mutate(
    dec = yob - (yob %% 10),
    dec = as.factor(dec),
    dec = fct_relabel(dec, ~ paste0(., "_s"))
    )
    adj <- adj %>% rename(
    decade = dec)

    adj %>% count(decade)

    adj %>% count(decade) %>%
    write_tsv("decade_table_10-year-increments.tsv")

  1. (16b)

    
    # A tibble: 9 × 2
    decade n
    <fct> <int>
    1 1910_s 86
    2 1920_s 172
    3 1930_s 102
    4 1940_s 201
    5 1950_s 540
    6 1960_s 346
    7 1970_s 665
    8 1980_s 1354
    9 1990_s 235

Now examine the contents of the variable (adj_pos) data file, using summary (Figure 7.8). The glimpse command will provide a more abbreviated view (not shown).

Figure 7.8 Screenshot of a summary of the data file for variable (adj_pos)

The ‘adj_pos’ data file also records the year each individual was interviewed. This is coded under the label ‘yoi’, for year of interview. You can see under ‘yoi’ in Figure 7.8 that the time span represented in the data is between 2002 and 2014.

Notice that the data set retains an earlier coding of yob as a factorial variable, ‘bymo’. In my lab ‘B’ means ‘adolescents’, while ‘Y’ is young adult, ‘M’ is middle-aged, and ‘O’ is older. This recode ensures consistency with previous research that used the same age groups (e.g. Labov, Reference Labov2001b). Depending on the nature of your data set, it may be possible to disentangle an individual’s age from their year of birth, a key issue in the field (e.g. Sankoff, G., Reference Sankoff and Brown2006). In this case the twelve-year time interval is likely not enough to make a difference, but that is an empirical question. Because yoi is coded, you could easily check this out.
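
For instance, because both ‘yob’ and ‘yoi’ are recorded, a rough age-at-interview measure can be derived in a single mutate; this is only a sketch and assumes those column labels.

    adj <- adj %>% mutate(age_int = yoi - yob)   # approximate age at the time of interview
    summary(adj$age_int)                         # inspect the range of ages represented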

Many additional data adjustments may be desirable and can be done in the R environment at the beginning of your RMD file in a chunk labelled ‘DATA-ADJUSTMENTS’. Additional modifications can be implemented at any point in your process. As you work through an analysis, further insights lead to an updated understanding. For example, a well-justified way to recode yob is to break the timeline according to the specific junctures obtained from a conditional inference tree analysis. Further explanation of how this is accomplished will be covered in Chapter 8. In the meantime, set up your RMD file to your own liking. Perhaps you would like each coding string illustrated in this book to be a separate chunk; perhaps you prefer to group them together, as I have done in most cases. This part of the process is where each analyst can do their own thing.

Box 7.010NOTE

I used to say I could tell a lot about a person by looking at their GOLDVARB files, especially the condition files. Now, it is even more revealing to see how different researchers set up their RMD files.

Choice of Data Files

As outlined earlier, my practice is to do all the extracting and coding of data into an Excel file, but any spreadsheet will be equally useful. Make this version of the data file the most elaborate, the place where the most detailed information is contained. For example, I include in this file the coordinates of each datum as a separate factor group, either the time stamp in the audio-recording or the token’s numerical position in the transcription file. Use whatever referencing system works for your data. I also include the preceding and following context with as much information as seems necessary in the extraction phase (see Chapter 6). Together this information ensures you can always get back to the original context. Users of ELAN or other data transcription methods will have a different procedure. Being able to find the original context, transcription, and audio record enables coding of independent factor groups in the initial phase of your research, and at later phases it helps you understand what the coding strings represent while keeping you close to your data. The spreadsheet file can also include a column for comments. During the extraction and coding phase, relevant notes can be inserted as guideposts to support and inform data processing at a later stage. However, none of these columns needs to be imported into R, and if they are removed the work that is done in R is easier to see. (Compare the listing in Figure 7.9 with that in Figure 7.10 for readability.)

Figure 7.9 Excel data file with preceding and following context and other details.

Figure 7.10 Excel data file for import to R.

Practices of data processing that make things easier for the researcher are a tremendous support for simplifying procedures and for understanding and interpreting the results of an analysis.
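
Should any of those extra columns make it into the imported file anyway, they can also be dropped on the R side; the column names here (‘notes’, ‘time_stamp’) are hypothetical.

    hwat <- hwat %>% select(-c(notes, time_stamp))   # drop columns not needed for analysis in R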

Exclusions

Every variation analysis involves excluding certain contexts for one reason or another during the extraction and coding phase. It is important to keep a record of these tokens. When you get to the point of writing up your methods section later, this practice will make it possible for you to illustrate the types of contexts that are ‘don’t count’, that is, the ones that are outside of the variable context as you have defined it. It is important for replicability to record some examples of these types for exemplification purposes. Do it during the extraction phase so that you don’t have to go back to find them.
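
One way to keep that record while working in R, sketched here with a hypothetical ‘exclude’ column added during extraction, is to split the don’t-count cases off into their own file before analysis.

    # assumes a hypothetical column 'exclude' coded 'yes'/'no' during extraction
    excluded <- hwat %>% filter(exclude == "yes")     # the don't-count tokens
    hwat <- hwat %>% filter(exclude != "yes")         # the tokens inside the variable context
    write_tsv(excluded, "hwat_exclusions.tsv")        # keep the record for the methods section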

Box 7.011NOTE

Another useful technique is to include a column for germane examples or super-tokens in your root data file in xlsx format so that they are readily found when needed; for example, writing an abstract or reporting preliminary results. When studying a community, I believe it is important to use examples that are true to the culture of the place while illustrating variation, for example, It was a damn bear come out of the den like that (Parry Sound, jcartwright, age 86). In this case, the example illustrates preterit come in northern Ontario. The example also includes a characteristic ‘it’ cleft with a zero relative pronoun. The fact that it describes an encounter with a bear is both unique and true to the authenticity of the place.

Coding for Individual

It is critical to code individuals separately so that the results from each one can be checked, compared, and contrasted. As illustrated earlier, knowing the patterning of variants by individual can provide key insights into the variable under study. You will also notice that I use both pseudonyms and age. The names are true to the place as well as to the ethnicity of the original name. This is part of my mission to acknowledge the people who participate in my studies as central to the study of their language.

Where to Keep the Metadata

In my practice, each individual’s metadata is recorded in a separate database. Their label is their short-form pseudonym, the last name preceded by the initial of their first name (e.g. ‘astarz’, ‘Alfred Starz’). This makes it possible to add the external factor groups year of birth, gender, education, and so on to the data file (either in Excel or R) without having to separately code for them in a spreadsheet. To make this possible, always include a factor group specifying the individual’s unique label so that their metadata can be efficiently linked to them.
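
In R, that link can then be made with a join keyed on the individual’s label; the metadata file name below is hypothetical.

    speakers <- read_tsv("speaker_metadata.tsv")        # hypothetical file: indiv, gender, yob, edu, ...
    hwat <- hwat %>% left_join(speakers, by = "indiv")  # attach each individual's metadata to their tokens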

The Variationist Lab Book

The steps involved in conducting a variation analysis are complex and involve many decisions, revisions to decisions, and even more revisions to decisions. This is all part of the process. If you do not make any mistakes in this lengthy procedure, you are probably not paying enough attention. Beginning with the judgements that go into circumscribing the variable context right through to checking for interaction, you will be engaged in a long process of observation, revision, and problem-solving. How do you keep track of it all? Although each lab or individual will develop their own procedures, some practices are tried and true. An important practice is to have a central place where you can record procedures and exceptions, and document your decision-making process and how it changes as it evolves. You want to be able to refer to what you did so that you can remember how you did it. This is where the ‘lab book’ comes in, although it does not have to be a physical book.

In the lab book, record your research process in minute detail – in part so that you maintain consistency, in part so you can replicate the process or improve upon it. Just as a chemist or an inventor records the steps of each experiment – traditionally in ink not pencil so that the actual process undertaken could not be amended or erased from the record – so too in variation analysis. Get yourself a lab book, make yourself an observation file, or create a database. Whatever method suits your inclinations. The important thing is to have a place to keep track of everything. Date and record your decisions. When you discover a means to do something better, refine your process and record the revision. When you notice an anomaly or exception, record the example, with reference, so that you know what types of things you should exclude as you move forward. When you come to the same problem again, go back to the lab book to find out what you decided to do about it. Apply the update consistently every time. I still go back to my earlier lab books to understand what I did at an earlier phase in the research.

The lab book can also be used to maintain a list of data files and their labels. It is a good idea to name your files so that you will be able to glance at them and know what they are and what they were configured to do. I use a ‘naming protocol’, which I apply in the same way for each analysis. For example, each corpus in my archive has a three-place code (17a). Each variable has a two- or three-place code (17b). This makes it easy to name token files consistently (17c). To this, you can always add dates, to keep track of the most updated version of files (17d). Whatever naming protocol you use will go a long way to helping you remember what each token file contains. It also makes global searches for files much easier.

  1. (17a)

    a. York English Corpus: YRK
    b. Devon English Corpus: DVN
    c. Toronto English Corpora: TOR

  1. (17b)

    a. Variable (-ing): ing
    b. Variable (-t,d): Td
    c. Variable (have/have got): Got

  1. (17c)

    a. YRK_ing.xlsx
    b. DVN_-s.xlsx
    c. ROOTS_got.xlsx

  1. (17d)

    a. YRK_ing_8_24_04.xlsx
    b. DVN_-s_2_6_97.xlsx
    c. ROOTS_got_12_10_01.xlsx

The lab book serves many purposes, not the least of which is to remember where you left off. If even a day or two go by, you will forget what you did the last time you worked on your data. The lab book also ensures consistency in circumscribing the variable context, in coding the factors and in producing the output files. It should also hold the coding schema for your variable and a list of each individual’s unique label; as you produce results these files can be included as well. I recommend this practice to you; it will save you time and frustration. It may even help you figure out what explains your variable. Sometimes the observations you have noted down, scribbled months previously, hold the key to understanding and explaining the data later.

Box 7.012NOTE

You will accumulate an unbelievable number of files when you conduct variation analysis. Of course, you will keep your Excel root file, complete with coding schema, and your data files. However, your RMD files of whatever type are invaluable; this is where your code is. You will need them to recreate analyses in the future.

Some Other Results You May See

There are several results that you will likely come across while conducting statistical analysis. I detail these in the next sections.

Zero Values

When dealing with badly distributed data, you may find 0 per cent or 100 per cent values in one of the cells in your analysis.

Let’s take variable (hwat) as an example. The code in (18a) probes how the data are distributed across the three main factor groups: community (‘comm1’), gender (‘gender’), and decade of birth (‘dec’). Look carefully and you will notice that some values are missing; for example, there are no tokens for women born in the 1880s, but there are sixty-five tokens for men.

  1. (18a)

    hwat %>% count(comm1, gender, dec)
    # A tibble: 27 × 4
    comm1 gender dec n
    <fct> <fct> <dbl> <int>
    1 Parry Sound woman 1890 116
    2 Parry Sound woman 1900 68
    3 Parry Sound woman 1910 73
    4 Parry Sound woman 1920 121
    5 Parry Sound woman 1940 54
    6 Parry Sound woman 1950 22
    7 Parry Sound man 1880 65
    8 Parry Sound man 1890 69
    9 Parry Sound man 1900 69
    10 Parry Sound man 1910 58
    # … with 17 more rows
    # Use `print(n = …)` to see more rows

To provide the missing values with the count of zero required for doing maths, use the ‘complete’ function, which fills the listing with zero wherever a cell has no tokens (18b). The result in (18c) makes the zero values overt. Add the ‘print(n = 50)’ command to see the full list (not shown).

  1. (18b)

    hwat %>%
    count(comm1, gender, dec) %>%
    complete(comm1, gender, dec, fill = list(n=0)) %>%
    print(n=50)

  1. (18c)

    # A tibble: 32 × 4
    comm1 gender dec n
    <fct> <fct> <dbl> <int>
    1 Parry Sound woman 1880 0
    2 Parry Sound woman 1890 116
    3 Parry Sound woman 1900 68
    4 Parry Sound woman 1910 73
    5 Parry Sound woman 1920 121
    6 Parry Sound woman 1930 0
    7 Parry Sound woman 1940 54
    8 Parry Sound woman 1950 22
    9 Parry Sound man 1880 65
    10 Parry Sound man 1890 69
    # … with 22 more rows
    # Use `print(n = …)` to see more rows

Where zero values exist, the analyst must use good judgement in deciding how to reconfigure the data. Notice that this exercise has revealed other zero values; for example, there are no tokens from women born in the 1930s. In most cases, these categorical or near-categorical values can be handled by removing them or recoding them in a linguistically justified way. In this case, recoding the year of birth of individuals into broader categories will likely provide representation of men and women in each time frame. Try it with twenty-year increments, as in (15b) earlier and as sketched below. Is the distribution sufficient to capture the trends in the data?
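
To check, the same count/complete routine can be run over the twenty-year increments; this sketch assumes ‘dec2’ from (15b) exists in your version of the data file.

    hwat %>%
      count(comm1, gender, dec2) %>%
      complete(comm1, gender, dec2, fill = list(n = 0)) %>%
      print(n = 50)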

Singletons, Hapax Legomena

A singleton or hapax legomenon means that there is only one item of its type. Recall the multiple single tokens among the adjectives of positive evaluation (e.g. grand, splendid, glorious, peachy, and swell). The same issue may arise within factor groups as well. Singletons such as these often arise in variation analysis, but for modelling it will be necessary to collapse them with other similar categories or remove them before performing statistical analysis, as sketched below. See Chapter 8.
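
One common remedy, sketched here with a hypothetical column name ‘variant’ for the adjective chosen, is to lump every level below a minimum token count into an ‘other’ category with ‘fct_lump_min’ from ‘forcats’, or simply to drop those tokens.

    adj %>% count(variant, sort = TRUE)   # locate the singletons ('variant' is a hypothetical label)
    adj <- adj %>% mutate(variant2 = fct_lump_min(variant, min = 5, other_level = "other"))
    # or drop levels with fewer than five tokens:
    # adj <- adj %>% group_by(variant) %>% filter(n() >= 5) %>% ungroup()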

Footnotes

1 A ‘snippet’ view is an abbreviated version of what will appear in your console if you run the code. I’ve shortened many of the examples for brevity.

2 RStudio: https://posit.co/products/open-source/rstudio/ (accessed 25 July 2023).

3 See Rdocumentation (www.rdocumentation.org/packages/janitor/) for data cleaning examples and others.
