This chapter covers new techniques for data exploration that go beyond probing empirical data. It will show you how to use conditional inference trees (ctrees) and random forests (cforests) to understand complex data interactions, pinpoint difficulties in research design, and discover data anomalies. The focus will be on techniques for resolving data and linguistic problems in preparation for statistical modelling.
Now that you have some basic understanding of how to explore empirical data using distributional analysis and cross-tabulation, let’s consider relatively new techniques for further data exploration. Recent developments in statistics have introduced tools such as conditional inference trees and random forests, or as I will sometimes refer to them, ‘ctrees’ and ‘cforests’. They are important for the type of data set typical of sociolinguistic studies: small, with highly imbalanced cells, and, crucially, involving individuals with varying social characteristics and dates of birth. Conditional inference trees and random forests provide complementary results, particularly regarding interactions that are sometimes difficult or impossible to obtain with a linear model. They are the ideal first step in modelling data because of their simple specifications and fewer structural assumptions. They help check general hypotheses and patterns, which can then be used to construct more complex regression models with mixed effects.
Conditional Inference Trees
A conditional inference tree analysis (henceforth, ctree analysis) reveals how interactions and predictors operate in tandem (Hothorn et al. 2006). The visualisations show the hierarchical organisation of the variable grammar (social and linguistic) laid out in panoramic relief. The algorithm estimates the likelihood of the value of the dependent variable based on a series of binary questions about the values of the predictors.
For instance, for variable (hwat), we have already discovered that grammatical function is a key contrast. A ctree analysis could assess whether splitting the data into function and content words would result in one set of data points where hw is used more often and another where w is used more often. The algorithm works through all the predictors, splitting the data into subsets wherever a split is statistically justified, and then recursively considers each subset until further splitting is not justified. Similarly, for variable (adj_pos), a ctree analysis could assess whether a split into attributive and predicative types is justified, that is, whether there is one set of data points where one of the adjectives (e.g. cool) is used more often and another where the remaining cohort is used more often. Here too, the algorithm works through all the predictors, splitting into pertinent subsets and recursively considering each one until further breaks are not justified. In this way, the algorithm partitions the data into subsets that are increasingly homogeneous with respect to the variants (Tagliamonte & Baayen 2012:159).
The result of this recursive binary splitting is a conditional inference tree. At any step of the recursive process of building such a tree, for each predictor, a test of independence of that predictor and the response is carried out. If the test indicates independence, then that predictor is useless for predicting the use of the variants. If the null hypothesis of independence is rejected, the predictor can be considered useful. If there are no useful predictors, the algorithm stops. If there is more than one useful predictor, the predictor that has the strongest association with the response is selected, the p-value of the corresponding test is recorded, and a binary split based on that variable is implemented. Ctree analyses implement safeguards to ensure that the selection of relevant predictors is not biased in favour of those with many levels (multiple factors in a factor group), or biased in favour of numeric predictors (e.g. age or year of birth of the individuals).
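The selection-and-split step described above can be sketched in miniature. The following is a simplified base-R illustration on invented toy data (the names d, func, comm, and the probabilities are all hypothetical), not the partykit implementation: ctree uses permutation tests with multiplicity adjustment and unbiased selection, whereas this sketch uses plain chi-squared tests to show the logic of the procedure.

```r
# Toy data: a binary variant and two categorical predictors (all invented).
set.seed(1)
d <- data.frame(
  variant = sample(c("hw", "w"), 200, replace = TRUE, prob = c(.3, .7)),
  func    = sample(c("function", "content"), 200, replace = TRUE),
  comm    = sample(c("PS", "OV"), 200, replace = TRUE)
)
# Build in a genuine association between 'func' and the variant.
is_fn <- d$func == "function"
d$variant[is_fn] <- sample(c("hw", "w"), sum(is_fn),
                           replace = TRUE, prob = c(.8, .2))

# One step of recursive partitioning: test each predictor's
# independence from the response and keep those below alpha.
alpha <- 0.05
pvals <- sapply(c("func", "comm"),
                function(v) chisq.test(table(d[[v]], d$variant))$p.value)
useful <- pvals[pvals < alpha]

# Split on the most strongly associated predictor; a full tree
# would now recurse on each subset until no useful predictor remains.
if (length(useful) > 0) {
  best <- names(which.min(useful))
  subsets <- split(d, d[[best]])
}
```

In practice, partykit’s ctree function carries out this whole procedure for you, with proper permutation-based inference at each step.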
For example, certain predictors may be relevant for only a subset of the individuals, and that set may be further restricted to yet another subset of individuals. Complex interactions such as these can be difficult or even impossible to capture adequately with a mixed effects logistic regression model. To capture these differences with a mixed effects model would require a combination of random effects and interactions that is quite onerous to interpret. Rather, the ctree analysis provides a foundation from which a more complex random effects structure can be justified, so long as the model is not stretched beyond its limits, for example due to data sparsity or the kitchen sink effect. As Guy (1980:31) determined many years ago, subdividing your data too finely ‘is inherently self-defeating’. Indeed, for highly unbalanced designs and complex interactions, conditional inference trees and random forests are more flexible and often provide superior models (see Tagliamonte & Baayen 2012:171).
It is also good practice to return to a ctree analysis after a best model is obtained using mixed effects modelling to visualise how the most important predictors work together.
Beyond ‘Old, Middle, Young’
An intrinsic dimension of variation analysis is the age of the individuals in the sample. Virtually every research paper written from the 1960s through to the early 2000s had a figure showing variation by the age of the individual, typically comprising two to four categories, often with the labels ‘old’, ‘middle’, and ‘young’. Since the logistic regression of the variable rule program could not handle continuous factor groups, there was no other way to model individual age than to group, or bin, the speakers. However, this type of categorisation schema is problematic for many reasons. Age groups across studies are never entirely consistent: how old is ‘old’? How young is ‘young’? Grouping individuals was also a methodological necessity, because sociolinguistic data sets are notoriously small and often involve limited sample designs. But what justification could be provided for the formulation of the groups?
Another issue arising from using an individual’s age at the time of data collection is that age is intrinsically tied to year of birth. As variationist studies have accumulated over some sixty years, the ages of individuals across studies do not match; that is, individuals with the same age might have different dates of birth depending on the year the study was undertaken. For example, the individuals in Labov’s (1963) study of Martha’s Vineyard may have been middle-aged adults then, but they would be very elderly by 2023. My real-time study of Toronto, with data collected from the same individuals in 2003–2004 and then again in 2018–2019, has a completely different set of ages for the same individuals (Tagliamonte 2018–2024). Fortunately, there is an easy solution to this issue now that modelling can handle year of birth as a continuous factor group.
Probing Year of Birth and Age
Researchers became increasingly aware of the need to disentangle an individual’s age at the time of data collection from their year of birth (e.g. G. Sankoff 2005; 2019). Ctrees are the ideal means to explore how variation emerges across apparent time using either age or year of birth as a continuous predictor. Moreover, the results of a ctree analysis provide an informed answer to how best to construct the partitioning of a factorial variable by either year of birth or age.
Let’s begin by learning how to inspect the continuous measures of age at time of interview (age) and year of birth (yob). In this process, you will learn how to collapse levels, order levels, and clarify the labels for comprehensibility. Ordering age and yob is important for how an axis will be displayed. Ctrees require a certain amount of terseness in labelling due to how they are visualised and are often much more comprehensible with adjustments to cosmetic attributes of the tree. By examining your data by age and yob, you can gain an understanding of whether the variable in your data is changing or not, and if it is changing how the change is progressing. I like to think of it as finding the break points in the temporal continuum.
The R code used to explore age and year of birth with ctrees can be repurposed for other factor groups. However, it is particularly useful for working with these independent variables (i.e. predictors).
Using Conditional Inference Trees for Year of Birth
A sociolinguistic data set is usually designed to examine the contrasts among individuals with different ages or different years of birth. Sometimes researchers design a study with extreme contrasts, like ‘young’ versus ‘old’, sometimes with representation of a span of different ages. The design of the study and the amount of data will dictate how the analyst has designed their data set.
To conduct a ctree analysis, make sure you have loaded the partykit package into R as well as your data set(s) (see Chapter 8). A ctree analysis uses the ‘ctree’ function. First, specify the dependent variable and the factor group(s) you want to include in the analysis. Once you have executed the code, use the print function to output the results or plot it using the plot function. The template for a ctree analysis is shown in (1).
(1)
Figure 9.01
ctree_model <- ctree(dependent_variable ~ factor_group1 + factor_group2, data = your_data)
The code in (1) will change depending on your data and what you want to investigate. The ‘ctree_model’ is the name you assign to each conditional inference tree model. The function is ctree from the partykit package, which is used to build ctrees. The dependent variable is the variable you are trying to explain. The ‘~’ specifies the relationship between the dependent variable and the independent variable(s). The ‘factor_groups’ are the independent variables, with the plus sign indicating the inclusion of additional factor groups. You can add as many factor groups as you like but, as you will discover, an overly complex tree is relatively useless for understanding your data. ‘your_data’ specifies the data file you are using. The output will be a ctree model predicting the dependent variable based on the values of the specified factor groups in your data set.
Variable (hwat)
For variable (hwat), the sample was necessarily limited to individuals in the oldest generation of two communities, Parry Sound and the Ottawa Valley, because younger people do not use the key variant, hw. Let’s examine how hw is influenced by year of birth of the individuals (yob) and community (comm1), using a ctree (2a). The analysis predicts hw. A simple restatement of the model, ‘ctree_hwat_age_comm’, in the second line outputs (2b).
(2a)
Figure 9.02
ctree_hwat_age_comm <- ctree(dep_var ~ yob + comm1, data = hwat)
ctree_hwat_age_comm
plot(ctree_hwat_age_comm)
(2b)
Figure 9.03
Fitted party:
[1] root
|   [2] comm1 in Parry Sound
|   |   [3] yob <= 1925
|   |   |   [4] yob <= 1890: w (n = 134, err = 9.7%)
|   |   |   [5] yob > 1890: w (n = 455, err = 29.0%)
|   |   [6] yob > 1925: w (n = 347, err = 1.4%)
|   [7] comm1 in Ottawa Valley
|   |   [8] yob <= 1932: hw (n = 798, err = 38.0%)
|   |   [9] yob > 1932
|   |   |   [10] yob <= 1948: w (n = 187, err = 31.0%)
|   |   |   [11] yob > 1948: w (n = 59, err = 0.0%)

Number of inner nodes:    5
Number of terminal nodes: 6
The output in (2b) provides a summary of the ctree’s structure and the splits in the tree in table format, with the node numbers, variable names, threshold values, number of observations (n), and error rates (err, the proportion of tokens in a node that do not match the predicted variant). At the end is a summary of the number of inner and terminal nodes.
Figure 9.1 shows a ctree of (hwat) in Parry Sound and the Ottawa Valley by year of birth. Each split is assigned a p-value indicating its level of significance as well as its relationship to the other factor groups in the model. It provides a first demonstration of how valuable ctrees can be. In Figure 9.1 you can see that the communities contrast significantly. In the Ottawa Valley, hw (the dark bars) remains robust longer, up to individuals born in 1948. Further, you can see a complexity in the comparison: the decline of hw in apparent time is different in each place. The analyst will have to dig deeper into the data to figure out why. But first, make note of the number of tokens at each split in the tree. Do these make sense? Also notice that the divide at 1948 involves the fewest tokens: the last split in the tree has only 59.

Figure 9.1 Conditional inference tree – hwat by year of birth and community
An interesting utility of ctrees is that the number of tokens in the splits of a tree can be controlled. You can use ‘minbucket’, a parameter that determines the minimum number of observations in the terminal nodes of a ctree. Minbucket permits the analyst to set the minimum number of data points required to produce a split. This controls the complexity of the tree and avoids spurious divisions. The appropriate value for the minimum bucket will depend on the data set and the questions you are trying to answer. Try adjusting the ctree in (2a) to set a minbucket in the model, as in (2c) (shaded). What changes? What do the new results add to your understanding?
(2c)
Figure 9.04
ctree_hwat_age_comm <- ctree(dep_var ~ yob + comm1, data = hwat,
    control = ctree_control(minbucket = 100))  # choose a value suited to your data
ctree_hwat_age_comm
plot(ctree_hwat_age_comm)
When making adjustments to ctrees, use your judicious assessment of tokens/cell and your (socio)linguistic knowledge.
It is a good idea to make the predicted variable conspicuous in the ctree by assigning it a dark or dominant colour. This will ensure that its patterns pop out so that you can more easily understand them.
Another interesting way to use ctrees is to explore different subsets of the factor groups to see how they operate in tandem with each other. Let’s probe variable (hwat) again, this time constructing a ctree of all the social factors: yob, community, gender, occupation, and education (3). Notice that I have set a control parameter, a minbucket of 200. A node will not be split if it contains fewer than 200 observations. The output is shown in Figure 9.2.
(3)
Figure 9.05
ctree_hwat_social <- ctree(dep_var ~ yob + comm1 + occ1 + edu1 + gender,
    data = hwat, control = ctree_control(minbucket = 200))
ctree_hwat_social
plot(ctree_hwat_social)
Figure 9.2 reveals that only yob is significant for Parry Sound with a main split at 1925, and that both yob and gender are significant for the Ottawa Valley, with a split at 1932. Overall, individuals born earlier than 1932 use more hw, but in the early days Ottawa Valley women used it more than men. This information adds yet another nuance to the building analysis. In this case, it suggests that the receding variant may have had social prestige, that is, it was not stigmatised at earlier points in time. This is a good example of how one variable can be nested in another. In this case, gender is nested within year of birth in the Ottawa Valley.

Figure 9.2 Conditional inference tree – hwat by social factors
Try making modifications to the ctree model. For example, run a ctree analysis for all the linguistic factors and then for all the factors (linguistic and social). Next run them all together. Make note of how the results change or remain constant.
Variable (adj_pos)
Variable (adj_pos) comprises a sub-sample from the Toronto English Corpus and was constructed to represent successive generations. Let’s look at the effect of year of birth of the individuals using a ctree. For adj, the analysis cannot proceed using the same steps as with hwat because the variable is multiplex. It is untenable to run a ctree with all the adjectives. Even a ctree with the seven-variant dependent variable would not give an interpretable result, at least not without adjustments. Instead, let’s run a ctree analysis on one of the dependent variables that has been constructed as binary, focusing on great, (4). The plot command produces the output in Figure 9.3.
(4)
Figure 9.06
ctree_great_yob <- ctree(great ~ yob, data = adj)
ctree_great_yob
plot(ctree_great_yob)
Figure 9.3 reveals how the use of great changed over the course of the twentieth century. The main split is between individuals born in or before 1974 and those born after 1974. However, notice that there are minor splits in the temporal continuum at 1949, 1960, and after 1984, which seems a little unusual, especially given the lower totals at node 6 (n = 163) and node 10 (n = 162) and the rising and falling pattern of usage.

Figure 9.3 Conditional inference tree – great by year of birth
You could probe this in a variety of ways. As ever, the analyst must always balance the complexity of the model with its predictive accuracy.
Since the advent of conditional inference trees in variationist research, many practitioners have started using them. However, using ctree analysis informatively is another story! Do not just throw a ctree into a presentation. Figure out how it can be used to explain the variation first.
Let’s simplify the ctree in Figure 9.3 by setting the minbucket to 500, minbucket = 500, (5) and then plot the result (Figure 9.4).
(5)
Figure 9.07
ctree_great_2_yob <- ctree(great ~ yob, data = adj,
    control = ctree_control(minbucket = 500))
ctree_great_2_yob
plot(ctree_great_2_yob)
Figure 9.4 shows terminal nodes with more than 500 tokens each. The tree has changed, but the main split is still 1974 and the temporal continuum is slightly different. Now, the use of great rises to a height among individuals born in the 1950s, remains frequent among those born in the 1960s, and then declines. This suggests that one or some of the ‘other’ adjectives may be implicated. We already know from the distributional analyses in Chapter 8 that awesome might be one of them. Before we turn to the other adjectives, let us first explore year of birth.

Figure 9.4 Conditional inference tree – great by year of birth, minbucket 500
How to Partition Year of Birth
You can use the results from a ctree analysis to partition year of birth into age groups that reflect the breaks identified in the ctree model. For example, based on the results in Figure 9.4 you could create a new factor group with the precise age breaks: 1949 and earlier, 1950–1960, 1961–1974, 1975–1984, and 1985 or later. The code in (6a) uses ‘cut’ to categorise yob based on specified breaks; ‘breaks’ defines the intervals for the categories; ‘labels’ provides the names for the categories; ‘right’ indicates whether the intervals are right-closed, using the default ‘TRUE’; and ‘include.lowest’ specifies whether the lowest interval is inclusive. In this case the default is ‘FALSE’, so specify ‘TRUE’. Use count to view the new factor group in the output in (6b).
(6a)
Figure 9.08
adj <- adj %>%
  mutate(ctree_great_ages = cut(yob,
    breaks = c(-Inf, 1950, 1961, 1975, 1985, Inf),
    labels = c("before 1950", "1950-1960", "1961-1974",
               "1975-1984", "1985 or later"),
    right = TRUE, include.lowest = TRUE))
adj %>% count(ctree_great_ages)
(6b)
Figure 9.09
# A tibble: 5 × 2
  ctree_great_ages     n
  <fct>            <int>
1 before 1950        571
2 1950-1960          561
3 1961-1974          503
4 1975-1984         1110
5 1985 or later      956
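The interval logic of ‘right’ and ‘include.lowest’ is worth checking on a small toy vector before trusting the group counts. This base-R sketch uses the same breaks and labels as (6a), applied to a handful of invented years of birth; note that with right = TRUE, a year that equals a break point (e.g. 1950 or 1975) falls into the lower interval.

```r
# Toy years of birth (invented) run through the same cut() call as (6a).
yob <- c(1940, 1950, 1955, 1961, 1974, 1975, 1985, 1999)
groups <- cut(yob,
              breaks = c(-Inf, 1950, 1961, 1975, 1985, Inf),
              labels = c("before 1950", "1950-1960", "1961-1974",
                         "1975-1984", "1985 or later"),
              right = TRUE, include.lowest = TRUE)
data.frame(yob, groups)
```

If you want break-point years to fall into the upper interval instead, set right = FALSE and adjust the break values accordingly.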
Let’s try a ctree for great using these age breaks and now including type of adjective. The code to produce the ctree is shown in (7) and the plot of the ctree in Figure 9.5.
(7)
Figure 9.010
adj_ctree_great_ages <- ctree(great ~ ctree_great_ages + type2, data = adj,
    control = ctree_control(minbucket = 250))
adj_ctree_great_ages
plot(adj_ctree_great_ages)
Figure 9.5 shows the use of great across the previously established generational splits in the data. You can now observe a linguistic systematicity to the trajectory in time. By the 1960s, when great is at its height, it is more likely to occur with attributive adjectives, as in a great idea. As it declines in usage, this pattern is maintained up to the early 1980s, but among individuals born after 1985 this constraint is no longer significant. The new finding exposed by this model is that adjective type seems implicated in the uptake of a new adjective of positive evaluation. This trend can now be tested with other adjectives that are rising and falling in usage.

Figure 9.5 Conditional inference tree – great, ctree partitions for age group
Let’s conduct another ctree analysis this time focussing on awesome by year of birth, (8). The plot is shown in Figure 9.6.
(8)
Figure 9.011
ctree_awesome_2_yob <- ctree(awesome ~ yob, data = adj,
    control = ctree_control(minbucket = 100))
ctree_awesome_2_yob
plot(ctree_awesome_2_yob)
Figure 9.6 reveals that awesome emerges among individuals born after 1977. But here you can easily see an anomaly: there is a split after 1977 that sets apart 1978 and 1979 from 1980 and beyond. This is suspicious and could be the result of specific individuals.

Figure 9.6 Conditional inference tree – awesome by year of birth
We can delve deeper into the data by probing the individuals in the relevant sector, (8b). The ‘filter’ function isolates the individuals born in 1978 and 1979 (shaded), producing the results in (8c) and (8d). Notice that for some individuals (e.g. kconflitti), there are no tokens (i.e. ‘NA’, no use of awesome).
(8b)
Figure 9.012
adj %>%
  count(yob, indiv, awesome) %>%
  pivot_wider(names_from = awesome, values_from = n) %>%
  filter(yob == 1978)
adj %>%
  count(yob, indiv, awesome) %>%
  pivot_wider(names_from = awesome, values_from = n) %>%
  filter(yob == 1979)
(8c) Individuals born in 1978
Figure 9.013
# A tibble: 6 × 4
    yob indiv      other awesome
  <dbl> <fct>      <int>   <int>
1  1978 dfriesen      62       4
2  1978 jclarin       30       1
3  1978 jtayles        5       1
4  1978 lpalintini    28      NA
5  1978
6  1978 rhanson       59      NA
(8d) Individuals born in 1979
Figure 9.014
# A tibble: 5 × 4
    yob indiv        other awesome
  <dbl> <fct>        <int>   <int>
1  1979 amaksimowski    26      NA
2  1979
3  1979 fflynn          27       4
4  1979 kconflitti      18      NA
5  1979 kfaherty        19       1
The results in (8c–8d) reveal that there are two individuals who use awesome to a much higher degree than others in their cohort: rgruensten and aodwyer. At this point, a more qualitative analysis of the individuals may be in order, or a focus on further quantitative analysis of the other adjectives and their patterns. There are many more adjectives to consider and any number of paths to take. This variety of choices is what makes doing research so interesting!
Use count and filter to find out who is using awesome, (8e). The print function enables you to view all the output, since R will produce only ten lines by default. Then, when you find out who the individuals are, you can focus in on individuals born before 1970 who are men in blue-collar occupations, (8f). Adding edu, occ2, and gender will enable you to probe the other social characteristics. Whoever they are, these are individuals who go against the more general trend. Perhaps they share certain social characteristics, or maybe they are ‘oddballs’ (Chambers 2003:93–110).
(8e)
Figure 9.015
adj %>% count(yob, indiv, awesome) %>% print(n = 300)
adj %>%
  count(yob, indiv, awesome, edu, occ2, gender) %>%
  filter(awesome == "awesome" & yob < 1970)
(8f)
Figure 9.016
# A tibble: 8 × 7
    yob indiv      awesome edu   occ2  gender     n
  <dbl> <fct>      <fct>   <fct> <fct> <fct>  <int>
1  1943 kvalentin  awesome Y     B     man        3
2  1944 bblackwell awesome N     B     man        1
3  1952 anappa     awesome N     B     man        1
4  1954 skempt     awesome N     B     woman      2
5  1959 lrowoldt   awesome N     W     woman      1
6  1959 ralbin     awesome Y     W     man        3
7  1963 asilvera   awesome Y     B     man        1
8  1968 rburkett   awesome Y     W     woman      2
I will leave discussing the comparison with results from a random forest model on the same configuration of the data until later.
Using Conditional Inference Trees for Age
Conditional inference trees are also the ideal means to probe your data for the effect of age (rather than year of birth) of the individuals.
Variable (hwat)
Recall that the sample for variable (hwat) was limited to individuals in the oldest generation of two communities. The code in (9) runs a ctree analysis using individual age at the time of interview. Use plot to visualise it (Figure 9.7).
(9)
Figure 9.017
tree_hwat_age <- ctree(dep_var ~ age, data = hwat)
tree_hwat_age
plot(tree_hwat_age)
Comparing year of birth in Figure 9.3 with age in Figure 9.7 reveals somewhat different divides: year of birth at about 1974, but age at 50 (which would mean a year of birth of about 1968 if we calculate from the time of data collection). In this case, the discrepancy is modest and can be explained by the fact that the data include so many legacy materials that were recorded in the 1960s and 1970s. However, it illustrates the fact that with data collected at different points in time, age can be unreliable.
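The back-conversion from age to year of birth is simple subtraction from the collection year, and it shows why the same age maps onto different generations across studies. The speakers, ages, and recording years below are entirely hypothetical:

```r
# Hypothetical speakers of the same age recorded in different years.
speakers <- data.frame(
  indiv         = c("a", "b", "c"),
  age           = c(50, 50, 50),
  year_recorded = c(1975, 2003, 2018)
)
# Same age at interview, three different generations by year of birth.
speakers$yob <- speakers$year_recorded - speakers$age
speakers$yob
# 1925 1953 1968
```

This is why year of birth, not age, is the stable anchor when data sets collected at different times are compared.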

Figure 9.7 Conditional inference tree – hwat by age
Modifying Conditional Inference Trees
Conditional inference trees can be modified in a myriad of ways using colour and other adjustments, all of which assist in data exploration. Let’s look at a ctree by gender using the seven-member dependent variable for adjectives, dep_var2, with a minbucket of 200, (10). The plot command produces the visualisation in Figure 9.8.
(10)
Figure 9.018
ctree_dep_var2_2 <- ctree(dep_var2 ~ gender, data = adj,
    control = ctree_control(minbucket = 200))
ctree_dep_var2_2
plot(ctree_dep_var2_2)
In Figure 9.8 both men and women use the full range of seven variants and there seems to be very little difference in the probability of forms. Notice, however, the substantial white space in the plot and between the terminal nodes. The latter makes it difficult to see contrasts between men and women. Further adjustments can be made to support better visualisation.

Figure 9.8 Conditional inference tree – seven-way adj by gender
Recall that the names of the labels can always be changed (look back at the width required for Figure 9.5). In Figure 9.8 you will notice a similar issue. The tree must be plotted with a very long width to accommodate the labels of the adjectives, particularly ‘intensifier + good’. You can always change the labels! The code in (11a) changes the labels of dep_var2 to shorter ones. Check it with count, which prints out the tibble in (11b). This can then be plugged back into the code in (10) to get a ctree with shorter, or alternatively more interpretable, labels depending on how you choose to visualise the ctree (not shown).
(11a)
Figure 9.019
adj <- adj %>%
  mutate(dep_var2 = dep_var2 %>%
    fct_recode(
      "great"   = "great",
      "good"    = "intensifier+good",
      "other"   = "other",
      "awesome" = "awesome",
      "cool"    = "cool",
      "amazing" = "amazing",
      "lovely"  = "lovely"))
adj %>% count(dep_var2)
(11b)
Figure 9.020
# A tibble: 7 × 2
  dep_var2     n
  <fct>    <int>
1 great      884
2 good       687
3 other      591
4 awesome    214
5 cool       905
6 amazing    340
7 lovely      80
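If you prefer base R to forcats, the same relabelling can be done by reassigning the factor’s levels directly. A minimal sketch on an invented toy factor (the level names follow (11a)):

```r
# Toy factor (invented tokens) with the long label from (11a).
dep_var2 <- factor(c("great", "intensifier+good", "cool",
                     "intensifier+good", "lovely"))
# Rename the long level in place; all matching tokens follow along.
levels(dep_var2)[levels(dep_var2) == "intensifier+good"] <- "good"
table(dep_var2)
```

The result is the same as fct_recode: the level is renamed once and every token carrying it is relabelled.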
Many adjustments can be made to enhance the visualisation of a ctree. For example, the spaces between the bars inside the terminal nodes can be adjusted using ‘tp_args’. These are terminal panel arguments that specify different aspects of the tree to adjust its configuration. Another adjustment is to change the rotation of the labels on the columns for the levels of dep_var2. Finally, you can change the colours of the columns. Let’s try it for the ctree. Use the adjustments in (11c) and redo the model in (10). First, tp_args = list(...) sets the columns side by side, ‘beside = TRUE’; the rotation of the labels, ‘rot’, to 30 degrees; the maximum value of the y-axis, ‘ymax’, to ‘.3’; and the ‘gap’ between columns to ‘0’. Then ‘fill’ specifies the colour of each column, seven colours for seven columns. Were we to have colour in this book, we would get a visual with colours; however, Figure 9.9 shows the grey-scale version. Try it on your own computer to see the colours.
(11c)
Figure 9.021
plot(ctree_dep_var2_2,
  tp_args = list(beside = TRUE, rot = 30, ymax = .3, gap = 0,
    fill = c("grey", "black", "orange", "blue", "green", "pink", "purple")))
Figure 9.9 shows the impact of making adjustments to the parameters of the code in (11c). On your computer it will produce coloured terminal nodes, offering a clearer picture of the adjectives of positive evaluation used by men and women. Minor differences become more obvious: men use more great. Women use more intensifier + good (now recoded as good) and more lovely. Keep the coding sequence in (11a) close by so that you can use it as a model to relabel as you see fit.

Figure 9.9 Conditional inference tree – seven-way adj_pos by gender, adjustments
Troubleshooting Conditional Inference Trees
In some cases, ctree analysis does not provide unequivocal results, especially when dealing with small data sets. Gries (2018:17) points out that a ctree analysis run on a small versus a large data set ‘may not return an optimal tree in terms of accuracy, parsimony, and effect interpretations simply because one strong predictor chosen for the first split may overpower everything else’, something that will happen in a Zipfian distribution of a data set. As we have already seen in the counts for adj_pos in Chapter 7, Zipfian distributions happen often in sociolinguistic data sets. We are also warned that ‘a classification tree makes its splits based on local best performance’ (Baayen et al. 2013:265). When you use these exploratory tools, keep this in mind.
Gries’ (2018) more general advice is to test different hyperparameters of the analysis, such as (1) the minimum decreases in deviance that define when trees stop splitting, (2) the minimum sample size per node, or (3) the tree depth. We have already begun this kind of exploratory analysis earlier when we set the minimum bucket size due to linguistically unmotivated splits in ctree analysis (see e.g. Figures 9.4–9.5).
To adjust the splitting criterion, use ‘mincriterion’ and specify the threshold you want. A split is implemented only when 1 minus the p-value of the association test exceeds this threshold. A small value (e.g. mincriterion = 0) will make the tree split even when the evidence for a split is weak, that is, a more permissive tree. A larger value will make a split only when the evidence is strong, a more conservative tree. I will show an example with ‘mincriterion = .995’ (12). The name of the ctree will print out the results in table format and the plot command produces the visualisation in Figure 9.10.
(12)
Figure 9.022
ctree_hwat_social_2 <- ctree(dep_var ~ yob + comm1 + occ1 + edu1 + gender,
    data = hwat, control = ctree_control(mincriterion = .995))
ctree_hwat_social_2
plot(ctree_hwat_social_2)
Figure 9.10 exposes a new nuance for hwat. The main splits remain community and year of birth; however, now you can see new distinctions between the communities. In Parry Sound the variable is contoured by occupation with the somewhat surprising result that white-collar workers use more hw. In the Ottawa Valley there is a more complex development that shifts over time. Individuals born in 1932 and before exhibit a parallel pattern of more hw with white-collar workers; however, only among those with less post-secondary education (node 8), and the count is low. After 1932, the pattern reverses and the single relevant contrast is with occupation: blue-collar workers are the ones who are predisposed to hw (node 13).

Figure 9.10 Conditional inference tree – hwat, social factors
Be warned, even small differences in mincriterion can make a dramatic difference. In this case, when the mincriterion is adjusted to ‘mincriterion = 1.0’, the returned ctree shows no significant factor groups. As ever, best practice is to try it and find out what works best to describe the data. Also be aware that an overly complex tree is useless for understanding your data.
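The mechanics behind this are straightforward: a split is implemented only when 1 minus the p-value of the association test exceeds mincriterion, so a threshold of 1.0 can never be met. A schematic base-R check of that decision rule (a simplified sketch, not the partykit internals; the function name split_ok and the p-values are invented for illustration):

```r
# Schematic split decision rule: split when (1 - p) > mincriterion.
# This is an illustration of the logic, not partykit's implementation.
split_ok <- function(p_value, mincriterion = 0.95) {
  (1 - p_value) > mincriterion
}

split_ok(0.04)                        # TRUE with the default 0.95
split_ok(0.04, mincriterion = .995)   # FALSE: the stricter tree does not split
split_ok(0.001, mincriterion = .995)  # TRUE: strong evidence still splits
split_ok(0.001, mincriterion = 1.0)   # FALSE: 1 - p can never exceed 1
```

This makes the dramatic jumps visible: a predictor with p = 0.04 survives the default threshold but disappears at .995, and nothing at all survives 1.0.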
Let’s now consider tree depth using the ‘maxdepth’ parameter, which controls the number of branching levels of the tree. Here we use a maxdepth of 4, (13).
(13)
Figure 9.023
ctree_hwat_social_4 <- ctree(dep_var ~ yob + comm1 + occ1 + edu1 + gender,
                             data = hwat,
                             control = ctree_control(maxdepth = 4))
ctree_hwat_social_4
plot(ctree_hwat_social_4)
Figure 9.11 shows that the main splits remain community and year of birth. However, this model returns a slightly different perspective. Because the depth is set at four branching levels, the ctree is more complex. Now you can see that there are further differences between Parry Sound and the Ottawa Valley. Comparing the two places down from node 2 for Parry Sound and node 10 for the Ottawa Valley reveals that education level is not significant for Parry Sound, but it is for the Ottawa Valley. Further down the branches, occupation emerges as significant in the Ottawa Valley, which supports the interpretation that education once influenced the use of hw, but that in the later decades of its use, it receded to blue-collar workers (node 15). Note that in both places there is virtually no hw in the last time period (see nodes 8 and 17).

Figure 9.11 Conditional inference tree – hwat adjusted with maxdepth 4
These different ctrees offer alternative perspectives on variable (hwat). The job of the analyst is interpretation and explanation. Which results are most compelling? What patterns are relevant? Importantly, which ones provide the best evidence for the explanation?
This exercise, following through on Gries’ (2018: 17) suggestions, exemplifies how to conduct different types of analyses on the same data set and the impact this has on the visualisation of the results. It also finely illustrates Baayen’s (Reference Baayen, Janda, Nesset, Endresen and Makarova2013:265) point that ctree analysis is sensitive to local parameters. The advantage of the ctree is that the analyst can see how the parameters pattern, further up or further down in the tree, depending on how the analysis is configured. It is important to test out different ways of visualising the data using ctrees as a prequel to other types of analyses. In this process, use deductive reasoning and your analytic and linguistic judgement. Watch for nodes with small numbers, splits in the tree that do not make linguistic sense, overly complex trees with many branches, and so on. No single tree with all factors thrown in indiscriminately will provide the answer, so build and modify and find out where the main results hold true. Then move on to random forests and mixed effects modelling to triangulate all the evidence.
Saving Conditional Inference Trees
You will want to save your ctree visualisations. One way is to use the ‘Export’ command in the plots window (see Figure 9.12), where you see a drop-down menu with ‘Save as Image … ’. This method enables you to choose the dimensions and the type of the file. Figure 9.12 required export of the plot at 1200 pixels wide and 400 pixels high, using the ‘Save as Image … ’ option under ‘Plots’ and ‘Export’, to get an ideal view.

Figure 9.12 How to save a ctree to your computer
You can also save plots with the code in (14). The first line specifies a png file, the name of the file, and specifications of height, width, units (e.g. inches), and resolution in dots per inch (DPI). The ‘png’ command at the beginning creates the file type.
(14)
Figure 9.024
png('ctree_dep_var2_2.png', height = 6, width = 10, units = 'in', res = 300)
ctree_dep_var2_2a <- ctree(dep_var2 ~ gender, data = adj,
                           control = ctree_control(minbucket = 200))
ctree_dep_var2_2a
ctree_dep_var2_2 <- plot(ctree_dep_var2_2a,
                         tp_args = list(beside = TRUE, rot = 45, gap = 0,
                                        fill = c('grey', 'black', 'orange', 'blue',
                                                 'green', 'pink', 'purple')))
plot(ctree_dep_var2_2a, terminal_panel = node_barplot,
     tp_args = list(beside = TRUE, rot = 45, gap = 0,
                    fill = c('grey', 'black', 'orange', 'blue',
                             'green', 'pink', 'purple')),
     main = 'ADJECTIVES OF POSITIVE EVALUATION')
dev.off()
In the coding sequence beginning with ‘plot’, a title is added: ‘main = “ADJECTIVES OF POSITIVE EVALUATION”’. Remember to close the graphics device after adding the plot with ‘dev.off()’, as in the last line of (14). To view the plot, search for the file in your directory with the label ‘ctree_dep_var2_2.png’.
Random Forests
A random forest (henceforth, cforest) is a computationally intensive but high-precision non-parametric classifier (Breiman, Reference Breiman2001). I will employ the partykit package to demonstrate them. Cforests are an ensemble learning method that works through a data set by trial and error to establish whether a factor group is a useful predictor of variant choice or not. According to its official description, the method is ‘an implementation of the random forest and bagging ensemble algorithms utilizing conditional inference trees as base learners’.Footnote 2 In so doing, it constructs many ctrees to create the cforest. The advantage of the cforest is that the model works with samples of factor groups, which obviates one of the thorny methodological issues in variationist research. Earlier I said that inexperienced researchers often throw many factors into the same analysis, running the risk that some may overlap (i.e. not be orthogonal). With a cforest this does not matter. Similar factor groups can be run in the same analysis. The analyst can easily view which ones are more important than others.
Where’s the Forest?
A random forest, which uses the cforest function, should be able to use any formula specification that the ctree function does (i.e. the same code), including a multinomial dependent variable. In general, the cforest is the same as the ctree for any given model, except that it avoids overfitting by fitting many slightly reduced or randomised versions (different subsets of variables and data). This produces a reliable analysis of variable importance.
The power of the cforest technique for variation analysis comes from the fact that you can put into the analysis the full set of predictors that had been coded into the data. The analysis returns a visualisation of the relative importance of the predictor variables when all of them are considered simultaneously, without concern for overlap among them. Many different variables, even those that seek to capture similar underlying phenomena but use different factor levels (configurations), can be included and explored together. This is not possible in logistic models.
I still recommend coding all factors (predictors) hypothesised to affect linguistic variables in as elaborated a fashion as possible and then ‘honing the analysis’ down to the best model of the data. The reason for this is the likely extensive covariation across factor groups, empty cells, and extreme differences in cell counts that are typical of natural speech data. Now, with the methodological assistance of a cforest analysis, the analyst does not have to be concerned about overlapping factor groups because it is possible to throw even interacting ones into the analysis at the same time and let the analysis evaluate their relative importance. Of course, such a strategy should not be substituted for the goal of finding a linguistically reasoned model. The adage of ‘garbage in, garbage out’ applies nonetheless; however, this new tool offers the analyst at the very least a preliminary view of the nature of the data set and the relative influence of the predictors.
Variable Importance
To establish the relative importance, that is, strength of factor groups, it is a good idea to begin with the factor groups that have been established based on results arising from ctree analyses. After conducting a series of ctree analyses, you will have a much clearer view of your data. Some factor groups may have been collapsed; some factors may have been too sparse to consider; others may have made unmotivated divisions.
Make sure you have loaded the partykit package into R as well as your data sets. Using the cforest function, specify the dependent variable and the factor group(s) you want to include in the analysis. Once you have executed the code, print the result, and plot it. The examples that follow (15–19) will not directly generate analyses from the example data sets we are using. Those procedures will come in the next section.
The template for a cforest is shown in (15).
(15)
Figure 9.025
cforest_model <- cforest(dependent_variable ~ factor_group1 + factor_group2 + factor_group3,
                         data = your_data)
cforest_model
plot(cforest_model)
In a cforest analysis, all factor groups must be factorial. You can save your random forest model as an RDS file (an R data file). This ensures that the analysis, which is often lengthy, does not have to be rerun. The name of the file should end with ‘RDS’, (16).
(16)
Figure 9.026
saveRDS(forest_model, "forest_model.RDS")
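In a later session, the saved model can then be reloaded with readRDS rather than refitting the forest. A minimal sketch:

```r
# Reload a previously saved cforest model from disk instead of rerunning it
forest_model <- readRDS("forest_model.RDS")
```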
Use the following helper codeFootnote 3 to support formatting the data frame that results from the analysis. The varimp_helper converts the variable importance output, varimp, into a data frame that can be used for plotting with ‘ggplot’, an R package for creating graphics. Visualisation with ggplot will be introduced in Chapter 11. It allows easy production of a dotplot of the variable importance values from the random forest model. This is useful because ggplot requires a data frame as input; however, the ‘partykit::varimp()’ output is a labelled vector (17). Note that if you have already loaded the varimp_helper in the set-up of your RMD file, you do not need to do it again.
(17)
Figure 9.027
varimp_helper = function(varimp_vector){
  varimp_df = data.frame(
    variable = names(varimp_vector),
    importance = varimp_vector)
  varimp_df = varimp_df %>%
    arrange(importance)
  return(varimp_df)
}
When you calculate the variable importance with a cforest analysis, use ‘conditional = TRUE’ if your data are complex and have interactions, (18). However, this procedure will take much longer, sometimes many minutes.
(18)
Figure 9.028
forest_model_varimp = forest_model %>% varimp(conditional = TRUE)
If the varimp took a very long time, my advice is to save the results as a tsv file, using write_tsv, (19).
(19)
Figure 9.029
forest_model_varimp %>%
  varimp_helper() %>%
  write_tsv('forest_model_varimp.tsv')
If ‘varimp(conditional = TRUE)’ fails with an error message, you can substitute ‘varimp(conditional = FALSE)’ in (18). This method is less precise but takes far less time. For less complex models, it often differs little from the lengthier method. Try both and observe what happens.
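For reference, the faster substitution looks like this, assuming the forest_model object from the earlier examples:

```r
# Faster but less precise: unconditional permutation variable importance
forest_model_varimp = forest_model %>% varimp(conditional = FALSE)
```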
Using Random Forests
Cforest analyses probe many slightly different trees and data subsets to find the most robust patterns. The ‘variable importance’ provides guidance for further modelling, and an additional perspective on the contrasts visible using ctrees. The overall goal is to identify strong patterns in your data, that is, how a factor group contributes to the predictive performance of the model.
Variable (hwat)
Let’s look at hwat using a random forest analysis. First, remind yourself of what is in the data file, the factor groups, and their levels, to determine what to put into the analysis using summary, (20), which produces the output summarised in Figure 9.13.
(20)
Figure 9.030
summary(hwat)

Figure 9.13 Screenshot of hwat summary
The factor groups you put into the random forest should be based on good linguistic sense. For example, you would not want to put the two versions of the dependent variable, dep_var and dep_var1, nor the multiple categorisations of community, education, and occupation into the same analysis. Recall that alternative versions of occupation and education have recoded ‘NA’ values in ‘edu1’ and ‘occ1’, such that they are collapsed into the other categories. This technique supports more robust analysis and is based on knowledge of the communities; for example, the NA values are likely to exist among less-educated and blue-collar individuals. The ‘pre_seg’ and ‘fol_seg’ factor groups have many levels; recall too that in earlier chapters ‘pre_seg’ was recoded to ‘pre_vc’. A quick count of the ‘word’ factor group in variable (hwat) returns sixty-two levels. It is not a good idea to put such an elaborate factor group into the random forest either, as it will make it unnecessarily complex. Of course, you can always try it and see what happens.
A first random forest, ‘hwat_forest1’, could put the social factor groups, along with grammatical category and the preceding segment, into a single analysis, (21a).
(21a)
Figure 9.031
hwat_forest1 <- cforest(dep_var ~ gram_cat + yob + comm1 + gender + edu1 + occ1 + pre_vc,
                        data = hwat)
hwat_forest1_varimp <- varimp(hwat_forest1)
imp_min_value = hwat_forest1_varimp %>% min() %>% abs()
hwat_forest1_varimp %>%
  varimp_helper() %>%
  arrange(-importance) %>%
  kable(caption = 'Table of Variable Importance for hwat_forest1', digits = 3)
The coding sequence in (21a) first runs the random forest on the hwat data; each factor group that is added requires a plus sign. The code then extracts the variable importance from the model, calculates the minimum absolute value in the range of variable importance, and uses the varimp_helper to arrange the factor groups by variable importance. Finally, kable creates a table of the results and ‘caption’ adds a title, (21b).
(21b)
Figure 9.032
Table: Table of Variable Importance for hwat_forest1

|variable | importance|
|:--------|----------:|
|comm1    |      0.488|
|occ1     |      0.143|
|edu1     |      0.098|
|gram_cat |      0.042|
|gender   |      0.036|
|pre_vc   |      0.004|
The variable importance values are a measure of the contribution of each factor group to the performance of the model, indicating those that are more and less influential. These values must be interpreted in the context of the model and should not be compared across models. As ever, refer to the documentation for further information.
Next, let’s create a ggplot version of the varimp plot. More information and details on plotting with ggplot are found in Chapter 11. For now, simply use the code as presented, (21c). The visualisation is shown in Figure 9.14. Note that you can include the name of your fitted model for future reference.
(21c)
Figure 9.033
hwat_forest1_varimp_plot <- hwat_forest1_varimp %>%
  varimp_helper() %>%
  ggplot(aes(x = importance, y = variable %>% fct_reorder(importance))) +
  geom_col(width = 0.8) +
  geom_vline(xintercept = imp_min_value, color = 'dark grey',
             linetype = 'dashed', linewidth = 1) +
  theme_bw() +
  labs(x = "Variable Importance",
       y = "Predictors",
       subtitle = "hwat_forest1",
       title = 'Variable Importance for hwat_forest1')
plot(hwat_forest1_varimp_plot)
Figure 9.14 shows that community and year of birth are the most important factor groups, followed by occupation, education, gender, and grammatical category (i.e. content versus function). The effect of a preceding segment is below the point of minimal importance.

Figure 9.14 Random forest analysis of hwat – social factor groups
The command ‘ggsave’ will save the most recent plot. Save it in the format desired (e.g. png or tiff), (21d). For publications, the style sheet will typically ask for high-resolution files.
(21d)
Figure 9.034
ggsave('hwat_forest_1.tiff')
You can also inspect one of the trees in the forest, with ‘gettree’, (21e), but I will not print it here.
(21e)
Figure 9.035
hwat_forest1 %>% gettree()
Variable (adj_pos)
Let’s look at variable (adj_pos) using a random forest analysis. Recall that this is a multi-variant linguistic variable, with ‘dep_var2’ having been recoded as a seven-way dependent variable. You can probe it using count (22a). This time I’ve added ‘arrange’ with the stipulation ‘(-n)’, which arranges the output in descending order of frequency, and adorn_totals, which adds the total, (22b).
(22a)
Figure 9.036
summary(adj)
adj %>%
  count(dep_var2) %>%
  arrange(-n) %>%
  adorn_totals()
(22b)
Figure 9.037
dep_var2      n
cool        905
great       884
good        687
other       591
amazing     340
awesome     214
lovely       80
Total      3701
The data file also has binary dependent variables set up for each of the main adjectives. Keep in mind that while conditional inference trees are ideal for examining continuous variables, random forests must be run on factorial variables, those that have categories. Factor groups such as yob, age_estimate, and dec_1 cannot be put into the model. However, we can recode any one of these into categories (e.g. yob could be recoded into ten-year increments and labelled dec10, (22c)).
(22c)
Figure 9.038
# recode yob into 10-year increments
adj <- adj %>%
  mutate(
    dec10 = yob - (yob %% 10),
    dec10 = as.factor(dec10))
In the case of factorial variables, it is also sometimes important to reorder them appropriately. To reiterate the relevelling procedure, let’s take the ‘bymo’ factor group, which could be ordered consecutively, ‘old’, ‘middle’, ‘young’, ‘adolescents’ (22d). Note that you can use ‘tabyl’ or count to view the factor group. Then reorder it using mutate and fct_relevel.
(22d)
Figure 9.039
# reorder bymo, an ordered factorial variable
adj %>% tabyl(bymo)
adj %>% count(bymo)
adj <- adj %>%
  mutate(
    bymo = fct_relevel(bymo, 'O', 'M', 'Y', 'B'))
There is an important difference in modelling an unordered (factorial) versus an ordered (numeric) predictor (Tagliamonte & Baayen, Reference Tagliamonte and Baayen2012:172, fn. 9). In the former, the classification tree will try all possible splits, and there could be many. With an ordered factor group, however, the model is much more constrained, due to the intrinsic order of the factor levels. This means that if the order is appropriate, the result of the analysis will be more linguistically sensible.
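To make this ordering explicit to the model, the factor can also be declared as ordered. The sketch below is hypothetical — the new column name bymo_ord is invented — and assumes the adj data frame and dplyr are loaded:

```r
# Hypothetical sketch: declare bymo as an ordered factor so that tree-based
# splits respect the old-to-adolescent ordering (bymo_ord is an invented name)
adj <- adj %>%
  mutate(bymo_ord = factor(bymo, levels = c('O', 'M', 'Y', 'B'), ordered = TRUE))
```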
Another way to partition the yob factor group would be to use a recode of yob that is justified by the breakpoints discovered through ctree analysis, using the format in (29a), Chapter 8; however, for the adjective awesome, there is no straightforward way to justify that decision due to the bifurcated trajectory by yob. As you recall, there was an anomaly among individuals born in the 1990s (Figure 9.6). However, for other variables, using the breakpoints from a conditional inference tree for data exploration and modelling may prove to be ideal.
If you happen to notice that your random forest results are different from the ones illustrated in this book, remember that no two random forests will be precisely the same. Why? Because they are random. Random forests use bootstrap sampling to generate multiple subsets of the data for building each tree so that each random forest model is built out of its own unique set of trees.
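If you need the same forest to come back on every run — for example, so that a published figure can be reproduced — set the random seed before fitting. This is standard R practice rather than anything specific to this data set, and the seed value itself is arbitrary:

```r
# Fix the random seed so that the bootstrap samples, and hence the forest,
# are identical across runs (the seed value is arbitrary)
set.seed(1234)
hwat_forest1 <- cforest(dep_var ~ gram_cat + yob + comm1 + gender + edu1 + occ1 + pre_vc,
                        data = hwat)
```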
Cartesian Coordinates
‘Coord_cartesian’ specifies the scale of the x-axis of the random forest plot. Let’s try it with a simple random forest of the adj_pos data, probing the effect of the factor groups on the use of each of four of the adjectives, treated as a binary variable versus the other adjectives; see the counts in (22b) for reference.
To illustrate, let’s compare across random forest analyses of the binary dependent variables for (adj_pos). I have created a factorial factor group for decade of birth in ten-year increments (‘dec10’). The other factor groups are occupation (‘occ2’), ‘gender’, education level (‘edu’), and type of adjective (‘type’). The dashed grey line marks the minimal importance divide. Let’s begin with the adjective great (23a), which represents 884 tokens.
The random forest is produced by a sequence of code. First, create the random forest with specifications for the factor groups included, specifying which data file is to be used. Then compute the variable importance, varimp, and arrange the factor groups in order of importance from the most important to the least. The code also asks for a dashed line at the point of the factor group with least importance. Minimum importance is determined relative to the values of the model and must be interpreted in the context of the model. Finally, kable outputs the results in (23b), specifying three decimal places.
(23a)
Figure 9.040
adj_forest_great <- cforest(great ~ dec10 + gender + edu + occ2 + type, data = adj)
adj_forest_great_varimp = varimp(adj_forest_great)
imp_min_value = adj_forest_great_varimp %>% min() %>% abs()
adj_forest_great_varimp %>%
  varimp_helper() %>%
  arrange(-importance) %>%
  kable(caption = 'Table of Variable Importance for adj_forest_great', digits = 3)
(23b)
Figure 9.041
Table: Table of Variable Importance for adj_forest_great

|variable | importance|
|:--------|----------:|
|dec10    |      0.217|
|occ2     |      0.138|
|gender   |      0.109|
|edu      |      0.103|
|type     |      0.019|
The next step is to send the findings to ggplot for visualisation (23c). The line that begins with a hashtag and coord_cartesian controls whether the output will be on an absolute scale or a relative scale. As given, with the hashtag in place, the code produces the visualisation in Figure 9.15.
(23c)
Figure 9.042
adj_forest_great_varimp %>%
  varimp_helper() %>%
  ggplot(aes(x = importance, y = variable %>% fct_reorder(importance))) +
  geom_col(width = 0.1) +
  geom_vline(xintercept = imp_min_value, color = 'dark grey',
             linetype = 'dashed', linewidth = 1) +
  # coord_cartesian(xlim = c(0, 1)) +
  labs(x = "Variable Importance - GREAT",
       y = "ALL Predictors",
       tag = '',
       title = 'Variable Importance for GREAT',
       subtitle = "Dashed grey line at minimum importance.")
Figures 9.15 and 9.16 illustrate the difference coord_cartesian makes to the output. In Figure 9.15 all the white space is filled up for a visually pleasing display, with a range from 0.00 to about 0.25.

Figure 9.15 Random forest analysis of great cartesian co-ordinates, default
In Figure 9.16, the hashtag has been removed, allowing the coord_cartesian line to be processed. The output will then display the entire scale from 0 to 1. Recall that the hashtag renders the line of code invisible to R processes.

Figure 9.16 Random forest analysis of great cartesian co-ordinates, 0–1.
The contrast between the two visualisations is stylistic. In Figure 9.15 the relative ranking is more visible, and in Figure 9.16 there is more blank space. However, the option in Figure 9.16 is the best way to visualise the relative importance of factor groups across analyses of different adjectives.
Consider the same cforest model for the adjective lovely, one of the least frequent of the adjectives (Figure 9.17). It is a simple matter of adapting the code in (23a) to focus on lovely, which produces the visualisation in Figure 9.17. The visualisation shows that the predictors for lovely operate at a very low level compared to those in Figure 9.16. While the decade of birth of the individuals is the most important in both analyses, notice that it is modest for lovely compared to the results for great, and type of adjective is relatively high in importance for lovely but not for great. These findings add to the building picture of the adjectives of positive evaluation. However, it will be important to carefully assess the differences between the cforest analyses and those arising from ctree analysis and, importantly, from mixed effects modelling (Chapter 10). Try conducting a random forest for one of the other adjectives (e.g. cool) to determine how the results compare with great and lovely.

Figure 9.17 Random forest analysis of lovely cartesian co-ordinates, 0–1
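As a sketch of the suggested exercise, the specification in (23a) can be refocused on cool, assuming the adj data file contains a binary dependent variable named ‘cool’ (the object names here are invented):

```r
# Hypothetical sketch: the same cforest specification, refocused on cool
# (assumes adj has a binary dependent variable 'cool'; object names invented)
adj_forest_cool <- cforest(cool ~ dec10 + gender + edu + occ2 + type, data = adj)
adj_forest_cool_varimp <- varimp(adj_forest_cool)
adj_forest_cool_varimp %>%
  varimp_helper() %>%
  arrange(-importance) %>%
  kable(caption = 'Table of Variable Importance for adj_forest_cool', digits = 3)
```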
Troubleshooting Random Forests
Random forests should not be interpreted based solely on one model, nor without scrutiny and comparison. Gries (2018:21) raises important issues regarding indiscriminate use of random forest analyses. He suggests that some studies using random forests have interpreted variables with high importance scores as main effects even though the high importance score could have been due to the variable participating in an interaction which was not explored. He also considers it problematic when random forests are interpreted from a single tree, arguing that ‘tree-based approaches can in fact be much less good at being parsimonious and at detecting interactions than is commonly assumed’. The recommended antidote for these potential pitfalls is to run the random forest with conditional = TRUE. This computation does a much more careful job of evaluating interactions. However, it takes a very long time. For example, for Figure 9.14 the computation with conditional = TRUE took so long that I did not wait for it to finish; conditional = FALSE took only a few minutes. The visualisations produced by the two analyses with ‘adj_forest2_great_varimp’ were much the same (not shown). Note that attempting to run ‘conditional = TRUE’ may not be feasible on a laptop; the procedure requires an enormous amount of processing.
Another alternative is to conduct many or at least several random forests and find out where the similarities and differences lie. This can potentially highlight where the interactions are and support how the analyst should construct a mixed effects model. I will advocate for a comparative approach, the strategy of triangulating across different tools. Compare the results of conditional inference trees analysis with the results from random forest analysis. Use the results to understand the data more comprehensively. Then, utilise all the findings from these exploratory endeavours to inform mixed effects regression. The random forest is part of a suite of tools that should ideally be considered together.
In working with highly unbalanced designs and complex interactions – as with most sociolinguistic samples – conditional inference trees and random forests are more flexible and may yield better models than conventional statistical modelling (see Tagliamonte & Baayen, Reference Tagliamonte and Baayen2012). However, for large data sets with multiple random effect factors with many levels, time to completion can be long, making them less useful on personal computers.
Saving Random Forests
Use the ggsave command on the line following the plot code to save the random forest plot. Choose the best format (e.g. tiff or png), and add specifications for height and width, (24). A multitude of other ggplot cosmetics could also be added; check online user manuals for instructions.
(24)
Figure 9.043
ggsave('forest_modal_varimp_plot.tiff', height = 4, width = 8)
ggsave('forest_modal_varimp_plot.png', height = 2, width = 6)