1. Introduction
Anaphora is a central part of language use. It is a cohesive relationship in which the meaning of one linguistic item can be interpreted precisely on the basis of the meaning of another (Carter, 1987: 33). During utterance production, speakers commonly have a variety of forms available for the anaphor. In Mandarin, speakers can choose among several forms and still expect the intended referent to be understood, including the pronoun, the noun phrase and zero anaphora, as illustrated in Example (1).

Several studies hold that, in a given context, the choice of anaphoric forms is influenced by motivators ranging from the syntactic level to the discourse level. At the syntactic level, the syntactic position of antecedents could influence the anaphoric choice (Arnold, 2001; Guan & Arnold, 2021; Jiang, 2004; Lam & Hwang, 2022). At the semantic level, semantic information such as predictability (Weatherford & Arnold, 2021) and gender (Arnold & Griffin, 2007; Chiriacescu, 2015; Fukumura et al., 2011) is considered to influence speakers’ choice. At the discourse level, topicality (Xu, 2003a; Zhang, 2019), distance (McCoy & Strube, 1999; Xu, 2003: 103) and repetition (Ariel, 1988; Tyler, 1994) are argued to influence the choice of anaphoric forms. However, previous research mainly focussed on anaphora resolution, and most conclusions were drawn from that research, so direct evidence from research on anaphoric forms is lacking.
In deciding on a referential choice, motivators do not work alone but compete in their influence on the anaphoric choice. Ariel (1988, 1990), Hegarty et al. (2001), Xu (2003b) and Gundel et al. (2003) have used introspective methods and corpus-based descriptive analysis to examine the competition of motivators. Few studies have applied quantitative analysis to the competition and interaction of motivators at the same level or between different levels.
The competing motivators on these linguistic levels are discussed in terms of accessibility. Accessibility is the relative ease with which the speaker can refer to an entity, and different referential forms reflect or mark the accessibility in the mental space (Ariel, 1990: 11). Studies have tried to verify the theory empirically (e.g. Cowles et al., 2007; McCoy & Strube, 1999; Poirier et al., 2012) and have extended it on the basis of corpus analysis (e.g. Gutman, 2004; Xu, 2003b). Yet previous studies have focussed on the influence of a single factor or on interactive effects in verifying accessibility theory, and few have analysed multiple factors in natural discourse and constructed a systematic framework containing the significant factors.
Among studies of anaphoric forms, a growing number have focussed on anaphora in causals. Causals are sentences that express a cause-and-effect relationship (Lyons, 1977: 488–492). In causals, one event or state is described as the cause or result of another event or state, as shown in Example (1). Researchers have analysed causals to reveal the influence of factors such as predictability (e.g. Garnham et al., 2021; Koornneef & Van Berkum, 2006; Zhang, 2019), topic structure and topicality (e.g. Lam & Hwang, 2022; Xu, 2003a; Zhang, 2019). In general, approaching anaphoric forms via causals provides researchers with a context in which more motivators can be identified and investigated.
To probe into motivators that influence the choice of anaphoric forms, this study adopts a machine learning method with causals in Mandarin as the major object to address the following two questions:
1) What is the distribution of different anaphoric forms in Mandarin causals?
2) What motivators can significantly influence the choice of anaphoric forms? And how do these motivators influence the choice of anaphoric forms?
2. Predictors of anaphoric form
In this section, we summarize the possible motivators that could influence the choice of anaphoric forms at the syntactic, semantic and discoursal levels. These motivators were retrieved through literature analysis on anaphora and bottom-up data processing, and they form the foundation of the annotation framework in Section 4.
2.1. Syntactic motivators
Syntactic position is a motivator of concern in this study. Some studies found that pronouns were preferred when the antecedent was the subject rather than the object (Arnold, 2001; Guan & Arnold, 2021). Studies on anaphora in Mandarin concluded that more zero pronouns were used when the antecedent was the subject (Jiang, 2004; Lam & Hwang, 2022). In this study, the syntactic position of antecedents was annotated to explore its effect on anaphoric forms (annotated as NpSyn).
2.2. Semantic motivators
2.2.1. Animacy of antecedent and relation of animacy
The animacy of the antecedent may also constrain the choice of anaphoric forms. X. Xu (2020) and Schwenter (2015) noted that object antecedents characterized by high animacy were likely to be expressed as pronouns and zero pronouns. Fukumura and van Gompel (2010) also found in their experimental research that animacy influenced anaphoric choice: when an antecedent and its competitors differed in animacy, the anaphor was more likely to take the form of a pronoun, whereas when they did not differ, the anaphor was more likely to be a noun. Previous studies thus indicate that not only the animacy of antecedents but also the relation between antecedents and their competitors in terms of animacy influences anaphoric forms.
However, the results of previous studies contradict each other for contexts where the antecedent is inanimate and its competitor is animate: anaphoric forms should be noun phrases according to Lu (2002) and X. Xu (2020), but pronouns according to Fukumura and van Gompel (2010). Therefore, this study analyses the animacy of the antecedent (annotated as Npani) and the relation of animacy (annotated as Reani).
2.2.2. Potential referential interference
Potential referential interference refers to the semantic information of competitors that creates ambiguous anaphora (Fukumura et al., 2010). Gender cues are investigated as one potential referential interference. Some studies indicated that noun phrases were preferred when the antecedent and its competitor had the same gender and that pronouns took a large proportion of the output when their gender differed (Arnold & Griffin, 2007; Garnham et al., 1992). However, different results were found by Fukumura et al. (2011) and Chiriacescu (2015), indicating that noun phrases were the preferred expressions regardless of gender differences as long as antecedents denoted human beings. Furthermore, Fukumura et al. (2010, 2011) found that visual cues were another potential referential interference. When the antecedent and its competitor were visually similar, such as both being on horseback, speakers produced a greater number of noun phrases; in contrast, when the two were visually distinct, speakers tended to use more pronouns. Yet this result has not been verified in discourse. Therefore, potential referential interference is studied to explore its influence on the choice of anaphoric forms (annotated as PotRefe).
2.2.3. Predictability
Predictability denotes the linguistic phenomenon whereby the semantic information in the preceding clause influences the probability that an entity appears in the subsequent clause (Koornneef & Van Berkum, 2006). Many studies have argued that the semantic structure of verbs contributes to predictability, influencing which entity people are most likely to refer to next (e.g. Garvey & Caramazza, 1974; Guan & Arnold, 2021). When completing the sentence fragment in Example (2a), people tend to start their completion by referring to Xiao Ming as the subject of the subsequent clause (e.g. because he/she/Xiao Ming needs some help), whereas when completing the fragment in Example (2b), they tend to refer to Xiao Li (e.g. because he/she/Xiao Li is sloppy). Such completion preferences have been assumed to occur because verbs such as da3 rao3 ‘disturb’ have a semantic bias that attributes causality to the first-mentioned noun phrase (NP1) and are termed NP1-biased verbs, whereas verbs such as tao3 yan4 ‘hate’ have the opposite bias, attributing causality to the second-mentioned noun phrase (NP2), and are termed NP2-biased verbs. Of interest is whether the choice of anaphoric forms is influenced by such predictability. The degree to which predictability influences anaphoric forms has been under significant debate (cf. Arnold, 2001; Bott et al., 2023; Jaeger, 2010; Kaiser et al., 2011; Rohde & Kehler, 2014; Weatherford et al., 2021). In this study, the degree of predictability is explored (annotated as Pred).

2.3. Discoursal motivators
2.3.1. Competition
Competition refers to the number of competitors in the reader’s mental space (Ariel, 1988). Low competition was assumed to lead to reduced expressions, while high competition presupposed the encoding of complex forms. In this research, a competitor was identified as an entity in the same sentence as the antecedent and relating to the same topic, in either explicit or implicit form (annotated as Comp), for example:

As illustrated in Example (3), the antecedent is an implicit form, while its competitor is an explicit form (Comp = 1).
2.3.2. Repetition
Repetition indicates the reoccurrence of the antecedent. Previous studies found that when an antecedent was mentioned more frequently, referential expressions were more likely to be represented in reduced forms, such as pronouns and zero pronouns (Ariel, 1988; Xu, 2003: 86; Tyler, 1994). In this study, we identify the repetition of antecedents (annotated as Rep).
2.3.3. Distance
Ariel defined distance as the distance between the anaphor and the last-mentioned antecedent (Ariel, 1990: 28). Distance is considered one of the crucial motivators determining the choice of anaphoric forms, and long distance is assumed to contribute to complex forms in theoretical and empirical research (Ariel, 1990; McCoy & Strube, 1999). However, Xu (2003: 52–55) argued that an entity is mentioned repeatedly in natural discourse and that the corresponding referential expressions appear one after another in sentences, varying from zero pronouns to noun phrases, which obscures the effect of distance. Therefore, to analyse the influence of distance on the choice of anaphoric form, this study annotates the distance between the antecedent and anaphor as Dis.
2.3.4. Topic and topic structure
Ariel (1990: 28–29) proposed that when an entity was the topic, its salience was higher, facilitating the processing of reduced referential expressions. Xu (2003) explored anaphora from the perspective of topic continuity, arguing that when the antecedent was new information, its topic continuity was at its highest, making its anaphor more likely to be expressed as a zero pronoun. Xu (2003a) also emphasized the influence of topic structure on salience. He found that noun phrases serving as topics before the conjunction possessed higher salience, leading to greater accessibility and a higher likelihood of using zero pronouns (Xu, 2003a, 2018; Zhang, 2019). These studies indicate that both topic and topic structure can influence anaphoric forms, but their joint effects remain unclear. To annotate the topic and topic structure, we adopted the definition from Xu (2003a). He defined the topic in the topic structure as a marked topic, which can be achieved through two main methods. The first is changing word order, where nouns or prepositional phrases are moved from their unmarked syntactic position to the beginning of the sentence. The second is the addition of grammatical markers, which includes adding a pause after the topic or using modal particles such as ‘啊’ (a), ‘吧’ (ba) and ‘嘛’ (ma). In the absence of any marker, the subject can be considered an unmarked topic. In this study, we annotate the antecedent as marked topic, unmarked topic or non-topic (Npsig).
2.4. Text type
In multifactorial research, it is essential to annotate phenomena in the data thoroughly, both to explore potential motivators comprehensively and to minimize interference from other motivators (J. Xu, 2020). One supplementary motivator analysed in this study is text type, which relates to the degree of formality of the register (Fludernik, 2000). Our data encompass various discourse types, including news, literature, dialogue and scientific discourse (annotated as TextType).
3. Theoretical framework
In this section, we analyse accessibility theory and propose a supplemented framework for the analysis in this research, based on accessibility theory and the previous studies reviewed in Section 2.
Accessibility is the ease of retrieving specific entities from a speaker’s memory system during discourse generation or comprehension (Ariel, 1988). Ariel further argues that natural languages primarily provide speakers with the means for processing the accessibility of the referent, which is the determining principle for the choice of anaphoric forms (Ariel, 1990: 68–69). She categorizes referential expressions based on accessibility into three levels: low-level markers (e.g. proper noun phrases and definite noun phrases, which refer to entities that are not present in discourse but exist in encyclopaedic knowledge); mid-level markers (e.g. demonstrative expressions, referring to specific, visible entities in the current context); and high-level markers (e.g. pronouns and gaps, referring to previously mentioned entities) (Ariel, 1988). The level of accessibility is affected by cognitive motivators such as consistency, distance, competition and salience (Ariel, 1990: 28–29). Consistency refers to whether the antecedent and the anaphor are in the same discourse segment or paragraph, perspective, viewpoint or cognitive framework; distance pertains to the distance between the antecedent and the anaphor; competition refers to the number of referential expressions that can compete as antecedents; salience concerns the prominence of the antecedent within the discourse or paragraph (Ariel, 1990).
These four motivators analysed by Ariel (1990: 28–29) could include lower-level motivators, which compete in endowing the referred entity with different levels of accessibility in speakers’ mental space. According to the analysis in Section 2, motivators at the syntactic, semantic and discoursal levels can be classified under the aforementioned cognitive motivators, with TextType as the supplementary motivator, as shown in Figure 1. Dis is annotated to signify the distance between antecedents and anaphors. Competition originally concerns the number of competitors (Ariel, 1990: 28). It is argued that the competition between preferred antecedents and other entities should also be attributed to the semantic features of competitors, such as animacy (Fukumura & van Gompel, 2010; Lu, 2002; X. Xu, 2020), gender (Arnold & Griffin, 2007; Chiriacescu, 2015; Fukumura et al., 2011) and visual cues (similarity between the antecedent and competitor) (Fukumura et al., 2010, 2011). Based on previous research, this study focuses on the competition between antecedents and their competitors in number and semantic features. Comp, Npani, Reani and PotRefe are competition-related motivators. Salience, as termed by Ariel (1990: 28), concerns the importance of topicality in choosing an antecedent. Studies have approached the importance of topicality in assigning antecedents from the perspective of syntactic structure and topic structure (Xu, 2003a; Zhang, 2019), topic continuity and predictability (Chen & Xie, 2018; Kaiser et al., 2011; Weatherford et al., 2021).
Therefore, we classify Npsig, Pred, NpSyn and Rep as salience-related motivators, because they contribute to the prominence of the antecedent. It should be noted that in this study, the antecedent and anaphor are in the same discourse segment, and the influence of consistency is beyond the scope of this study.

Figure 1. Grouped motivators influencing anaphoric forms.
4. Methods
To get a full profile of the motivators that influence the choice of anaphoric forms, this research employed a multifactorial analysis (cf. Figure 2).

Figure 2. Research process.
4.1. Data retrieval
The corpus used in this study comes from the Beijing Language and Culture University Corpus Center (BCC Corpus Center). It contains about 15 billion characters with a balanced distribution and gives a comprehensive picture of contemporary language use in society, which meets the requirements of this research.
Previous studies frequently adopted validated implicit causality verbs to manipulate stable predictability through their combination with conjunctions (e.g. Weatherford et al., 2021). This study applied the same method by employing verbs known for implicit causality (e.g. Ferstl et al., 2011).
Step 1: Implicit causality verbs were identified from previous literature (e.g. Garnham et al., 2021; Zhang, 2019), and 52 verbs were selected, as displayed in the first column of Table 1.
Table 1. Proportion of anaphoric choice

Step 2: Each verb was set as a query to retrieve 8,000 concordances in the BCC.
Step 3: Noise was eliminated. The types of noise included concordances in which the given word was used as a non-verb, and those that were not causals.
Step 4: After filtering the noise, a total of 2147 concordances constituted the research materials.
4.2. Questionnaires for predictability
According to previous research, predictability has been measured effectively by questionnaires (e.g. Fukumura & van Gompel, 2010; Weatherford et al., 2021; Zhang, 2019). In this study, we adopted this method to obtain predictability values.
4.2.1. Questionnaire design and data collection
Two questionnaires were designed. The materials in the first questionnaire consisted of a sentence followed by a causal clause, as in Example (4a): participants were asked to complete the causal clause according to their understanding, using a noun phrase from the preceding sentence as the subject of the continuation. The second questionnaire consisted of a sentence followed by a result clause, as shown in Example (4b). Its design and task description were otherwise consistent with those of the first questionnaire.

A total of 50 undergraduates participated in this experiment. All were native Chinese speakers, and none reported a history of neurological or cognitive impairment. One participant was excluded for not producing any usable responses, leaving 49 participants (25 females) and 98 questionnaires for analysis.
4.2.2. Analysis and results of questionnaires
First, we annotated the continuation results from Questionnaire 1 and Questionnaire 2 following Zhang (2019): if participants chose the noun phrase in the subject position as the antecedent, the response was recorded as 1; if they chose the noun phrase in the object position, it was recorded as 2. Second, the proportion of anaphoric choices in both questionnaires was calculated. Finally, the predictability annotation indices were summarized based on the results. For example, in Questionnaire 1, if 50%–60% of participants chose 1 (the noun phrase in the subject position), the noun phrase in the subject position was considered to have weak predictability and, when serving as the antecedent, was labelled Pred = min; if 60%–80% chose 1, it was considered to have moderate predictability and was labelled Pred = mid; if more than 80% chose 1, it was considered to have strong predictability and was labelled Pred = max; if fewer than 50% chose 1, it was considered to have no predictability and was labelled Pred = none. The annotation of predictability is listed in Table 1.
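The labelling rule above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study: the function names are hypothetical, and the handling of the exact boundary values (50%, 60% and 80%) is an assumption, since the text does not say to which band they belong.

```python
def share_of_subject(responses):
    """Share of responses coded 1 (subject NP chosen); responses are coded 1 or 2."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(1 for r in responses if r == 1) / len(responses)


def pred_label(share_subject):
    """Map the share of participants choosing the subject NP (0 to 1) to a Pred level.

    Assumed boundary handling: exactly 50% counts as 'min',
    exactly 60% and exactly 80% count as 'mid'.
    """
    if not 0.0 <= share_subject <= 1.0:
        raise ValueError("share_subject must lie between 0 and 1")
    if share_subject < 0.50:
        return "none"   # fewer than 50% chose the subject NP
    if share_subject < 0.60:
        return "min"    # weak predictability
    if share_subject <= 0.80:
        return "mid"    # moderate predictability
    return "max"        # more than 80%: strong predictability


# Hypothetical responses for one verb: three subject choices, one object choice.
example_share = share_of_subject([1, 1, 2, 1])
example_label = pred_label(example_share)
```

With the hypothetical responses above, 75% of the choices fall on the subject noun phrase, so the sketch assigns the moderate level.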
4.3. Data annotation
4.3.1. Coding terminology
A total of ten motivators and one dependent variable were included in this research, and the tagging scheme of this study is presented in Table 2. The first column displays the general linguistic layer on which the motivators are based. The second column shows the motivators examined in our research, with the items in parentheses being their tagging terms in the process of annotation and modelling. The third column represents the levels of motivators in the form of tagging items. The last column lists the detailed specification of levels.
Table 2. Motivators examined in the choice of referring expression

4.3.2. Encoding process
Among the eleven variables, TextType was tagged by machine and the others were tagged manually, with manual checking and confirmation. To ensure the validity of the annotations, we randomly selected 430 corpus entries and invited annotators to annotate them independently according to the tagging scheme. We used the {psycho} package (v2.4.6–26, Makowski, 2018) in R (version 4.2.3) to calculate Cohen’s kappa, which was greater than 0.8 for each motivator, indicating a high level of inter-rater agreement and coding reliability.
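For readers unfamiliar with the statistic, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance. The following is a minimal Python sketch of that computation, not the R routine used in the study; the function name and the toy labels are hypothetical, and the sketch assumes exactly two annotators.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two annotators' marginal label shares.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotation of four antecedents for syntactic position.
annotator_1 = ["sub", "sub", "obj", "obj"]
annotator_2 = ["sub", "sub", "obj", "sub"]
toy_kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy example the annotators agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5, well below the 0.8 threshold reported above.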
4.4. Statistical analysis
This study employed machine learning methods for data analysis. The logistic regression model identified significant motivators and their influence direction. Following this, random forests and eXtreme Gradient Boosting (XGBoost) were utilized to measure motivator strength, with results contributing to an expert voting system for ranking.
The multifactorial logistic regression model is a statistical method for examining the relationship between independent variables and a dependent variable. Because our dependent variable, anaphoric form, has three levels (zeroform, proform and nounform), we applied a multinomial logistic regression model to identify significant motivators and the direction of their influence. For variable selection and optimization, we adopted a stepwise selection method, using the {nnet} package (v7.3–19, Venables & Ripley, 2023) in R (version 4.2.3).
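To make the model concrete, the sketch below shows how a multinomial logistic regression turns two linear predictors into probabilities for the three forms. It is an illustration in Python rather than the {nnet} fit itself; the function name is hypothetical, and proform is taken as the reference category, as in the results reported in Section 5.2.

```python
import math


def form_probabilities(eta_noun, eta_zero):
    """Probabilities of the three anaphoric forms in a baseline-category logit model.

    eta_noun and eta_zero are the linear predictors, i.e. the log-odds of
    nounform and zeroform relative to the proform baseline.
    """
    exp_noun = math.exp(eta_noun)
    exp_zero = math.exp(eta_zero)
    denominator = 1.0 + exp_noun + exp_zero  # the baseline contributes exp(0) = 1
    return {
        "proform": 1.0 / denominator,
        "nounform": exp_noun / denominator,
        "zeroform": exp_zero / denominator,
    }


# With both log-odds at zero, the three forms are equally likely.
equal_case = form_probabilities(0.0, 0.0)
# A positive log-odds for zeroform shifts probability away from the baseline.
shifted_case = form_probabilities(0.0, 1.0)
```

A positive coefficient for a level therefore raises the probability of that form relative to the pronoun baseline, and a negative one lowers it.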
The random forests model builds multiple independent decision trees through bootstrap sampling and aggregates their predictions via voting. We used the {scikit-learn} package (v1.5.1, Pedregosa et al., 2011) in Python (version 3.11.9), dividing the dataset into 70% for training and 30% for testing. To balance the number of decision trees against computational cost, we set the number of trees to 1000 (ntree = 1000) and determined each split using five randomly selected variables (mtry = 5). Model accuracy was evaluated on the test set. Because random forests are a bagging algorithm, their results can vary across implementations, so XGBoost was also employed for a more objective evaluation of motivator strength. This algorithm enhances the gradient boosting decision tree (GBDT) model. Using the {xgboost} package (v2.1.3, Chen & Guestrin, 2016) in Python (version 3.11.9), we processed the training data for model construction, including residual calculation, new model training and hyperparameter tuning to optimize results. An expert voting system then ranked variable importance based on the average rankings from the random forests and XGBoost analyses.
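The rank-averaging step of the expert voting system can be sketched as follows. This is one plausible reading of the description above, not the authors’ script: the function names are hypothetical, the importance values are invented placeholders rather than the values in Table 7, and ties are left in input order.

```python
def rank_descending(importance):
    """Rank variables from most (rank 1) to least important."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def average_rank(importance_a, importance_b):
    """Average the ranks from two models; both dicts must share the same keys."""
    rank_a = rank_descending(importance_a)
    rank_b = rank_descending(importance_b)
    mean_rank = {name: (rank_a[name] + rank_b[name]) / 2 for name in rank_a}
    # A lower mean rank means a more important variable.
    return sorted(mean_rank.items(), key=lambda item: item[1])


# Placeholder importance scores for three motivators (NOT the study's results).
rf_importance = {"Pred": 0.30, "Dis": 0.25, "Rep": 0.10}
boost_importance = {"Pred": 0.40, "Dis": 0.15, "Rep": 0.20}
voted = average_rank(rf_importance, boost_importance)
```

Averaging ranks rather than raw importance values keeps the two models comparable even though their importance scores are on different scales.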
5. Results
5.1. Data distribution
The results of the crosstab analysis are listed in Table 3, which illustrates the distribution of the data. The initial column presents the motivators, while the subsequent column depicts the corresponding levels of each motivator. The third to fifth columns present the cross-data of the dependent variable and the motivators. It is evident that the distribution of noun phrases, pronouns and zero pronouns is markedly imbalanced, with 778 instances of zero pronoun, 1027 instances of pronoun and 342 instances of noun phrase.
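The imbalance can be expressed as percentages with a short calculation. The sketch below uses only the three counts reported above; the variable names are illustrative.

```python
# Counts of each anaphoric form as reported in the text.
counts = {"zero pronoun": 778, "pronoun": 1027, "noun phrase": 342}

total = sum(counts.values())

# Percentage share of each form, rounded to one decimal place.
shares = {form: round(100 * n / total, 1) for form, n in counts.items()}
```

The counts sum to the 2147 concordances in the data set, with pronouns accounting for 47.8%, zero pronouns for 36.2% and noun phrases for 15.9%.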
Table 3. Results of crosstab analysis (total: 2147)

5.2. Results of logistic regression model
The classification outcomes are presented in Table 4, and the specific results of logistic modelling are displayed in Table 5; the accuracy of the model is 79%, and its C value is 0.90. In accordance with the established criteria for evaluating the predictive power of a model in linguistic research, a C value exceeding 0.8 signifies that the model exhibits robust predictive capacity, which we take to suggest that overfitting and multicollinearity are not a major concern. The results presented in Table 5 indicate that nine variables contribute significantly to the model: NPAni, NPSyn, Comp, Rep, PotRefe, Reani, Pred, NpSig and Dis.
Table 4. Results of classification

Table 5. Results of logistic regression modelling

Note: * p < 0.05, ** p < 0.01, *** p < 0.001.
The multinomial logistic regression analysis reveals significant relationships between predictor variables and the likelihood of belonging to the NounForm and ZeroForm categories, with Proform as the baseline. For NPAni, Pro significantly decreases the log-odds for NounForm (B = −1.888, p = 0.031), while Movnoun significantly decreases the log-odds for ZeroForm (B = −16.180, p < 0.001). Regarding Comp, high antecedent competition (Comp = 3) significantly increases the log-odds for NounForm (B = 3.977, p = 0.001) but decreases that for ZeroForm (B = −15.873, p < 0.001). High repetition of antecedents (Rep = 4) raises the log-odds for ZeroForm (B = 2.960, p < 0.001). As for NpSyn, the sub level increases the log-odds for ZeroForm (B = 1.999, p < 0.001). For PotRefe, noexpot significantly decreases the log-odds for NounForm (B = −20.626, p < 0.001) but increases that for ZeroForm (B = 1.291, p = 0.015) compared to Proform. In terms of NpSig, the noexsig level significantly boosts the likelihood of NounForm (B = 1.220, p < 0.001), while the xxexsig level significantly increases the likelihood of ZeroForm (B = 2.310, p < 0.001). Regarding Dis, short distances (Dis = 1, 2) notably decrease the log-odds for ZeroForm (B = −1.697, p < 0.001; B = −1.329, p < 0.001), and long distances increase the log-odds for NounForm (B = 2.345, p < 0.001). Additionally, for Pred, the min level significantly decreases the log-odds for ZeroForm (B = −1.227, p = 0.008), and the none level significantly increases the log-odds for NounForm (B = 2.462, p < 0.001) and decreases the log-odds for ZeroForm (B = −2.096, p < 0.001).
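Because the coefficients above are log-odds relative to the pronoun baseline, exponentiating them gives odds ratios, which are often easier to read. The sketch below applies this standard conversion to two of the coefficients reported above; the function and variable names are illustrative, and the conversion is an interpretive aid rather than an additional analysis from the study.

```python
import math


def odds_ratio(b):
    """Convert a log-odds coefficient B into an odds ratio exp(B)."""
    return math.exp(b)


# NpSyn, sub level, ZeroForm versus Proform: B = 1.999 (from the text above).
or_subject_zero = odds_ratio(1.999)

# Pred, min level, ZeroForm versus Proform: B = -1.227 (from the text above).
or_predmin_zero = odds_ratio(-1.227)
```

On this reading, a subject antecedent multiplies the odds of a zero pronoun over a pronoun by roughly 7.4, whereas weak predictability multiplies them by roughly 0.29, other variables being held at their reference levels.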
5.3. Results of random forests and XGBoost
Table 6 demonstrates the predictive power of the random forests and XGBoost models, and Table 7 presents the variable importance values and ranks based on the expert voting system. In Table 7, the third and fifth columns (i.e. boost, rf) list the variable importance yielded by XGBoost and random forests, and the fourth and sixth columns show the corresponding ranks. The last column gives the average rank across the two models. Based on the two tables, Pred is the most important variable.
Table 6. Results of classification

Table 7. Variable importance exploration

6. Discussion
6.1. Distributions of different anaphoric forms
Results presented in Tables 3 and 5 demonstrate that the distribution of anaphoric forms is not balanced.
Considering the linguistic characteristics, the distribution of anaphoric forms is found to be influenced by motivators. Statistically, pronouns occur with greater frequency than noun phrases and zero pronouns. The results indicate the diversity of pronoun usage, under the principle that it should not cause ambiguity or hinder communication. Specifically, pronouns are more likely to occur in contexts where antecedents have strong predictability (Pred = max), a moderate level of repetition (Rep = 2, 3) and a short distance from their anaphors (Dis = 1, 2), which is in line with Xu’s (2003: 137–138) findings. He argued that when using pronouns, authors generally tend to have a certain number of clauses in between and a certain degree of continuity. Zero pronouns, defined as reduced expressions of pronouns, are used less frequently than pronouns. As can be seen in Tables 3 and 5, zero pronouns tend to be used in contexts where antecedents have high predictability, a high degree of animacy (NpAni = pro, appnoun), short distance and a high level of repetition (Rep = 3, 4). Among the defined motivators, Pred is the most significant factor according to Table 7. Noun phrases exhibit the lowest frequency of occurrence in our data. They tend to appear in contexts where antecedents have low predictability (Pred = none), a low level of animacy (NpAni = nomovnoun), low salience (Npsig = noexsig) and potential referential interference.
From the perspective of discourse, noun phrases, pronouns and zero pronouns emerge in an alternating sequence, thereby forming an anaphoric chain. As illustrated in Table 3, as the distance between the antecedent and its anaphor increases, the repetition of the antecedent increases correspondingly. This leads to the formation of an anaphoric chain throughout the discourse. This chain can be conceptualized as a network of scattered referential expressions that, like a string, connects various nodes within the text and thereby facilitates coherence (Xu, 2003: 104–106). In Example (5), there are altogether eight anaphoric nodes. Within the anaphoric chain of the discourse, the distribution of nouns, pronouns and zero pronouns exhibits a cyclical pattern. A character is typically introduced through a noun, followed by one or more pronouns or zero pronouns, after which noun phrases appear again. This leads to a pattern in which complex forms and simplified forms alternate, intertwining the clauses of the discourse into a coherent whole.

In summary, in terms of linguistic characteristics, pronouns occur more frequently than noun phrases and zero pronouns and are subject to fewer restrictions in usage. From the perspective of discourse, different anaphoric forms appear alternately, connecting clauses to form a coherent discourse.
6.2. The influence of significant motivators
This section discusses the influence of significant motivators on the choice of anaphoric forms, based on the features and functions of motivators examined and illustrated in Figure 1.
6.2.1. The influence of competition-related motivators
As shown in Table 7, the competition-related motivators, namely Comp, NpAni, Reani and PotRefe, significantly influence the choice of anaphoric forms in natural discourse.
In terms of the quantitative relationship between the antecedent and its competitors, the number of competitors (Comp) contributes to the choice of anaphoric forms. When the number of competitors is low, the anaphoric expression tends to take a simplified form. Based on Tables 3 and 5, when there is no competitor, anaphoric expressions tend to take the zero pronoun form, as shown in Example (6). However, when there are three or more competing entities, the frequency of noun phrases increases and surpasses that of pronouns and zero pronouns, as shown in Example (7).


From the perspective of the semantic relationship between the antecedent and its competitors, potential referential interference (PotRefe) is an important motivator. As discussed above, potential referential interference refers to situations where the antecedent and its competing elements have similar semantic information or consistent gender information. Results show that when there is a competing element that causes ambiguity, the anaphoric expression is often a noun phrase, which is consistent with previous research that speakers will add modifiers or use more specific referential expressions to avoid ambiguity and ensure successful communication (Brown-Schmidt & Tanenhaus, Reference Brown-Schmidt and Tanenhaus2006).
When there is no semantic ambiguity between the antecedent and the anaphoric expression, motivators such as the animacy of the antecedent (NpAni) and the relation of animacy between antecedents and competitors (Reani) exert a significant influence on the degree of accessibility. As shown in Tables 3 and 5, antecedents with a high degree of animacy, such as pronouns, typically lead to the use of high-accessibility markers, such as pronouns and zero pronouns, as in Example (8). However, for lower-animacy antecedents, like inanimate nouns, noun phrases and pronouns are the preferred anaphoric forms, as shown in Example (9). Regarding the animacy of competitors, when antecedents and competitors differ in animacy, high-accessibility markers are preferred, aligning with findings that antecedents with higher animacy are more accessible (Fukumura & van Gompel, Reference Fukumura and Van Gompel2010).


6.2.2. The influence of salience-related motivators
NpSig, Pred, Rep and NpSyn are the salience-related motivators that contribute to the choice of anaphoric forms.
Regarding predictability (Pred), results indicate a positive correlation between the antecedent’s predictability and the simplification of its anaphoric form (see Tables 3 and 5). When the antecedent is highly predictable, the anaphor tends to be a pronoun or a zero pronoun. When the antecedent is less predictable, the anaphoric form tends to be a noun phrase. The dominance of predictability is motivated by the principle of language economy, which emphasizes that speakers and listeners favour linguistic forms that minimize cognitive processing costs while ensuring successful information transfer (Zipf, Reference Zipf1935: 19; Haiman, Reference Haiman1983). When an entity becomes highly predictable, it is activated in the mental space of both speakers and listeners. Therefore, to minimize the cognitive cost, more efficient, reduced forms like pronouns or zero pronouns are preferred. This aligns with extensive research showing that speakers systematically choose less informative expressions to conserve cognitive resources (e.g. Mahowald et al., Reference Mahowald, Fedorenko, Piantadosi and Gibson2013; Weatherford & Arnold, Reference Weatherford and Arnold2021). From a production standpoint, speakers pre-instantiate this referent in their mental space and reduce the cost of expressions for an expected referent (Jaeger, Reference Jaeger2010; Kaiser et al., Reference Kaiser, Li, Holsinger, Hendrickx, Devi, Branco and Mitkov2011). From a comprehension standpoint, listeners can easily process a reduced form because the predictable entity is already highly accessible in their mental space. To illustrate, in Example (10), the verb xi3 ai4 ‘like’ is an NP2 verb when followed by a causal, which means that the use of this verb enhances the predictability of shi1 ge1 ‘the poem’ in this context. The NP2, shi1 ge1 ‘the poem’, has a high probability of being mentioned again in the subsequent sentence, making it more accessible in memory structure.
Therefore, in the following clause, the anaphor is realized as a high-level accessibility marker. Through experience and observation of language use, both speakers and listeners collaborate in constructing expectations for entities in specific contexts, so that the reduced form of the anaphor can be understood (Demberg et al., Reference Demberg, Kravtchenko and Loy2023; Bott & Solstad, Reference Bott and Solstad2023). Furthermore, the strength of this effect in our study is likely amplified by the rich context of the corpus data. Previous experimental research has demonstrated that rich, authentic contexts could strengthen the cognitive representation of referents, making the effects of predictability more readily observable (Weatherford & Arnold, Reference Weatherford and Arnold2021). Therefore, by examining anaphoric choice within natural linguistic data, this study captures the dominant role of predictability.

Results also demonstrate that the topicality of the antecedent influences the choice of anaphoric forms, as indicated by NpSig. For antecedents that are topics, simplified expressions are preferred; conversely, for antecedents that are not topics, noun phrases are more probable. This finding concurs with Ariel’s idea that reduced forms are high-level accessibility markers, relying on the context for comprehension and conveying salient information in short-term memory, especially the current topic (Ariel, Reference Ariel1990: 55–56). Moreover, in addition to content-related salience (i.e. topic), structural salience (i.e. topic structure) also significantly influences the choice of anaphoric forms. As illustrated in Example (11), a topic can be structurally marked through changes in word order or the addition of grammatical markers, thereby enhancing its salience within the syntactic and structural context. This structural salience enhances the accessibility of the topic and promotes the use of high-accessibility markers such as pronouns and zero pronouns.

Additionally, repetition (Rep) has been demonstrated to influence the choice of anaphoric forms. As evidenced in Tables 3 and 5, noun phrases and pronouns are preferred with minimal repetitions of the antecedents.
Regarding the syntactic positions of antecedents, Tables 3 and 5 indicate that when an antecedent occupies the subject position, the order of preference for anaphoric forms is: zero pronouns > pronouns > noun phrases. Conversely, when the antecedent is in the object position, the order of preference is: noun phrases > pronouns > zero pronouns. These findings indicate that the syntactic position of the antecedent exerts an influence on the choice of anaphoric forms, aligning with those reported by Fukumura and van Gompel (Reference Fukumura and Van Gompel2010) and Kehler (Reference Kehler and Rohde2013). Furthermore, this suggests that in Mandarin, antecedents in the subject position are more accessible than those in the object position. This is inconsistent with studies recognizing the effect of syntactic position and denying the effect of predictability (e.g. Arnold, Reference Arnold2001; Guan & Arnold, Reference Guan and Arnold2021). In contrast, the present study shows the influence of both predictability and syntactic position. Although our results suggest that predictability (Pred) has a significant advantage over syntactic position, it is important to note that the ranking itself is exploratory in nature. At this stage, it can be concluded that both syntactic position and predictability significantly affect the choice of anaphoric forms.
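Preference orders of this kind can be read off a cross-tabulation of antecedent position against anaphoric form. The following Python sketch illustrates one way to do so. The observations are invented for illustration and are not the counts reported in Tables 3 and 5; the function name `preference_order` and the labels `"zero"`, `"pronoun"` and `"np"` are hypothetical.

```python
from collections import Counter

# Invented toy observations (antecedent position, anaphoric form).
# These are NOT the study's data; they only illustrate the computation.
observations = (
    [("subject", "zero")] * 5
    + [("subject", "pronoun")] * 3
    + [("subject", "np")] * 1
    + [("object", "np")] * 4
    + [("object", "pronoun")] * 2
    + [("object", "zero")] * 1
)


def preference_order(obs, position):
    """Rank anaphoric forms from most to least frequent for one position."""
    counts = Counter(form for pos, form in obs if pos == position)
    return [form for form, _ in counts.most_common()]


subject_order = preference_order(observations, "subject")
object_order = preference_order(observations, "object")
print(" > ".join(subject_order))
print(" > ".join(object_order))
```

A ranking by raw frequency is only descriptive; it does not by itself show that the difference between positions is statistically significant.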
6.2.3. The influence of distance
This study demonstrates that distance influences the choice of anaphoric forms. A combination of data distribution and analysis results from Tables 3 and 5 reveals that the preferred anaphoric form shifts from simplified to more complex and then back to simplified forms. An anaphoric distance of 0 indicates that the antecedent and the anaphor are situated within the same clause; in this case, the zero pronoun is the predominant choice. As the distance between the antecedent and the anaphor increases, the probability of choosing noun phrases rises. This outcome aligns with the accessibility theory, which posits that the shorter the linear distance between the antecedent and the anaphor, the more likely the entity remains in a highly activated state within memory, leading to a preference for simplified forms; as the distance increases, the accessibility of the antecedent decreases, and speakers tend to use more explicit forms, such as noun phrases, to ensure clarity.
To conclude, competition-related motivators, salience-related motivators and distance can significantly influence the choice of anaphoric forms, by influencing the accessibility of referred entities.
6.2.4. The interplay of motivators
While our model primarily quantifies the main effects of each motivator, the interaction between these motivators is also important for a comprehensive understanding of anaphoric form choice.
First, a notable interaction emerges between distance and repetition. Our results indicate that the frequency of zero pronouns can increase even when the intervening distance surpasses three clauses. Table 3 demonstrates that high repetition (Rep = 4) becomes more probable as the intervening distance grows, peaking when the distance is four clauses (Dis = 4). This finding suggests that the increase in distance reduces the accessibility of the antecedent, but the increase in repetition continuously enhances accessibility, leading to a pattern where simplified forms intertwine the clauses of the discourse into a coherent whole.
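One way to inspect such a pattern is to compute, for each distance value, the share of instances with high repetition. The Python sketch below illustrates this under stated assumptions: the records are invented and are not the study's annotations, and the function name `share_of_high_repetition` is hypothetical.

```python
from collections import defaultdict

# Invented toy records of (Dis, Rep); NOT the study's annotations.
records = [
    (1, 1), (1, 2), (1, 2), (1, 4),
    (2, 2), (2, 3), (2, 4),
    (4, 3), (4, 4), (4, 4), (4, 4),
]


def share_of_high_repetition(rows, high=4):
    """Return {distance: proportion of rows whose Rep equals `high`}."""
    totals = defaultdict(int)
    highs = defaultdict(int)
    for dis, rep in rows:
        totals[dis] += 1
        if rep == high:
            highs[dis] += 1
    return {dis: highs[dis] / totals[dis] for dis in sorted(totals)}


shares = share_of_high_repetition(records)
peak_distance = max(shares, key=shares.get)
print(shares, peak_distance)
```

In this toy data the share of Rep = 4 rises with distance, which mirrors the direction of the pattern described above without reproducing its actual magnitudes.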
Second, results show the interaction between competition and salience. As shown in Table 3, the presence of potential referential interference (PotRefe) or multiple competitors (Comp ≥ 3) strongly motivates speakers to choose noun phrases to ensure clarity in communication. However, this tendency is modulated by the salience of the antecedent. Results indicate that predictability (Pred) is the most important factor across all models. This implies that even in a highly competitive environment with multiple referents, if an entity possesses high salience via its predictability (Pred = max) or is the explicitly marked topic (NpSig = xxexsig), its high salience can override the competitive pressure. In such cases, a speaker might still use a pronoun, as the listener can rely on the cue of salience to resolve possible ambiguities.
Third, text type serves to regulate the effects of other motivators. This regulatory role is directly related to the formality of the register. In highly formal and information-dense texts, such as news reports and technical discourses, clarity and precision are the guiding principles (Biber & Conrad, Reference Biber and Conrad2019: 121, 302). Consequently, noun phrases are more frequently used in tech and news than in talk and art, as shown in Table 3. In such contexts, even when an antecedent possesses a degree of salience, the speaker may still opt for a more explicit noun phrase to prevent any potential misunderstanding. Conversely, in informal and interactive contexts, speakers can rely on shared knowledge, real-time feedback and multimodal cues (Biber & Conrad, Reference Biber and Conrad2019: 89–91). As a result, they more frequently use reduced forms, such as pronouns and zero pronouns, with the confidence that listeners can resolve the reference using these salient cues (Weatherford & Arnold, Reference Weatherford and Arnold2021).
Taken together, the motivators investigated in this study constitute a dynamic context. Within this context, an entity’s accessibility is not static; it can be increased through repetition and salience or decreased by new competitors and distance. Consequently, the choice of an anaphoric form is a process of making rapid and likely unconscious trade-offs based on the context. When the context provides clear cues, such as high salience, low competition and short distance, reduced forms such as pronouns or zero pronouns are preferred. When the context is ambiguous due to low salience, high competition or long distance, more informative noun phrases are employed to ensure successful communication. This process is further modulated by text type, which sets the communicative norms. Ultimately, the choice of anaphoric form represents a sophisticated process of seeking a balance between informativity and economy within a given context.
6.3. A supplement to accessibility theory
It has been demonstrated that the accessibility theory can explain some influencing motivators, including the distance and the number of competitors. As discussed in Section 3, the theory also permits the incorporation of additional elements within the existing theoretical framework. Building on the significant motivators identified in Sections 6.1 and 6.2, this section attempts to modify the accessibility theory.
First, we modify competition in terms of the semantic relationship between the antecedent and its competitors and the semantic information of the antecedent, as shown in (11). The semantic information of antecedents refers to their animacy: antecedents with a low level of animacy favour low-level accessibility markers. The semantic relationship between antecedents and competitors signifies whether competitors share key semantic information with antecedents, such as animacy and gender. Competitors with similar semantic information can cause potential interference in processing antecedents and thus lower their accessibility. Second, the modification of salience is made in terms of structure-related salience and content-related salience, as shown in (12). Structure-related salience encompasses the topic structure, the syntactic position of the antecedent and the repetition of the antecedent. The accessibility of antecedents is enhanced when they are situated in a topic structure, occupy the subject position or exhibit a relatively high level of repetition. Content-related salience refers to the topicality and predictability of antecedents. When antecedents are topics or have a relatively high level of predictability, high-level accessibility markers are preferred. Third, regarding distance, our findings are in line with the accessibility theory, as summarized in (13). We confirm that longer distances between antecedents and anaphors contribute to a low level of accessibility.


To sum up, this study has tried to modify the accessibility theory mainly in terms of competition and salience, based on the results. Competition refers not only to the quantitative relationship but also to the semantic relationship between antecedents and their competitors. Salience encompasses structure-related salience and content-related salience.
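The modified framework can also be written down as a small lookup structure, which may be convenient when grouping annotated variables for analysis. The Python sketch below is one possible encoding and not the authors' implementation. The assignment of variable codes to subcategories follows the discussion above, except that NpSig is listed only under content-related salience; since topic structure is discussed alongside NpSig, this placement is a simplifying assumption. The names `ACCESSIBILITY_FRAMEWORK` and `category_of` are hypothetical.

```python
# Grouping of motivators transcribed from the discussion above.
# Assumption: NpSig is placed under content-related salience only, although
# topic structure (structure-related) is discussed together with it.
ACCESSIBILITY_FRAMEWORK = {
    "competition": {
        "numerical relationship": ["Comp"],
        "semantic information of antecedent": ["NpAni"],
        "semantic relationship with competitors": ["Reani", "PotRefe"],
    },
    "salience": {
        "structure-related": ["NpSyn", "Rep"],
        "content-related": ["NpSig", "Pred"],
    },
    "distance": {
        "linear distance": ["Dis"],
    },
}


def category_of(code):
    """Return (category, subcategory) for a motivator code, or None."""
    for category, subcategories in ACCESSIBILITY_FRAMEWORK.items():
        for subcategory, codes in subcategories.items():
            if code in codes:
                return category, subcategory
    return None


all_codes = [
    code
    for subcategories in ACCESSIBILITY_FRAMEWORK.values()
    for codes in subcategories.values()
    for code in codes
]
print(category_of("Pred"))
```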
6.4. Limitations and implications
As discussed above, the current study uses instances from a corpus to identify motivators influencing anaphoric form choice. However, several topics remain for future exploration.
First, our framework centred on antecedent properties, yet motivators related to the anaphor, such as its syntactic position, are also important. Furthermore, other antecedent-related motivators, such as focus and focus structure, should also be investigated. Future research can develop a more comprehensive model incorporating the features of both the anaphor and the antecedent to achieve a more holistic understanding.
Second, the measure of distance warrants further refinement. In this study, we have annotated the distance of every antecedent in a sentence, but this may not fully disentangle the effects of distance from memory reinforcement. Future studies could define distance solely from the most recent mention of an antecedent to delve into the interplay of distance and repetition. Furthermore, a comparative study contrasting these two annotation methods would be valuable for investigating the independent role of distance and repetition.
Finally, future work could explore more sentence types with effective measuring of predictability to provide a more comprehensive understanding of anaphoric forms.
7. Conclusion
This study investigates the distribution features of anaphors and the motivators that influence the choice of anaphoric forms in Mandarin and modifies the accessibility theory.
Altogether, ten motivators at the syntactic, semantic and discourse levels were grouped into categories of competition, salience, distance and text type. A multifactorial analysis demonstrates that NpAni, NpSyn, Comp, Rep, Reani, Pred, Dis, NpSig and PotRefe are significant motivators for the choice of anaphoric forms, thereby illustrating the intrinsic complexity of determining a referential expression. The variable importance ranking indicates that Pred is the most important motivator. Under the influence of these motivators, zero pronouns, noun phrases and pronouns exhibit significant differences in distribution. With these findings, we verify accessibility theory and modify the theoretical framework. Specifically, competition can be defined as the numerical relationship between antecedents and competitors, the semantic information of antecedents, and the semantic relationships between antecedents and competitors. Salience refers to content-related salience and structure-related salience. The former encompasses predictability and topicality, and the latter encompasses topic structure, syntactic position and repetition.
Data availability statement
The data are available at the Open Science Framework Repository: https://osf.io/7er2c/?view_only=fa87997afb254f979ae14d20ec6188d6.
Competing interests
The authors declare none.


