Empirical Chapters (Case Studies)

doi:10.1017/9781009425407.009

Part II Empirical Chapters (Case Studies)

5 The Application of Partial Least Squares Structural Equation Modeling in Studies of EMI An Empirical Demonstration and Application Prospect

Introduction

Structural equation modeling (SEM) is a popular approach to developing and employing theoretical models in predictive research within diverse fields and disciplines. With it, researchers can both construct and validate comprehensive models representing relationships of relative dependence and independence among latent constructs and observed variables through a combination of factor analyses and regressions.

Structural equation modeling is amenable to diverse data types (experimental, nonexperimental, cross-sectional, longitudinal) and is used to identify and map latent variables, that is, factors or constructs that are unobserved and thus not measured directly, through the interrelations of observed variables. Observed variables undergo reliability and validity testing and the resulting measurement model undergoes confirmatory factor analysis prior to model fit evaluation. This is followed by path modeling, whereby simultaneous multiple regression analyses take place and can include moderation, mediation, and other forms of variable relationships, with consideration of observed indicators revealing structural relationships of latent variables (for a recent succinct overview of procedural approaches and related statistical parameters, see Dash & Paul, Reference Dash and Paul2021).

Covariance-based SEM (CB-SEM) is one of the most prominent SEM methods and is widely adopted to explore latent phenomena. In education, this includes inquiry into students and teachers’ perceptions, attitudes, or intentions and their influence on learning or teaching outcomes.

Partial least squares SEM (PLS-SEM) is an alternative SEM approach that has been shown to offer certain advantages over CB-SEM. These include greater flexibility in acceptable sample sizes and data types, including nonnormal and nonmetric data (Hair Jr. et al., Reference Hair, Ringle and Sarstedt2011; Rigdon et al., Reference Rigdon, Sarstedt and Ringle2017), and advantages in handling complex mediation models (Sarstedt et al., Reference Sarstedt, Hair, Nitzl, Ringle and Howard2020) and measurement model confirmation, such as with the confirmatory composite analysis proposed by Hair Jr. et al. (Reference Hair, Howard and Nitzl2020) as a confirmatory factor analysis alternative. In a review article, Chin et al. (Reference Chin, Peterson and Brown2008) offer guidelines for SEM use in marketing that points to advantages of PLS-SEM. They note that PLS overcomes some SEM constraints, specifically by requiring only path specifications and measurement groupings into constructs, providing easier results interpretation based on regression as opposed to model fit indices, and embracing models with more indicators or latent variables relative to sample size (Chin et al., Reference Chin, Peterson and Brown2008, p. 295).

Partial least squares SEM has already been put to wide use in various disciplines, such as psychology, sociology, nursing, (Memon et al., Reference Memon, Cheah, Ting, Chuah and Cham2021), and business fields like marketing (Chin et al., Reference Chin, Peterson and Brown2008), accounting (Lee et al., Reference Lee, Petter, Fayard and Robinson2011), and information systems (Stieglitz et al., Reference Stieglitz, Dang-Xuan, Bruns and Neuberger2014), to name a few. More recently, educational researchers have utilized PLS-SEM to study issues such as learning beliefs (Ji et al., Reference Ji, Yue and Zheng2021), e-learning (see Lin et al., Reference Lin, Lee, Liang, Chang, Huang and Tsai2020, for review), behavior and achievement in flipped classrooms (Wang, Reference Wang2019), and educational service assessment (Schijns, Reference Schijns2021). Given both its operational and analytical attractiveness and possible pitfalls from misapplication, Ghasemy et al. (Reference Ghasemy, Teeroovengadum, Becker and Ringle2020) produced a review of PLS-SEM results reporting in higher education research, offering guidelines to facilitate effective, rigorous adoption in service of realizing its research promise.

CB-SEM and PLS-SEM Comparability and Debate: When and How to Employ PLS-SEM

There has been significant debate about the relative merits of CB-SEM and PLS-SEM, which has yielded scholarly exploration of and elaboration on the research potential of both approaches and their relative limitations and advantages (Dash & Paul, Reference Dash and Paul2021; Rigdon, Reference Rigdon2016; Rönkkö et al., Reference Rönkkö, McIntosh, Antonakis and Edwards2016; Sarstedt et al., Reference Sarstedt, Hair, Ringle, Thiele and Gudergan2016).

Comparisons of procedures and results of CB-SEM and PLS-SEM have been conducted over several decades in fields such as business and management, marketing, and information systems (see Rigdon et al., Reference Rigdon, Sarstedt and Ringle2017, for a summary of conceptual comparisons). More recently, scholars have shifted the focus of discussion to relative appropriateness and method application based on research design and context, as the methods “assume fundamentally different measurement philosophies” (Rigdon et al., Reference Rigdon, Sarstedt and Ringle2017, p. 7). Even when employing identical theoretical models, the differences in statistical models make them noninterchangeable (Rigdon et al., Reference Rigdon, Sarstedt and Ringle2017).

How should scholars decide to adopt the best approach? Several considerations have been offered to date. For example, to encourage appropriate application in marketing research, Hair et al. (Reference Hair, Matthews, Matthews and Sarstedt2017) review the PLS-SEM algorithm and advantages and limitations of the approach and offer several clear distinctions that should determine adoption of CB-SEM or PLS-SEM based on research objectives, proposing that CB-SEM should be used for theory testing and confirmation and PLS-SEM for prediction and theory development.

Elsewhere, Rigdon (Reference Rigdon2016) responded to misconceptions of PLS by clarifying that it is intended for use with composite-based models, in contrast to factor-based models explored with CB-SEM. The author suggests that mixed models may employ consistent PLS-SEM (PLSc-SEM) (Dijkstra & Henseler, Reference Dijkstra and Henseler2015a, Reference Dijkstra and Henseler2015b), which estimates a composite model while also converting estimates for a portion of parameters to bring it into accord with a factor model and was created in response to earlier debates about PLS efficacy.

For further guidance, Dash and Paul (Reference Dash and Paul2021) offer recommendations based on a comparison of CB-SEM, PLS-SEM, and PLSc-SEM, arguing for advantages of PLS-SEM for prediction and PLSc-SEM for model testing and analysis. Rather than attempting to reconcile PLS-SEM with factor model constraints, researchers seeking to “examine … data and evaluate many different configurations … in a situation that is ‘data-rich and theory-skeletal’” (Rigdon, Reference Rigdon2016, p. 13) can apply PLS-SEM for exploratory work for orienting research.

Ultimately, scholars should adopt a method that matches with the model type they wish to estimate and/or based on practical considerations such as variable understanding, data collection experience and instrument use, or stage in the research process (Rigdon, Reference Rigdon2016). To Chin et al. (Reference Chin, Peterson and Brown2008), the researcher’s responsibility is to report sufficient information for replication of analyses, estimates, and fits and provide a list for reporting that they consider essential in this regard (p. 291).

Given both the increasing appeal of PLS-SEM and the historical debate of its relative merits as well as the various dimensions of research design and practice that determine its value in producing rigorous scholarship, we offer this case study of its application in an area of educational research that has thus far seen limited application, that of research in English Medium Instruction (EMI) and related contexts. We proceed this chapter with a case study demonstrating its application in such a context and conclude with a discussion of several existing studies employing PLS-SEM in EMI research (including Choi, Reference Choi2021; Law & Fong, Reference Law and Fong2020; Peng, Reference Peng2022; Tai & Tang, Reference Tai and Tang2021; Zhang et al., Reference Zhang, Wang and Zhou2022) and implications for future inquiry.

Case Study Illustration: Predicting Student Satisfaction in EMI Courses

Over the past decades, the growth of English as a medium of instruction has become one of the most important developments in higher education in countries where English is not the primary language (Wächter & Maiworm, Reference Wächter and Maiworm2014). The two major forces driving EMI growth in higher education are globalization and internationalization (Tsou & Kao, Reference Tsou, Kao, Fenton‑Smith, Humphreys and Walkinshaw2017). EMI is also a result of the increasing emergent status of English as an academic lingua franca of the world (Galloway & Rose, Reference Galloway and Rose2015). Evidence shows English becoming more widely used as a language of university instruction in European countries (Ferguson, Reference Ferguson2007), even in nations with “big” languages like France and Germany (Ammon & McConnell, Reference Ammon and McConnell2002). Asian countries have also adopted the Englishization trend through EMI programs at universities (Yeh, Reference Yeh2014).

With this growth in EMI, how might researchers explore the influence of this trend? One way to consider efficacy is by operationalizing student responses in terms of attitude to EMI programming. Do they feel it meets their needs? The term “customer satisfaction” was first used in marketing to mean a judgment that a feature of a product or service, or the product or service itself, provides a satisfying level of fulfillment related to consumption (Reynoso, Reference Reynoso2010, p. 13). In educational contexts, students are referred as the primary customers of higher education institutions and student satisfaction is determined by the level of expectation and perceived performance (Fernandes et al., Reference Fernandes, Ross and Meraj2013). Regarding the student as customer perspective, there is a new ethical prerogative that students must become customers and therefore, as payers, can reasonably demand that their views be heard and adapted to (William, Reference William2002). In other words, in determining the success of tertiary institutions in delivering products to their primary customers (students), it is essential to measure customer (student) satisfaction (Alemu & Cordier, Reference Alemu and Cordier2017; Karakas, Reference Karakas2017).

As an illustration of the PLS-SEM method for this chapter, this study explores student perspectives regarding current EMI practices in the Taiwan higher education context. Specifically, this study examines student evaluations of their EMI courses and the relationship between EMI effectiveness factors and satisfaction of EMI courses. In that regard, this study asks the following questions:

Q1: How do students evaluate their satisfaction of EMI courses?
Q2: How does gender affect student satisfaction?

Measures

Independent variable: While a variety of factors influence EMI effectiveness, this study adopts an operationalization from Chu et al. (Reference Chu, Lee and Obrien2017), which measures three dimensions, including “student ability to use English to communicate in class – SATUE” (five items), “student perception of internationalization of the college – SPOIC” (nine items), and “student perception of course content – SPOCC” (five items), all which adopt a 5-point Likert scale (from 1 – strongly disagree to 5 – strongly agree).

Dependent variable: Satisfaction (SAT) is measured with eight questions, adapted from previous research by Cheng and Chau (Reference Cheng and Chau2015), to indicate students’ satisfaction toward EMI courses (anchored with the same 5-point scale).

Control variables: In this study, gender was employed as a moderating variable (1 = male, 2 = female).

Participants

A cross-sectional survey was used to obtain data using an online questionnaire distributed by the academic affairs office of a top university in Taiwan to students enrolled in EMI courses at the time of the survey. For the formal survey, 186 valid responses from Taiwanese students were collected for further analysis. The population included 80 male (43 percent) and 106 female (57 percent) students. Their fields of enrolment were business (17.2 percent), humanities and social sciences (33.3 percent), engineering (24.2 percent), and science (19.9 percent).

Normal Distribution of Data

We consider a couple of traditionally well-known (Henderson, Reference Henderson2006) methods for evaluating normality of experimental data, including the Shapiro–Wilk normality test and the Kolmogorov–Smirnov test. Table 5.1 shows that SATUE, SPOIC, SPOCC, and SAT do not fit normal distribution since the p-value is much less than 0.05.

Table 5.1 Tests of normality

	Kolmogorov–SmirnovFootnote ^a			Shapiro–Wilk
	Statistic	dfFootnote ^b	Sig.Footnote ^c	Statistic	dfFootnote ^b	SigFootnote ^c. (p)
SAT	0.105	186	0.000	0.958	186	0.000Footnote **
SATUE	0.094	186	0.000	0.972	186	0.001Footnote *
SPOIC	0.109	186	0.000	0.955	186	0.000Footnote **
SPOCC	0.093	186	0.001	0.939	186	0.000Footnote **

^a Lilliefors significance correction

^b df=degree of freedom

^c Sig=significance; * p<0.01 **p<0.001

Choosing PLS-SEM over CB-SEM for This Case Study

In certain cases, particularly when there is little prior knowledge of structural model relationships or construct measurements, or when the emphasis is more on exploration than confirmation, PLS-SEM is an attractive alternative to CB-SEM. Further, when CB-SEM assumptions are violated with regard to normality of distributions, minimum sample size, and/or maximum model complexity, or when related methodological anomalies occur in the process of model estimation, PLS-SEM is a good methodological alternative for theory testing. In such instances, researchers should consider both CB-SEM and PLS-SEM when deciding on the appropriate analysis approach for structural model assessment.

In this case, given the exploratory nature of our research design, small sample size (N = 186), and the violation of the assumption of normality of distributions, we chose to employ PLS-SEM rather than CB-SEM to perform the analysis.

Data Analysis with SmartPLS

For this study, we conducted the SEM data analysis using the software Smart partial least squares (SmartPLS). The evaluation of the PLS model was done by evaluating first an outer model and then an inner model. The outer model is a measurement model to predict the correlation between indicators or parameters estimated with their latent variables, while the inner model is a structural model for predicting causality between latent variables. In order to answer Q2, multiple group analysis (PLS-MGA) was also conducted and demonstrated.

Model Specification and Path Model Estimation

The PLS path model derived after PLS algorithm calculation with the independent variables (SATUE, SPOIC, and SPOCC), dependent variable (SAT), relationships among variables, and all indicators of variables are shown in Figure 5.1. The reflective nature of variables is indicated by arrow directions.

Figure 5.1 PLS path model

One path coefficient indicates that English communicative ability (SATUE) strongly affects satisfaction (SAT), and its value is 0.390. Another path coefficient shows perceptions of university internationalization (SPOIC) weakly affecting satisfaction, with a value of 0.278.

Evaluation of Measurement Models (Outer Model)

In this study, we used the measurement model approach for analysis to assess the reliability, composite reliability (CR), and average variance extracted (AVE) of the constructs. The factor loading values show the reliability of individual indicators of constructs. It is suggested that the value for factor loadings should exceed 0.6 for acceptability (Fornell & Larcker, Reference Fornell and Larcker1981). The results (Figure 5.1) showed that values for P3 and P16 are 0.324 and 0.551, respectively. These values were dropped for the sake of getting improved results. The factor loading values for all latent variable constructs after eliminating those two indicators are shown in Table 5.2.

Table 5.2 Measurement model

Construct	Item coding	Factor loading	Outer weight	CA	CR	AVE
Student perception of course content (SPOCC)				0.92	0.94	0.76
	P21	0.926	0.244
	P22	0.899	0.238
	P19	0.887	0.238
	P20	0.840	0.234
	P18	0.800	0.194
Student perception of internationalization of the college (SPOIC)				0.836	0.876	0.505
	P1	0.760	0.194
	P13	0.707	0.139
	P14	0.697	0.249
	P2	0.683	0.196
	P4	0.601	0.178
	P5	0.705	0.189
	P6	0.802	0.258
Student ability to use English to communicate in class (SATUE)				0.829	0.886	0.662
	P10	0.883	0.332
	P11	0.852	0.345
	P8	0.684	0.213
	P9	0.820	0.326
Student satisfaction (SAT)				0.928	0.941	0.667
	S1	0.863	0.189
	S2	0.792	0.147
	S3	0.868	0.176
	S4	0.828	0.161
	S5	0.704	0.132
	S6	0.819	0.135
	S7	0.830	0.139
	S8	0.816	0.142

Note: Average variance extracted (AVE); Cronbach’s alpha (CA); Composite reliability (CR).

To measure reliability, we used Cronbach’s alpha (CA) and CR. The results for CA and CR are presented in Table 5.2 for SPOCC (0.920, 0.940), SPOIC (0.836, 0.876), SATUE (0.829, 0.886), and SAT (0.928, 0.941). According to Hair et al. (Reference Hair, Ringle and Sarstedt2011), CA and CR values should be higher than 0.70, and thus values in this study were within an acceptable range.

We assessed both the Fornell and Larcker criterion and heterotrait–monotrait (HTMT) ratio to test the discriminant validity (Fornell & Larcker, Reference Fornell and Larcker1981). The HTMT ratio has recently gained preference over Fornell and Larcker (Baloch et al., Reference Baloch, Meng, Xu, Cepeda-Carrion and Bari2017; Henseler et al., Reference Henseler, Hubona and Ray2016). Fornell and Larcker’s tests in Table 5.3 exhibit values greater than the correlations among the variables. The HTMT ratio results are lower than the 0.090 threshold (see Table 5.4).

Table 5.3 Discriminant validity (latent variable correlation and square root of AVE)

	SPOCC	SAT	SATUE	SPOIC
SPOCC	0.872
SAT	0.670	0.816
SATUE	0.567	0.754	0.813
SPOIC	0.446	0.650	0.617	0.711

Table 5.4 HTMT

	SPOCC	SAT	SATUE
SPOCC
SAT	0.708
SATUE	0.635	0.839
SPOIC	0.468	0.708	0.719

Additionally, we examined the convergent validity to obtain AVE values, and all the values were greater than the 0.50 threshold (for SPOCC, SPOIC, SATUE and SAT, the AVE values were 0.76, 0.505, 0.662, and 0.667, respectively), as suggested by Henseler et al. (Reference Henseler, Hubona and Ray2016; see Table 5.2).

Furthermore, we examined the variance inflation factor (VIF) to assess the problem of multicollinearity in the data. Aiken et al. (Reference Aiken, West and Reno1991) suggested that the values of VIF must be less than 10, and this study found VIF values within the suggested range, depicting no issue of multicollinearity in the data (see Table 5.5).

Table 5.5 Saturated model results

Construct	R²	R² adjusted	VIF	f²
SPOCC			1.507	0.223
SAT	0.693	0.688
SATUE			1.950	0.293
SPOIC			1.651	0.121

Note: Variance inflation factor (VIF); effect size (f²); determination of coefficient (R²).

Evaluation of the Structural Model (Inner Model)

After finding the estimated model satisfied the outer model criteria, we then analyzed measurement testing the structural model (Figure 5.1) by looking at the value of R-square (R²) on the dependent variable. The results of R² values on variables based on the measurement results are shown in Table 5.5, where it can be seen that R² for student satisfaction was 0.693, indicating that the degree of influence of SPOCC, SATUE, and SPOIC on student satisfaction was 69.3 percent.

The value of f-square (f²) depicts the respective contributions of each construct to the relationships found in the outer model. It also reflects the significance of the influence of each construct on the others along with the degree of this influence. The results of f² analysis are shown in Table 5.5. The value of f² should be less than 0.02 in order to indicate a significant relationship. The relation of SPOCC and SAT shows an f² of 0.223, while SATUE and SAT have f² values of 0.293. Similarly, the relation of SPOIC to SAT shows an f² value of 0.121 (Figure 5.2).

Figure 5.2 Structural model

Bootstrapping is a method that is used to check and test the significance of a model. The value of t-statistics reflects significance of path coefficients (Ringle et al., Reference Ringle, Da Silva and Bido2015). The PLS-SEM findings show that SPOCC (β = 0.322, t = 5.480, p < 0.05), SATUE (β = 0.419, t = 6.131, p < 0.05), and SPOIC (β = 0.248, t = 3.985, p = <0.05) have positive effects on student satisfaction (β = −0.115, t = 0.062, p = 0.265) (Table 5.6).

Table 5.6 Path coefficients

	β	Mean	SD	t-value	p-value
SPOCC	0.322	0.319	0.059	5.480	***
SATUE	0.419	0.424	0.068	6.131	***
SPOIC	0.248	0.248	0.062	3.985	***

^*** p < 0.001, ^**p < 0.01, ^*p < 0.05 (two-tailed test)

MGA

The MGA tests whether predefined data groups have significant differences in their group-specific parameter estimates. SmartPLS provides the outcomes of three different approaches, based on the bootstrapping results from every group: Confidence Intervals (Bias Corrected), PLS-MGA, and Parametric Test.

The Confidence Intervals method was used to analyze the influence of gender as this method is a nonparametric significance test for the difference of group-specific results that builds on PLS-SEM bootstrapping results. A result is significant at the 5 percent probability of error level, if the p-value is smaller than 0.05 or larger than 0.95 for a certain difference of group-specific path coefficients. The PLS-MGA method (Henseler et al., Reference Henseler, Ringle and Sinkovics2009) as implemented in SmartPLS is an extension of the original nonparametric Henseler’s MGA method.

Table 5.7 presents the path coefficient and p-value for both gender groups: male and female. In this table, if the p-value is less than or equal to 0.05, or greater than or equal to 0.95, the path coefficient is considered to be significant. In both male and female groups, all the p-values are less than 0.05, so all the path coefficients are significant. Considering the female group, “student ability to use English to communicate in class” has the highest path coefficient value (SATUE – > SAT = 0.453), followed by “student perception of course content” (SPOCC – > SAT = 0.342). The factor of “student perception of internationalization of the college” has the lowest impact on student satisfaction (SPOIC – > SAT = 0.174). In the male group, “student ability to use English to communicate in class” has the strongest impact on student satisfaction (β = 0.370), whereas “student perception of course content” has a slight impact on trust (β = 0.336). The impact of “student perception of internationalization of the college” on student satisfaction is highly significant (β = 0.355).

Table 5.7 Gender relationship with student satisfaction

	Female			Male
	Path coefficients	t-value	p-value	Path coefficients	t-value	p-value
SPOCC – > SAT	0.342	4.177	0.000	0.336	4.596	0.000
SATUE – > SAT	0.453	6.036	0.000	0.370	4.125	0.000
SPOIC – > SAT	0.174	2.524	0.012	0.355	3.864	0.000

Table 5.8 highlights that the impacts of SPOCC, SATUE, and SPOIC on SAT were not significantly different between male and female groups. For SPOCC and SAT the male group is higher than the female group (SPOCC – > SAT = −0.006, SATUE – > SAT = −0.083), whereas for SPOIC and SAT the female group is higher than the male group (SPOIC – > SAT = 0.180). Overall, the male group is higher than the female group in satisfaction (Figures 5.3–5.5).

Table 5.8 Path coefficient difference for gender

	Path coefficients – difference (male vs. female)	t-value (male vs. female)	p-value (male vs. female)
SPOCC – > SAT	−0.006	0.052	0.959
SATUE – > SAT	−0.083	0.717	0.474
SPOIC – > SAT	0.180	1.613	0.109

Figure 5.3 Moderating effect of gender: SPOCC on SAT

Figure 5.4 Moderating effect of gender: SATUE on SAT

Figure 5.5 Moderating effect of gender: SPOIC on SAT

Discussion

In this chapter, we demonstrate the PLS-SEM approach to explore its potential application in EMI research. For a field in which topics of inquiry engage contexts of particular complexity, in offering advantages in service of researchers investigating smaller sample sizes, nonnormal or nonmetric data, complex mediation analysis, and confirmatory factor analysis (Choi, Reference Choi2021; Dash & Paul, Reference Dash and Paul2021; Lin et al., Reference Lin, Lee, Liang, Chang, Huang and Tsai2020), PLS-SEM has demonstrable potential advance future EMI research.

While PLS-SEM has been utilized widely in educational studies (Ghasemy et al., Reference Ghasemy, Teeroovengadum, Becker and Ringle2020; Ji et al., Reference Ji, Yue and Zheng2021; Lin et al., Reference Lin, Lee, Liang, Chang, Huang and Tsai2020; Schijns, Reference Schijns2021; Wang, Reference Wang2019), it has only recently been employed to conceptualize, explain, and predict the dynamic developments within teaching and learning through EMI. With EMI research notably interwoven with interdisciplinary work, we invite researchers to nurture space in their thinking for review of and reflection on (1) recent educational studies employing PLS-SEM, (2) recent EMI or language learning studies employing PLS-SEM, and (3) recent EMI studies that, while not using PLS-SEM, inspire questions and directions of inquiry for which PLS-SEM may appropriately be employed to provide explanation and illumination.

In educational research of the past decade, PLS-SEM has been applied to the examination of satisfaction or learning outcomes in higher education (Ghasemy et al., Reference Ghasemy, Teeroovengadum, Becker and Ringle2020) and the exploration of e-learning and online environments (Lin et al., Reference Lin, Lee, Liang, Chang, Huang and Tsai2020). Through the management lens, PLS-SEM is commonly used for examining service quality in an online education setting (Schijns, Reference Schijns2021). According to Lin et al.’s (Reference Lin, Lee, Liang, Chang, Huang and Tsai2020) review of e-learning, PLS-SEM can support researcher explorations of learning contexts as well as development of theory. For instance, a research team (Wang et al., Reference Wang, Wallace and Wang2017) employed PLS-SEM to explore a reward-based online design competition with a study design that included sociocultural factors among learners. Elsewhere, Wang (Reference Wang2019) used PLS-SEM to construct an online behavior model examining behaviors and achievement in flipped classrooms. PLS-SEM is particularly promising as an approach to addressing a growing need for examinations of interplays among diverse factors. For example, Prihandoko (Reference Prihandoko2021) explored online learning experiences by using PLS-SEM to examine learner perceptions, achievement, and motivation. In terms of general studies in education, using PLS-SEM is also considered a helpful research tool for examining learner belief structures (Ji et al., Reference Ji, Yue and Zheng2021) and self-efficacy (Zhao et al., Reference Zhao, Peng and Liu2021).

EMI studies involve dynamic teaching and learning in diverse educational contexts, necessitating analysis techniques that can manage complexity. Choi (Reference Choi2021) particularly found PLS-SEM advantageous in investigating EMI satisfaction, further highlighting the salience of the aspects of lesson preparation and teaching skills. However, PLS-SEM is still an emerging method in EMI research. It has been seen more commonly in previous research on language-oriented classrooms (Ahmadi-Azad et al., Reference Ahmadi-Azad, Asadollahfam and Zoghi2020), and the approach has much to offer in examining the relationships between factors implicated in how language is taught and learned (Tai & Tang, Reference Tai and Tang2021). Later we present a range of topics related to language learning and teaching that may benefit from deeper exploration with this approach when applied to EMI contexts in the future.

Current research interests center on evaluating student perceptions and proposing explanatory and predictive models to better understand the dynamism of student learning. A recent study from a group of applied linguistic researchers (Law & Fong, Reference Law and Fong2020) used PLS-SEM to investigate undergraduate students’ perceptions of their learning transfer in an English for Academic Purpose context. Their study demonstrated the advantage of PLS-SEM when investigating the relationships between perceived variables, such as how students perceived content, understanding, and applicability of their learning. They further established an empirical research model to uncover learning transfer in language education. Similarly, a group of foreign language researchers (Zhang et al., Reference Zhang, Wang and Zhou2022) utilized PLS-SEM to investigate student perceptions of course satisfaction in an English Speaking Instruction context in order to further develop a speaking system. They examined how students perceived their understanding of course cognition, learning behavior, and knowledge mastery. They also included teacher–student communication in the model to explore its mediation and realize better understanding of aspects of instructional environment and overall effectiveness.

In studying college English learning in Korea, Joo (Reference Joo2022) applied PLS-SEM to analyze students’ perceptions of English learning, including how learners perceived their motivation, interest, attitude, and satisfaction after engaging in different learning materials such as TED talk videos. In another study, Joo (Reference Joo2021) also conducted PLS-SEM, this time examining students’ perceived self-efficacy, such as in terms of their learning achievement, in an English writing class. There has also been some additional research examining factors that affect teaching. For instance, a recent study (Peng, Reference Peng2022) proposed using the PLS-SEM approach to respond to the influencing factors that affect oral English teaching. However, this study also revealed that future research might still need to consider limitations, such as being more careful in reviewing nonconvergence of parameter estimations. As such, Peng (Reference Peng2022) suggests reconsidering the quality of data or conflicts between data.

Potential Applications for EMI-Related Contexts

English Medium Instruction is situated in a complex educational system that involves dynamic contextual factors including the intersection of language acquisition and content knowledge (Lo & Lo, Reference Lo and Lo2014), individual differences among learners and instructors (Yuan et al., Reference Yuan, Lu, Jing, Yang and Hao2022), and resources/affordances of learning environments (Min et al., Reference Min, Wang and Liu2019). Table 5.9 presents an overview of interwoven factors dynamically implicated in EMI. Further exploration of their interrelationships and roles would help illustrate a more comprehensive picture of these dynamics for practitioners and reveal what other learning support and professional development might be provided.

Table 5.9 An overview of factor simplicated in EMI contexts

Individual perceptions

Intervention/process

Evaluation

Teacher

oral language
self-efficacy
anxiety
attitude
beliefs

Student

self-efficacy
anxiety
motivation
attitude
beliefs
interest
engagement
behavior
achievement

Teacher–student

communication
instructional language
interaction
classroom management

Teaching

lesson preparation
pedagogy
knowledge dimensions
instructional design
translanguaging
multimodality
assessments

Learning

learning materials
learning strategies
peer support
learning community
sociocultural factors

Environment

course assistance
online learning
learning management system
cross-cultural factors

course satisfaction
learning performance
teaching effectiveness

Overall, we suggest that the PLS-SEM approach has great potential to address ongoing challenges in research and practice in EMI contexts. We conclude with a suggestion for future researchers to consider three dimensions related to the purpose of their EMI studies for using the PLS-SEM approach:

1. The nature of conducting exploratory research in EMI contexts.
2. The need to examine interrelationships of variables in EMI contexts;
3. the goal of developing theory to conceptualize EMI research.

This chapter provided a demonstration of the application of using PLS-SEM in EMI research by exploring student satisfaction and gender differences with respect to student perceptions of course content, internationalization of college, and the ability to use English. As definitions and operationalizations of satisfaction vary between and in relation to educational contexts, there is still more to explore in this area, as well as many others, in order to deepen our understanding of EMI in various instantiations. We invite researchers to consider their study objectives and design in regard to the advantages and requirements of the PLS-SEM approach.

6 Factor Analysis in Writing Research Investigating Grammatical Complexity in Science Writing from EMI Students in a Hong Kong University

Introduction

The purpose of this chapter is to demonstrate the application of exploratory factor analysis (EFA) in investigating grammatical complexity in writing assignments from Hong Kong (HK) students for an English Medium Instruction (EMI) course in a university. English has been used as an important medium of professional communication in many different areas, for instance, business, science, and education. Substantial emphasis has been put on science education in HK universities since the government launched an educational policy to promote science, technology, engineering, and math disciplines in 2015 (Pun & Thomas, Reference Pun and Thomas2020). In universities, a scientific English course is often provided through EMI to help the students succeed in professional communication in science areas. This chapter focuses on a scientific English course (i.e., English for Science) that is offered by a government-funded HK university as a required course for students in science-related majors, such as physics, chemistry, data science, and computing mathematics. The course aims to improve students’ communicative competence in varied scientific contexts, and a major course project is a scientific report, which involves analyzing a scientific issue, presenting experimental results, and interpreting the results for specialist audiences. This written genre was selected to be investigated due to the importance of scientific reports in professional communication in science. For example, scientific writing is not only a method to assess students’ knowledge in science but also an important way for students to become engaged in science communities.

An important research construct in this chapter is grammatical complexity, which has received substantial research attention in academic writing studies in the 2000s and the 2010s (Lan et al., Reference Lan and Sun2019). The majority of empirical studies on grammatical complexity has been conducted in multiple EMI contexts, for instance, programs of English for Academic Purposes (e.g., Parkinson & Musgrave, Reference Parkinson and Musgrave2014), programs of English as a Second Language (e.g., Bulté & Housen, Reference Bulté and Housen2014), and first-year composition programs (e.g., Eckstein & Ferris, Reference Eckstein and Ferris2018; Lan & Sun, Reference Lan and Sun2019). Among these studies, grammatical complexity has played three important roles, which are (1) tracking language development in academic writing (e.g., Lu, Reference Lu2011; Staples et al., Reference Staples, Egbert, Biber and Gray2016); (2) predicting writing quality in exam writing tasks (e.g., Crossley & McNamara, Reference McNamara, Graesser, McCarthy and Cai2014; Kyle & Crossley, Reference Kyle and Crossley2018); and (3) distinguishing different written genres and/or registers (e.g., Biber et al., Reference Biber, Gray and Poonpon2011; Yoon & Polio, Reference Yoon and Polio2017).

There are numerous synthesized reviews on the representation of grammatical complexity in second language (L2) writing, for instance, Wolfe-Quintero et al. (Reference Wolfe-Quintero, Inagaki and Kim1998), Ortega (Reference Ortega2003), Norris and Ortega (Reference Norris and Ortega2009), Biber et al. (Reference Biber, Gray and Poonpon2011), and Lan et al. (Reference Lan and Sun2019). While we acknowledge the importance of these reviews, it is equally (if not more) important to have a data-driven investigation on how grammatical complexity is represented in seminal research models. An EFA is used to inductively explore the representation of this important research construct in scientific research reports of HK university students in an EMI English course. The findings shed light on how to produce effective instructions on grammatical complexity in scientific English courses. More importantly, we not only consider this chapter an example using EFA to investigate grammatical complexity but also regard it as an opportunity to leverage similar research in different EMI courses in the future.

Research Design

The research method we used is a corpus-driven approach. A corpus-driven approach refers to analyzing corpus data to develop theories (Cheng, Reference Cheng2012). No study has inductively analyzed the representation of grammatical complexity, particularly in scientific research reports of HK university students. This corpus-driven approach will have several characteristics as mentioned in Biber et al. (Reference Biber, Conrad and Reppen1998), for example, using a large collection of authentic texts, analyzing the use of language patterns in specific contexts, and making extensive use of computers for analysis. From the corpus perspective, an existing corpus is used in this study – a corpus of scientific research reports written by undergraduate students in science majors in an HK university. From the statistical perspective, an EFA is applied to provide an empirical description on the relationship between latent factors of grammatical complexity and their associated grammatical complexity measures. The EFA allows us to model the representation because (1) it is performed when researchers aim to group variables based on patterned correlations and (2) there are no apparent theories that will be relied on (Tabachnick & Fidell, Reference Tabachnick and Fidell2013). Therefore, with the corpus of scientific reports, we model the representation of grammatical complexity with an EFA.

Two research questions are answered:

1. How many latent factors of grammatical complexity are identified by the best factor model?
2. How can the pattern matrix of the final factor solution demonstrate the relationship among different measures and the latent factors?

Quantitative Methods

Corpus Building and Cleaning

The scientific reports were collected from a scientific English course with EMI in an HK university. This course is mandatory for all students in the College of Science, including diverse majors such as physics, chemistry, mathematics, and data science, to name a few. The scientific reports are based on a specific structure: Introduction, Methods, Results, and Discussion, and the length requirement is 1,500 words. Students can choose specific scientific topics that they prefer to work on for the reports. The corpus has (1) 600 scientific research articles, (2) a mean length of 1,502 words, and (3) a total of 901,200 words. In terms of corpus cleaning, all the scientific reports were converted into plain text files with AntFileConverter. Next, a manual cleaning was conducted to delete identifiable information of the students (e.g., student names, student electronic IDs, and email addresses) and reference lists at the end. Then, a Python program developed by the researchers was used to fix format issues of the files to enhance the quality of the textual data (e.g., fixing line breaks and removing empty lines). By the end of this step, the corpus is cleaned and ready for further analysis.

Linguistic Model

Grammatical complexity was operationalized to the linguistic model in Lu (Reference Lu2011), which is a seminal study on the topic of grammatical complexity. The model was built based on a synthesized review of previous milestone works, such as Wolfe-Quintero et al. (Reference Wolfe-Quintero, Inagaki and Kim1998), Ortega (Reference Ortega2003), and Norris and Ortega (Reference Norris and Ortega2009). Lu (Reference Lu2010) includes fourteen grammatical measures that are commonly used and also effective to distinguish proficiency and/or development levels of L2 student writers. The following are the fourteen measures: mean length of clause (MLC), mean length of sentence (MLS), mean length of T-unit (MLT), clauses per sentence (C/S), clauses per T-unit (C/T), complex T-units per T-unit (CT/T), dependent clause per clause (DC/C), dependent clause per T-unit (DC/T), coordinate phrases per clause (CP/C), coordinate phrases per T-unit (CP/T), T-unit per sentence (T/S), complex nominals per clause (CN/C), complex nominals per T-unit (CN/T), and verb phrases per T-unit (VP/T).

Lu (Reference Lu2010) developed the tool Second Language Syntactic Complexity Analyzer (L2SCA) to automatically calculate the fourteen grammatical complexity measures. The tool is publicly available and free to use. With the application of the L2SCA, the values of the fourteen complexity measures were generated and imported into Statistical Packages of the Social Sciences (SPSS) for the EFA.

Data Analysis

Exploratory factor analysis is performed when researchers aim to group variables based on patterned correlations. As the name of this quantitative method suggests, EFA is an exploratory statistical technique. It normally serves as a technique to consolidate a large number of observed variables and generate hypotheses at the beginning stage of research (Brown, Reference Brown2015). In this chapter, we will conduct an EFA in SPSS to investigate how the fourteen effective and frequent measures in Lu (Reference Lu2010) represent the construct of grammatical complexity in the scientific research reports in an EMI course in an HK university. The EFA includes all suggested steps of factor analysis in previous literature (e.g., Brown, Reference Brown2015; Tabachnick & Fidell, Reference Tabachnick and Fidell2013), namely examining the dataset, selecting variables, performing factor analysis, extracting factors based on the correlation matrix and deciding the number of factors, rotating the factors, and offering interpretations. To have a clear illustration of the EFA, the process of running this EFA was categorized into three stages: (1) examining the dataset, (2) conducting the EFA, and (3) offering the factor interpretation.

Results Report

Stage 1: Examining the Dataset

It is important to check if a dataset is appropriate for an EFA, so Factorability was examined before running the EFA. Thus, Factorability was examined before the application of the EFA. First, the Kaiser–Meyer–Olkin (KMO) test, which is a measure of sampling adequacy, was performed. As EFA results largely depend on the correlation coefficients among observable variables, an adequate sample size is required for the procedure. The estimations of the correlation coefficient are likely to be more reliable with a larger sample size. The test result ranges from 0 to 1; a higher value indicates a more adequate sampling (Loewen & Gonulal, Reference Loewen, Gonulal and Plonsky2015). Field (Reference Field2009) suggested values between 0.50 and 0.70 as mediocre adequacy, 0.70 and 0.80 as good, 0.80 and 0.90 as great, and above 0.90 as perfect. Table 6.1 demonstrates that the result of the KMO was 0.702, indicating an adequacy of sample size for an EFA.

Table 6.1 KMO and Bartlett’s test

KMO test		0.70
Bartlett’s test of sphericity	Chi-square value	23,944.93
	Degree of freedom	91
	Significance	0.00

Second, the Bartlett’s test was used to measure the correlations among the variables (i.e., the fourteen grammatical complexity features). Statistically, the Bartlett’s test examines the hypothesis that the correlation in a correlation matrix of observable variables is zero, which is often performed for datasets with a very small sample size (Tabachnick & Fidell, Reference Tabachnick and Fidell2013). A significant test result with p < 0.05 indicates that the data are factorable and thus suitable for EFA. Table 6.1 shows that the chi-square value is 23,994.93, with the degree of freedom as 91 and the p-value as 0.00. This suggests that our data are factorable, thus being suitable for an EFA. Third, the communalities table was generated, and all the communalities were observed (see Appendix A). In general, communalities that exceed 0.60 indicate that the EFA results account for a high percentage of variance within the variable (Brown, Reference Brown2015). Tabachnick and Fidell (Reference Tabachnick and Fidell2013) suggested that low communalities require at least 300 cases to conduct EFA. For this study, the communalities of all variables are greater than 0.70, showing that the variables (i.e., the fourteen measures) are likely to explain a high percentage of the construct (i.e., grammatical complexity). Thus, we conclude that the current dataset is suitable for the EFA based on the results of the KMO test, the Bartlett’s test, and the communalities of all variables.

Stage 2: Conducting the EFA

Two specific solutions of the EFA were generated in SPSS: (1) the initial solution of the EFA model and (2) the final solution of the EFA model.

The Initial Solution

First, Principal Axis Factoring was selected as the extraction method, and an oblique rotation method, Promax, was selected for the initial analysis.Footnote ¹ Also, small correlation coefficients with absolute values a less than 0.40 were suppressed so that only the variables that significantly loaded on each factor will be shown in the results. With the current setting, the initial solution was achieved with a four-factor EFA model, based on the following observations: (1) The scree plot shows that beyond four factors, the line’s slope in the scree plot becomes not much less sharp (Figure 6.1); and (2) in terms of the eigenvalues of the extracted factors, only four factors have the eigenvalues greater than 1 (see Appendix B for more information). Both criteria suggest that a four-factor model can be achieved for the EFA as an initial solution. In other words, it is reasonable to conclude that four latent factors have the potential to explain grammatical complexity in our dataset.

Figure 6.1 Scree plot

Second, the pattern matrix was generated to present how the variables load on the four extracted factors. This pattern matrix allows the investigation of relationships between the fourteen measures and the four latent dimensions of grammatical complexity. The pattern matrix reports seven significant loading variables for Factor 1, five for Factor 2, two for Factor 3, and three for Factor 4. The factor loadings range from 1.077 to 0.502. However, three variables (MLS, MLT, C/S) cross-load on two factors. The presence of cross-loading variables suggests that the initial solution does not fit the current dataset well, and further action is needed.

The Final Solution

To make the EFA model fit our data and be interpretable, the initial solution is adjusted to achieve a final solution. First, we further adjusted the four-factor model in the initial solution: The cross-loading variables were deleted one by one to reach clear relationships between the factors and the variables (i.e., the grammatical complexity measures). Two cross-loading variables were finally excluded from the EFA model, which are MLS and T/S.Footnote ² When a variable is excluded, the factor loadings of the remaining variables often change. As a result, with MLS and T/S deleted, we finally achieved a three-factor model with a clear pattern among the remaining twelve measures and their three latent factors of grammatical complexity. This adjusted EFA model explains 90.451 percent of the total variance of grammatical complexity. In other words, this three-factor model with twelve measures can explain 90.451 percent of the variance of this theoretical construct. Second, it is important to explain one tricky variable: MLT. In the adjusted EFA model, eleven of the twelve remaining variables significantly load on individual factors, and the only exception is MLT, which cross-loads significantly on Factor 1 (0.544) and Factor 2 (0.677). We decided to keep MLT and finalized our simple solution as a three-factor model with twelve variables. This decision is based on a reason: MLT has always been the most frequent measure of grammatical complexity in L2 writing since the 1990s (Lan et al., Reference Lan and Sun2019; Lu, Reference Lu2011; Wolfe-Quintero et al., Reference Wolfe-Quintero, Inagaki and Kim1998). The exclusion of MLT will make the EFA model contradict empirical findings in existing writing studies. To handle this tricky variable, MLT was included as a cross-loading variable on Factor 1 and Factor 2 in the final EFA model.

Third, the final solution is presented by the pattern matrix of the three-factor model (see Table 6.2). The finalized model includes three factors and the twelve measures of grammatical complexity features, and the three-factor model suggests that there are three latent constructs of grammatical complexity. Factor 1 includes six measures, which are DC/T, C/T. DC/C, CT/T, C/S, and VP/T. Factor 2 includes four measures, namely CN/C, MLC, CN/T, and MLT. Factor 3 includes two measures, which are CP/T and CP/C.

Table 6.2 Pattern matrix

Variables	Factor 1	Factor 2	Factor 3
DC/T	0.998
C/T	0.989
DC/C	0.910
CT/T	0.800
C/S	0.835
VP/T	0.811
CN/C		0.991
MLC		0.885
CN/T		0.850
MLT	0.544	0.677
CP/T			0.976
CP/C			0.960

Stage 3: Offering the Factors Interpretation

An important aspect of the EFA is its interpretability. After obtaining the final solution, we need to carefully name and interpret the factors (Tabachnick & Fidell, Reference Tabachnick and Fidell2013). In terms of the interpretation of the factors, it is important to understand how the twelve measures of grammatical complexity in the EFA model have been described in existing studies. As Lu (Reference Lu2010) summarized three milestone publications on grammatical complexity in L2 writing, we refer to Lu (Reference Lu2010) to provide a description of the twelve measures. Table 6.3 shows the measure types of the twelve measures, which include four subordination measures (i.e., DC/T, C/T, DC/C, and CT/T), one general clause measure (i.e., C/S), one verb phrase measure (i.e., VP/T), two complex nominal measures (i.e., CN/C and CN/T), two length measures (i.e., MLC and MLT), and two coordination measures (i.e., CP/T and CP/C). The following is our interpretation of the three factors:

Factor 1 represents “clausal (subordination) complexity.”
Factor 2 represents “nominal phrasal complexity.”
Factor 3 represents “coordinate phrasal complexity.”

First, Factor 1 is interpreted as “clausal (subordination) complexity” because four measures focus on the subordinate clauses and one measure focuses on clauses in general. The last measure, VP/T, is technically to evaluate verb phrases, but the number of verb phrases in a sentence is often positively correlated with the number of subordinate clauses in a sentence. The more subordinate clauses, the more main verb phrases in a sentence. As a result, we tend to consider VP/T as being indicative of clauses when it is used to measure grammatical complexity.

Table 6.3 The measure types in Lu (Reference Lu2010)

Factor	Variables	Measure types	Interpretation
1 1 1 1 1 1	DC/T	Subordination	Clausal (subordination) complexity
	C/T	Subordination
	DC/C	Subordination
	CT/T	Subordination
	C/S	Clause
	VP/T	Verb phrase
2	CN/C	Complex nominal	Nominal phrasal complexity
2	MLC	Length
2	CN/T	Complex nominal
3	CP/T	Coordination	Coordinate phrasal complexity
3	CP/C	Coordination	Coordinate phrasal complexity
1 & 2	MLT	Length	A holistic measure

Second, Factor 2 is interpreted as “nominal phrasal complexity,” which is due to the fact that the three measures are all related to phrases. CN/C and CN/T are designed to measure complex nominals. To be clear, nominal structures are generally based on noun phrases.Footnote ³ In addition, MLC is a measure of assessing the lengths of clauses. Although this measure tends to be a clausal complexity at first glance, Norris and Ortega (Reference Norris and Ortega2009, p. 561) pointed out that any increase of a clausal length “can only result from the addition of pre- or postmodification within a phrase … or as a result of the use of nominalizations.” Thus, MLC needs to be perceived as a specific measure that taps complexity subclausally; that is, it is about the nominal phrasal complexity.

Third, Factor 3 is interpreted as “coordinate phrasal complexity.” Both measures (CP/T and CP/C) are described as coordination measures in Lu (Reference Lu2010) because they are both designed based on coordinate phrases. Although both coordinate phrases and nominals are phrasal, they are different from each other. Coordinate phrases are constructed based on two phrases linked by coordinators (e.g., a valid approach and a well-designed study), whereas nominals are generally constructed based on head nouns and pre/post-modifiers (e.g., a well-designed study that is based on a valid approach). Therefore, the two coordinate measures perform differently from the nominal measures, and they are loaded on Factor 3.

Fourth, it is important to discuss the tricky measure (MLT), which is equally loaded on both Factor 1 and Factor 2. MLT measures the mean length of T-units, and it is a length-based holistic measure (Norris & Ortega, Reference Norris and Ortega2009; Ortega, Reference Ortega2003). Any increase or decrease of a T-unit can be achieved by an additional use of both clausal and phrasal structure in a sentence, for example, one more embedded subordinate clause in a T-unit or one more embedded preposition phrase to modify a noun phrase in a T-unit. Due to the holistic nature of MLT, we consider it reasonable to have MLT cross-loaded on “clausal (subordination) complexity” and “nominal phrasal complexity.”

Answers to the Two Research Questions

At the end, the answers to the two research questions can be summarized. First, in our best EFA model, there are three latent factors of grammatical complexity, which are (1) Factor 1: “clausal (subordination) complexity,” (2) Factor 2: “nominal phrasal complexity,” (3) and Factor 3: “coordinate phrasal complexity.” Second, the pattern matrix of the EFA model demonstrates that Factor 1 includes six measures (DC/T, C/T, DC/C, CT/T, C/S, and VP/T); Factor 2 includes three (CN/C, MLC, and CN/T); and Factor 3 has two (CP/T and CP/C). As a holistic measure, MLT cross-loads on both Factor 1 and Factor 2. In general, this EFA model aligns with the recent theoretical argument on grammatical complexity in L2 writing. For example, Norris and Ortega (Reference Norris and Ortega2009) mentioned that grammatical complexity should be studied multidimensionally, including subordination, coordination, and phrasal construction. Biber et al. (Reference Biber, Gray and Poonpon2011) proposed that phrasal complexity, along with clausal complexity, should be integrated into the existing analysis of grammatical complexity.

Implications of Using EFA for EMI

The evidence-based EFA model has both research and pedagogical implications for EMI. English has been used as a medium for English instruction in HK universities. For research implications, this EFA model is developed based on scientific research reports from an EMI English course, English for Science. The three latent factors of grammatical complexity suggest the three major types of grammatical structures that the HK students should handle in English writing for this specific context. Traditionally, as a theoretical construct, grammatical complexity has been primarily observed by clausal complexity in academic writing in L2 writing (Biber et al., Reference Biber, Gray and Poonpon2011). In EMI courses of science subjects, we also suggest investigating nominal phrasal complexity and coordinate phrasal complexity to make the representation of grammar use complete. In addition, as English is used as a medium of instruction in academic contexts in HK universities, it is also suggested to conduct an EFA on grammatical complexity with different written genres to expand our understanding of grammar use in other discipline-specific writing. Some important EMI contexts in HK universities include but are not limited to English in business, English in law, English in journalism, and English in humanities and social sciences.

According to the pedagogical implications, the three latent factors of grammatical complexity align with different stages of writing development. In terms of language development in academic writing, there are four sequential stages from beginning to advanced: (1) the use of coordinate structures (Norris & Ortega, Reference Norris and Ortega2009); (2) the use of finite subordinate clauses; (3) the use of nonfinite dependent clauses; and (4) the use of dependent phrases primarily in compressed noun phrases (Biber et al., Reference Biber, Gray and Poonpon2011). As a result, different latent factors of grammatical complexity should be focused on for students at different stages of language development. In the science-related EMI context, it is suggested that instructors first ensure that students can master clausal subordination structures and then move on to offering instructions on the nominal phrasal structures, for example, compressed noun phrases with head nouns modified by phrasal modifiers (e.g., prepositional phrases as postmodifiers). The focus of specific latent factors of grammatical complexity in EMI will facilitate the students to effectively develop their grammar use.

Conclusion

In this chapter, we conducted an EFA to explore the latent factors of grammatical complexity in a corpus of scientific research reports from an EMI English course for science students. The EFA model includes three latent factors, which are clausal (subordination) complexity, nominal phrasal complexity, and coordinate phrasal complexity. Each latent factor is associated with corresponding measures of grammatical complexity from Lu (Reference Lu2010). The only tricky measure is MLT that has cross-loaded on both clausal (subordination) complexity and nominal phrase complexity. There is a clear pattern between the remaining eleven measures and their associated latent factors. This model has implications on research and pedagogy for academic writing in EMI contexts, including the application of EFA in other EMI contexts in HK universities (e.g., business English) and the instruction on different latent factors (e.g., nominal phrasal complexity) to effectively facilitate the students’ grammar use in the science-related EMI context.

7 Analyzing Questionnaire Data through Many-Facet Rasch Measurement: A Pilot Study of Students’ Attitudes toward EMI in the Chinese Higher Education Context

The successful construction of questionnaires, which is a critical step toward gathering research data, would greatly benefit from the guidance of questionnaire design theories. Researchers have advocated for the implementation of practices such as building item pools and conducting pilot studies (Curle & Derakhshan, Reference Curle, Derakhshan, Pun and Curle2022; Dörnyei & Taguchi, Reference Dörnyei and Taguchi2009). In addition, statistics in support of questionnaire validity, which often include Cronbach’s alpha and item analysis statistics, are often reported together with research results to control the quality of questionnaire items.

Successful interpretation of data collected from questionnaires is indispensable to the presentation of research results. A widely used approach to process questionnaire data is computing the mean and standard deviation of each item. The fundamental assumption is that participants have been using the scales on the questionnaires as interval ones. It is highly possible, however, that the scales are used in an ordinal manner. For example, on a 5-point Likert scale ranging from “1 = strongly disagree” to “5 = strongly agree” participants’ perception of the difference between “1 = strongly disagree” and “2 = disagree” may vary from person to person. An average score of 1.5 does not necessarily indicate an equal distance to “1 = strongly disagree” or “2 = disagree.” In this regard, comparing the average score of 1.5 to 1.75 between two different items, or between two different groups, will not always yield accurate results for interpretation.

To provide possible solutions of transforming ordinal scales into interval scales, this chapter introduces Rasch modeling as a method to explain data gathered from questionnaires. In addition to explicating basic knowledge about Rasch Models and model validation statistics, this chapter also features a pilot study measuring undergraduate students’ attitudes toward English Medium Instruction (EMI) in mainland China. Data collected will be analyzed through Many-facet Rasch Measurement (MFRM), with step-by-step instructions provided for future reference.

Fundamental Facts and Principles about Rasch Models

First introduced by Georg Rasch for analyzing responses to dichotomous items, Rasch Models are grounded on the assumption that different people’s responses to a number of items are stochastically independent. Both person and item parameters, which are unknown to researchers, decide the probability of providing a positive response to a question item. Rasch Models are usually explained as additive logistic models, which give recognition to the latent nature of the person parameter. This latent nature can be understood as “a random effect explaining the covariation among items” (Kreiner, Reference Kreiner, Chirstenson, Kreiner and Mesbah2013, p. 10) and is also called “variable” or “trait” interchangeably at times. With the latent nature being the focus for investigation, Rasch Models are characterized by the presence of both item parameter and person parameter on one logit scale. They take both human factors and item performances into account, thus offering explanations for person ability as well as item difficulty (Bond & Fox, Reference Bond and Fox2001).

Though Rasch Models were first devised to process dichotomous data, Polytomous Rasch Models have been used more frequently, especially for processing Likert scales with values of “strongly agree,” “agree,” “disagree,” and “strongly disagree.” Researchers have mentioned conflicting views on whether to categorize Rasch Models as a one-parameter logistic model within the camp of item response theory (IRT) (Andrich, Reference Andrich, Smith and Smith2004; Shaw, Reference Shaw1991). However, Rasch aligns with IRT in transforming ordinal data through a mathematical framework supported by logistic probability. Besides the prerequisite that all measured constructs are valid and well justified, Rasch Models demand an additional list of requirements to be fulfilled, which include the following:

Unidimensionality. An underlying principle for Rasch Models is unidimensionality, which is the quality of measuring one trait or attribute of one person at one time.
Monotonicity. As the latent trait of a person grows, the expected item score the person receives would also increase.
Homogeneity. All participants would unanimously find the most difficult item the hardest to respond to. In other words, the ranking order of the item parameters would remain the same across different levels of latent trait.

The list also includes two other requirements for researchers’ and practitioners’ consideration. First, items on a measurement scale have to be independent of each other; that is, the response given to one item will not be influenced by responses to other items. Second, items demonstrating differential item functioning (DIF) (i.e., people of different subgroups do not find the items to be equally difficult) fail to meet the requirement of Rasch Models. Researchers and practitioners have to carefully evaluate different model options for data processing when these requirements cannot be fulfilled. The Graphical Loglinear Rasch Model (GLLRM), for instance, is often recommended when locally dependent items or items of DIF need to be included for analysis (Nielson et al., Reference Nielson, Garcia and Alastor2021; Upegui-Arango et al., Reference Upegui-Arango, Forkmann, Nielson, Hallensleben, Glaesmer, Spangenberg, Telsmann, Jukef and Boecker2019).

After considering and weighing the prerequisites for Rasch Models, researchers are also faced with model selection decisions. How are Rasch Models further categorized in terms of their functions? Which models have been used in extant literature for analyzing questionnaire data? Table 7.1 provides a classification scheme of commonly used Rasch Models and features brief explanations for their functioning mechanisms. It also enlists the model’s leniency toward restrictions such as unidimensionality, monotonicity, homogeneity, and local independence.

Table 7.1 Rasch Model categories

Category	Rasch Model	Interpretation	Assumption restriction
Dichotomous Rasch Models	The Rasch Model for Dichotomous Items	The positive responses gathered for items depend on both person parameter (i.e., person ability) and item parameter (i.e., item difficulty). Also, the possibility of obtaining a positive response increases with higher person ability and lower item difficulty.
Polytomous Rasch Models	Rating Scale Model (RSM)	RSM includes “a threshold parameter to represent the relative difficulty of a transition from one category of the rating scale to the next” (Eckes, Reference Eckes2011, p. 11).
	Partial Credit Model (PCM)	PCM is applied when responses for each item are different or when the relative difficulty of each category varies.
	Many-Facet Rasch Measurement (MFRM)	MFRM enables a simultaneous analysis of multiple variables that have potential influence on assessment outcomes in synergy. Each variable is known as a facet, and MFRM is thus also categorized as a facet model.
	Graphical Loglinear Rasch Model (GLLRM)	GLLRM is applied when DIF or local dependence is detected in items to be analyzed.	This model is more lenient toward items that show DIF or lack local independence.

To examine the functioning of the selected Rasch Models, researchers would often report a series of statistics that demonstrate the models’ appropriateness. Related statistics are explained as follows:

(a) Estimates for Model Fit: Researchers might be challenged by the fact that empirical observations do not perfectly fit a selected Rasch Model, which occurs more frequently in statistical calculation. The estimates of model fitness are thus carefully evaluated and are often calculated as log-likelihood chi-square in Rasch analysis output. Model unfitness is demonstrated through a small p-value. Another approach is to observe the difference between expected responses and observed responses, which is usually measured by standard deviation according to Eckes (Reference Eckes2011). Model fitness is guaranteed only if the standard deviation of observed data is within the range between −2 and +2.
(b) Estimates for Person Fit and Item Fit: In contrary to “fit,” “misfit” happens when observed data do not match with expected results generated by the models. A “misfitting” person might be a questionnaire respondent who selected “disagree” across all the items on a Likert scale or chose his or her answer in an unpredictable manner in comparison with other participants. A “misfitting” item, however, could be answered correctly with less difficulty for respondents with lower ability, which contradicts the assumption of monotonicity.
Also included in the report of Rasch model statistics are infits and outfits, which are chi-square statistics that check the association between the model and data and are also considered as important indices in the explanation of “fit.” As Boone et al. (Reference Boone, Staver and Yale2014, p. 164) explained, outfit is “a fit statistic sensitive to outliers” such as guessing or thoughtless errors, while infit “focuses less on outliers but more on responses near a given item difficulty (or person ability).” Among all the fit statistics provided by Rasch Model statistics report, both Boone et al. (Reference Boone, Staver and Yale2014) and Linacre (Reference Linacre2012) recommended a closer investigation of Item Outfit MNSQ (Mean-Square) and Person Outfit MNSQ. Another important estimate is standard residuals, which also reveals the difference between observed data and expected data.
In addition to the commonly used statistics that reflect model fitness, Boone and Staver (Reference Boone and Staver2020) synthesized a collection of more advanced indices, or statistical techniques, to demonstrate the robustness and appropriateness of Rasch Models. These statistics also help provide evidence in support of the fundamental assumptions of the Rasch Models, such as unidimensionality and homogeneity. For example:
(c) Principal Component Analysis of Residuals: The purpose of conducting principal component analysis of residuals (PCAR) is to check the dimensionality of a questionnaire instrument. When applying a Rasch Model for measurement purposes, an ideal situation would be that the latent measure will explain all the data information. This is often checked through the indices of Person Fit and Item Fit. The unexplained part of the data, or the residuals, should conform to a normal distribution after standardization.PCAR results indicate that there are more unexplored reasons that could explain the residuals, and researchers need to consider whether additional factors are playing a role in the dataset.
(d) Item Characteristics Curve: Appearing in the form of a nonlinear curve that plots expected item score (y axis) against item measure (x axis), the item characteristics curve (ICC) is adopted as a practice to control the quality of questionnaire items. The expected item score (y axis) presents the raw score a respondent might obtain, while the item measure (x axis) corresponds to the transformed logit scale through Rasch modeling. The ICC thus functions as a bridge between interval numbers and ordinal scale, where researchers can conduct more analysis on the functioning of questionnaire instruments for measurement purposes.

Depending on various research purposes, Rasch Models have been used in connection with other statistical methods such as factor analysis and effect size calculation. The next section of this chapter includes three studies that focus on the development of scales and examination of their use, where Rasch modeling has assisted researchers in retrieving more relevant information.

The Application of Rasch Modeling in Processing Questionnaire Data – Three Case Reports

In the research field of language testing and assessment, Rasch Models have usually been recognized as a method of examining rater reliability and test item validity (Khabbazbashi, Reference Khabbazbashi2017; Yan, Reference Yan2014). This statistical method has also been frequently applied in processing questionnaire data with a psychometric approach. Three studies are discussed in this section, which focus on the development and validation of scales by using Rasch Models.

Case 1

Purpose: Cross-country/culture scale validation.

Krägeloh et al. (Reference Krägeloh, Medvedev, Hill, Webster, Booth and Henning2018) set a goal to investigate the psychometric property of the Revised Competitiveness Index (RCI; Houston et al., Reference Houston, Harris, McIntire and Francis2002), which is a widely used 5-point Likert scale instrument to assess competitiveness. An important direction for the quality control of measurement scales is to confirm that they function equally across different countries or cultures.

Factor analysis was conducted first in this study, which identified a two-factor structure for the scale: Enjoyment of Competition and Contentiousness. A 5-point Likert scale was adopted, ranging from “1 = strongly disagree” to “5 = strongly agree.” The researchers adopted a polytomous unrestricted PCM to analyze item–person threshold distributions. No significantly disordered thresholds were detected, which provided evidence for the internal validity and reliability of the two-factor scale.

Case 2

Purpose: Comparing differences between Rasch and traditional statistical approaches.

Another strand of research examines the difference between Rasch Models and traditional approaches in processing questionnaires. In Zhao et al. (Reference Zhao, Huen and Chan2017), the researchers measured longitudinal gains in college learning for assessing higher education outcomes. A Student Learning Experience Questionnaire (SLEQ) was administered to all first-year undergraduate and final-year undergraduate students in a university at Hong Kong. SLEQ is an instrument that functions with a 5-point Likert scale (1 = “strongly disagree” and 5 = “strongly agree”) and covers three domains – cognitive, social, and value.

Again, the summative scoring approach fails to take the scale’s ordinal nature into consideration while calculating the means and standard deviation of items. Although Rasch analysis is not significantly different from that of the summative approach in measuring the outcome in cognitive and social domains, it was 3 percent more accurate than the observed mean score in capturing changes in the domain of value.

Case 3

Purpose: Reduce the number of scales for quality control.

Rasch modeling is also of great assistance in deciding the number of scales for an instrument. Boone et al. (Reference Boone, Townsend and Staver2011) advocated for a guiding measurement framework for attitudinal instruments and statistics quality control. The researchers examined the self-efficacy subscales of the Preservice Science Teacher Self-Efficacy Beliefs Instrument (STEBI-B) (Enochs & Riggs, Reference Enochs and Riggs1990). Participants for this instrument were undergraduate students who registered for an elementary science methods class.

The researcher emphasized the importance of predicting the location of thirteen items along the construct of self-efficacy, which helps to prevent over-reliance on raw total scores. Changing the number of scales for an item, such as expanding a 5-point Likert scale to a 7-point one, could accidentally enlarge or squeeze differences between raw total scores. This would lead to potential misinterpretation when parametric methods are directly applied. Boone et al. (Reference Boone, Townsend and Staver2011) thus held an extensive discussion on combining categories within a 6-point scale and finally ended up with a 4-point scale. The study also presents a seven-step guideline for using Rasch Models in processing questionnaire data, which integrates Rasch theory into instrument development, item function monitoring, and data preparation for parametric tests. This descriptive study is more of a reflection over the measurement issues in science education and accurately pinpoints the interrelationship between theory and practice.

The Replication of Japanese English Medium of Instruction Attitude Scale in the Higher Education Context of Mainland China – A Case Study

To further explain the application of Rasch modeling in EMI-related research, this chapter features a replicated use of Japanese English Medium of Instruction Attitude Scale (JEMIAS) (Curle, Reference Curle2018) in the higher education context of mainland China. Similar to the situation in Japan, Chinese universities are also experiencing growth in the development of EMI courses and programs. The challenges encountered by instructors, students, and other stakeholders have been described and analyzed in a wide range of studies, such as the regional disparities rooted in the general English education landscape (Kong & Wei, Reference Kong and Wei2019) and divergent language support provided by different types of EMI institutions in the country (McKinley et al., Reference McKinley, Rose and Zhou2021).

The purpose of conducting this case study was twofold: First, the researcher used Rasch modeling to interpret survey data, thus offering more diverse statistical options for examining the validity and reliability of questionnaires. Also, the JEMIAS instrument implemented among Chinese students was slightly adapted such that the data collected will provide useful information for investigating the efficiency of JEMIAS in measuring students’ attitudes across different contexts. The following are the research questions for this case study :

Research Question 1: How can Rasch Models be applied in analysing questionnaire data with the guidance of step-by-step instructions
Research Question 2: Will students’ attitudes toward EMI be influenced by their academic majors?
Research Question 3: With which JEMIAS items would the student respondents agree or disagree the most regarding their attitudes toward EMI?

Questionnaire Distribution

The JEMIAS was distributed among 102 undergraduate students at a four-year university course based in Shanghai, China, who use Mandarin Chinese as their first language and learn English as a foreign language. The university offers undergraduate students various options in academic majors, with main research areas concentrating on liberal arts, social science, science, medical science, and engineering. All the student participants have registered for at least one course taught in English related to their academic major. When enrolled in these courses, students were expecting to embrace higher education internationalization through EMI (Dafouz, Reference Dafouz, Psaltou-Joycey, Agathopoulou and Mattheoudakis2014) and associate English language with disciplinary knowledge in an academic-exposure environment (Coleman et al., Reference Coleman, Hultgren, Li, Tsui and Shaw2018). More detailed demographic information about the participants is presented in Table 7.2.

Table 7.2 Demographic information of participants

	Gender		Academic major
	Female	63	Liberal arts	15
	Male	39	Social science	45
			Science	17
			Engineering	6
			Medical science	19
Total number of participants	102

According to Curle (Reference Curle2018), factor analysis has demonstrated that the items on JEMIAS can be grouped into six identified categories. Table 7.3 presents a synopsis of the item distribution for the six factors, while detailed information about each factor and the corresponding items can be found in Appendix A. For each factor, a common theme is attached in the explanation of the content of the items.

Table 7.3 Factors underlying JEMIAS

Factor	Item	Theme
1	Item 3, Item 14, Item 39, Item 46, Item 48	Negative impact of EMI
2	Item 2, Item 9, Item 10, Item 11, Item 12, Item 18, Item 23, Item 24, Item 26, Item 44, Item 47	Learning opportunities provided by EMI
3	Item 6, Item 8, Item 22, Item 32, Item 34, Item 31	Personal learning restrictions for EMI
4	Item 7, Item 19, Item 20, Item 27, Item 28, Item 29, Item 30	The connection between EMI and international identity
5	Item 38, Item 40	The use of English in daily life
6	Item 4, Item 13, Item 36	EMI-related teaching practices

As JEMIAS was distributed among undergraduate students in a Chinese university, items related to Japan and Japanese have been changed to China and Chinese. In total, two items were revised based on the occurrence of different national events and varied national policies:

Item 19: I think that Japanese people should improve their English so that they can make a good impression during the 2020 Olympics.
Revised Item 19: I think that Chinese people should improve their English so that they can make a good impression during the 2022 Winter Olympics.
Item 20: I think it is not necessary to learn English in Japan because many Japanese companies are not expanding abroad.
Revised Item 20: I think it is necessary to learn English in China because many Chinese companies are expanding abroad.

In the next section, each of the three research questions will be answered, along with detailed instructions for conducting Rasch analysis on questionnaire data.

Questionnaire Results Analysis and Discussion

Research Question 1: How can Rasch Models be applied in analysing questionnaire data with the guidance of step-by-step instructions

Step 1: Model Selection

Selecting an appropriate Rasch Model is the first question researchers may ponder while analyzing the data collected. A 9-point Likert scale is adopted by JEMIAS, and a Polytomous Rasch Model is a reasonable option. The research question of this case study, however, focuses on the potential influence of students’ academic major on their attitudes of EMI. JEMIAS’ final evaluation results might be impacted by individual participants, their academic majors, and specific factors or items in combination. MFRM is thus rendered as a feasible solution, which considers student participants, their academic major, and the item score they reported as three individual facets. The three facets will be presented on one Wright map after MFRM, the connections among which can also be more closely observed by researchers.

While the major advantage of MFRM is to bring all facets together, researchers still need to clarify an important assumption: Will the relative difficulty between scale categories (e.g., 1 = “strongly disagree,” 2 = “disagree”) vary depending on items? Before confirming the difficulty index of each item rendered by MFRM analysis, researchers would often resort to PCM to process questionnaire data. It is possible that some items on the questionnaire would be more difficult for respondents to make a decision. PCM weighs in the factor of relative difficulty for each item and can be added as an additional layer of order in the programming script of conducting MFRM analysis.

Step 2: Script Preparation

Both Winsteps and facets are commonly used statistical tools for running Rasch Models, while facets is known for processing MFRM. Table 7.4 presents the script used for analyzing data collected from JEMIAS-Factor 6, “EMI-related teaching practices,” to which annotation is also attached.

Table 7.4 Annotated script for MFRM

Script	Annotation
Title = ^Data; Output = Data.out Facets = 3 Positive = 1 Noncenter = 1 Pt-biserial = Yes yardstick = (160, 2, −10, 10) Model =?,?, #, R9	Specify the title of the MFRM analysis. Specify the document name of the output. Identify the three facets in the model: respondent, academic major, and questionnaire item. Noncenter: Except for facet 1 (respondent), elements of all facets are centered, which sets the origin of the measurement scale. Positive = 1. For facet 1 (respondent), higher measures mean higher scores. For the other two facets (academic major and questionnaire item), higher measures mean low scores. Model: Any respondent (“?”) can combine with any academic major (“?”) to produce an item score. “#” specifies a PCM, and the item scale has nine levels.
Labels = 1, Respondent 1–102 *	Labels for the three facets. Label 1: Respondents from No. 1 to No. 102.
2, Major 1 = Liberal arts 2 = Social science 3 = Science 4 = Engineering 5 = Medical science *	Label 2: Academic majors coded from 1 to 5.
3, Item 1 = Item 4 2 = Item 13 3 = Item 36 *	Label 3: Item numbers coded from 1 to 3.
Data = “/Users/admin/Desktop/F/data6.txt”	Specify the location of the data input file.

Step 3: Data Input Excerpt

Table 7.5 provides an excerpt of the data input file, which contains the three items in JEMIAS-Factor 6 “EMI-related teaching practices” for analysis. As is shown in Table 7.5, the data input file for MFRM analysis includes vertically arranged numbers. A new column requires to be added before recording scores for multiple items. The column specifies the number of items and is expressed in the form of “1–X.” The item range is thus specified as “1–3” in Table 7.5, which represents Item 4, Item 13, and Item 36 of JEMIAS. Also, the score for all the JEMIAS items ranges from “1” to “9.” While “1” signifies “strongly disagree,” the score of “9” stretches toward the other end of “agree.”

Table 7.5 MFRM data input file excerpt

			JEMIAS-Factor 6
Respondent no.	Academic major	Item no. range	Item 4	Item 13	Item 36
1	1	1–3	4	5	7
2	1	1–3	5	6	7
3	1	1–3	3	7	6
…	…	…	…	…	…
102	5	1–3	7	2	5

Step 4: Subscales and Dimensions

Factor analysis, which has been implemented during the construction phase of JEMIAS, is pivotal in confirming the structure of a questionnaire instrument. An important function of factor analysis is helping researchers identify subscales, or different groups of items. The significance of extracting subscales from one instrument has also been emphasized in Krägeloh et al. (Reference Krägeloh, Medvedev, Hill, Webster, Booth and Henning2018) and Zhao et al. (Reference Zhao, Huen and Chan2017). The “positive” wording and “negative” wording of questionnaire items, however, are also of vital importance to the successful interpretation of questionnaire data. For example, JEMIAS contains items such as “I think that increasing English-taught programs taught in Japan is not practical,” where the statement sentence is phrased in the form of negative wording. Participants’ responses to these items would deviate from those for items such as “Because I learn through English, I can remember and retain my English.” The possible influence of item wording on the item score is also in need of evaluation when researchers attempt to analyze participants’ responses.

In this case study, the researcher first conducted MFRM on the five items grouped as Factor 1: “Negative impact of EMI,” where “negatively impact” and “less” are the keywords for phrasing the item texts. In contrast with “strongly agree,” responses of “strongly disagree” would lead to a positive evaluation of EMI.

Figure 7.1 demonstrates a Wright map generated from MFRM, which presents a preliminary view of the synergic effects of the three facets on the item score. The first column of asterisks (“*”) represents the number of participants who responded to the questionnaire, and the number “0” is set to represent the origin of the logit scale. In the second column, however, it could be noticed that participants’ academic major of “Liberal arts,” “Medical science,” “Science,” and “Social science” are located at approximately the same level. “Engineering,” however, appears beneath the level of “0.” As the script in Table 7.4 specifies, “For facet 1 (respondent), higher measures mean higher scores. For the other two facets (academic major and questionnaire item), higher measures mean low scores.” In the data analysis procedure, a higher logit scale generated from “academic major” or “item” usually leads to a lower item score, and the participants are leaning toward the end of disagreement rather than agreement. In other words, engineering majors tend to endorse a higher score for the items on the questionnaire. They incline to be on the “agree” side in acknowledging the negative impact of EMI and are more likely to perceive the negative influence of EMI.

Figure 7.1 MFRM Wright map for the items of Factor 1

The third column includes a ranking of the five items selected for MFRM analysis. Items located in the upper section indicate a higher logit scale, and participants are more likely to select a lower score for the item. In this scenario, respondents tend to disagree more with Item 46: “I think that an increase in courses taught through English in Chinese universities has had a negative impact on teaching and learning.” In comparison, Item 3, “Because I learn courses in English, I have less time to do things I am interested in” has the lowest logit scale, and the respondents are inclined to agree with the statement.

The last column offers researchers with information about participants’ actual use of the scale. For the five items grouped in Factor 1, respondents are relatively consistent in endorsing “2” and “8” across the items. The distance between “1” and “2” and that between “8” and “9”, however, are much wider, which implies that “1” and “9” are much less frequently used. If the short lines under “2” or over “8” fluctuate violently across items, researchers may need to investigate more participants’ use of scales. In this case, participants have their own interpretation of the scale and cannot agree to a consensus. It is also possible that the internal validity of each item calls for a more detailed examination.

Step 5: More Statistical Support

In addition to the Wright map presented in Figure 7.1, the output data file of MFRM analysis includes more statistics as evidence for model efficiency. Table 7.6 is a logit scale report for the two facets of academic majors and items.

Table 7.6 Logit scale and infit/outfit information for items of Factor 1

Academic major	Logit scale	Infit MNSQ	Outfit MNSQ	Item	Logit scale	Infit MNSQ	Outfit MNSQ
Social science	0.23	1.33	1.33	Item 46	0.40	1.00	0.97
Medical science	0.12	1.12	1.10	Item 39	0.10	0.61	0.61
Science	0.11	1.11	1.11	Item 14	−0.03	1.10	1.18
Liberal arts	0.07	1.02	1.02	Item 48	−0.12	0.89	0.92
Engineering	−0.30	0.38	0.49	Item 3	−0.15	1.34	1.37

Both the Infit MNSQ and Outfit MNSQ are within the range between −2 and +2 for the facets “Academic major” and “Item.” As for the facet “Respondent,” less than 10 percent of the respondents have their Infit MNSQ and Outfit MNSQ beyond the range between −2 and +2. All of the statistics have provided evidence for data quality.

Logit scale numbers have also revealed the ranking order of the two facets. Students with an academic background in social science tend to number higher on the logit scale, or disagree that EMI will have a negative impact on their learning experience.

Research Question 1: Will students’ attitudes toward EMI be influenced by their academic majors?

Information regarding logit scales across different academic majors is presented in Table 7.7. It is not hard to notice that the range of logit scale is restricted for most of the item groups, as the difference between the highest and lowest logit scale is mostly narrower than 0.5. A tentative conclusion could be made that Chinese undergraduate students’ attitudes toward EMI vary little across academic majors.

Table 7.7 Logit scales across different academic majors

Factor	Theme	Academic major	Logit scale
1	Negative impact of EMI	Social science	0.23
		Medical science	0.12
		Science	0.11
		Liberal arts	0.07
		Engineering	−0.30
2	Learning opportunities provided by EMI	Liberal arts	0.11
		Social science	0.07
		Engineering	0.07
		Medical science	−0.12
		Science	−0.13
3	Personal learning restrictions for EMI	Science	0.17
		Social science	0.12
		Liberal arts	−0.09
		Medical science	−0.10
		Engineering	−0.11
4	The connection between EMI and international identity	Engineering	0.06
		Liberal arts	0.05
		Social science	0.03
		Science	−0.04
		Medical science	−0.09
5	The use of English in daily life	Engineering	0.74
		Social science	0.38
		Medical science	−0.14
		Liberal arts	−0.31
		Science	−0.67
6	EMI-related teaching practices	Social science	0.16
		Medical science	0.00
		Science	0.00
		Liberal arts	−0.02
		Engineering	−0.14

Among the six different factors, however, items grouped in Factor 1 and Factor 5 have the largest logit scale differences. As for Factor 1, “negative impact of EMI,” students majoring in social science tend to disagree more that EMI would lead to a negative influence on their academic work. To some extent, this phenomenon corroborates with the logit scale information generated for Factor 6, “EMI-related teaching practices.” Two of the three items grouped in Factor 6 are negatively worded responses to classroom interaction with professors. Again, social science majors rank the top in disagreeing with the negative impacts of EMI teaching practices.

Logit scales for items grouped in Factor 5 also show a relatively larger difference between the highest and the lowest point. Engineering majors tend to endorse a lower item score and disagree with the fact that English is not frequently used in China. Science majors, however, incline to report the opposite. Most of the educators and policymakers in higher education institutions would probably place students of these two academic majors in the same category, especially when students are registering for courses of English for Specific Purposes or English for Academic Purposes. It is interesting to observe that undergraduate students’ perception of the use of English in daily life may not be as homogeneous as is expected.

More information regarding Wright maps for each item group is attached in Appendix B. MFRM and statistics generated have produced evidence that JEMIAS works efficiently for students of various disciplinary backgrounds, as the range of logit scale is consistently restricted across participants in different academic major groups.

Research Question 3: With which JEMIAS items would the student respondents agree or disagree the most regarding their attitudes toward EMI?

The logit scales extracted from MFRM analysis, as provided in Tables 7.7 and 7.8, would help researchers identify items that would elicit the strongest sense of agreement or disagreement. According to Table 7.7, the range of logit scale difference across items is more scattered for some factors than others. For example, Item 6 grouped in Factor 3, “Personal learning restrictions for EMI,” “I feel my English is not good enough to learn a course through English,” presents a logit scale of 0.49, indicating that the possibility for participants to select “disagree” and “strongly disagree” is larger than their opting for “agree” or “strongly agree.” To clarify, the participants recruited for this pilot study do not have a strong feeling that their English proficiency level is a road block against successful EMI. However, Item 13, “I think that in courses taught through English students speak a lot less than in courses taught in Chinese,” which is grouped in Factor 6, “EMI-related teaching practices,” renders a logit scale of −0.39. This low logit scale suggests that participants are likely to give a high item score and agree with the statement. Although students tend not to perceive their English proficiency level as a hindrance for learning through EMI, they still reported the possible decline of oral output when English is extensively used as the instructional language in classrooms.

Table 7.8 Logit scales across different items

Factor	Theme	Item	Logit scale
1	Negative impact of EMI	Item 46	0.40
		Item 39	0.10
		Item 14	−0.03
		Item 48	−0.12
		Item 3	−0.15
2	Learning opportunities provided by EMI	Item 10	0.25
		Item 23	0.24
		Item 11	0.18
		Item 18	0.13
		Item 9	0.03
		Item 12	0.02
		Item 24	0.01
		Item 2	−0.07
		Item 26	−0.24
		Item 44	−0.24
		Item 47	−0.28
3	Personal learning restrictions for EMI	Item 6	0.49
		Item 41	0.09
		Item 32	0.00
		Item 22	−0.01
		Item 8	−0.24
		Item 34	−0.33
4	The connection between EMI and international identity	Item 27	0.68
		Item 28	0.35
		Item 19	0.24
		Item 20	−0.28
		Item 30	−0.31
		Item 29	−0.31
		Item 7	−0.37
5	The use of English in daily life	Item 38	0.30
5	The use of English in daily life	Item 40	0.30
6	EMI-related teaching practices	Item 36	0.46
		Item 4	−0.07
		Item 13	−0.39

The logit scales for items grouped in Factor 4, which focus on the connection between EMI or English learning and China’s international identity, also spread relatively more widely. Student participants tend to endorse a lower item score for Item 27, “I think that students will not need English to work for a local Chinese company” (logit scale 0.68). Participants would more probably disagree with the statement and still recognize the functionality of English even in a Chinese local context. Students’ perception of the English language has also been reflected by the logit scale generated from Item 7, “I think Chinese students should be able to speak (should learn) English because it is a global language.” The logit scale of –0.37 indicates that participants tend to agree with the statement, and they highlight the importance of English learning. The scattering difference between logit scales for items grouped in Factor 4 could serve as evidence for students’ emphasis on learning and using English. However, the COVID-19 pandemic has been extremely disruptive for the application process of college study abroad scholarships and graduate school programs in English-speaking countries, which have strongly motivated undergraduate students to familiarize themselves with English as the instructional language. The possible impact of COVID-19 on Chinese undergraduate students’ attitudes toward learning English, along with the rapid changes of policies and the global situation, is in need of careful examination in the future.

MFRM analysis results for Factor 6, “EMI-related teaching practices,” have also offered insights into Chinese undergraduate students’ attitudes toward classroom activities. Item 36, “I think that professors teaching through English in China teach exactly the same way as if they would teach in America,” receives the highest logit scale of 0.46. Item 13, “I think in courses taught through English students speak a lot less than in courses taught in Chinese,” receives the lowest logit scale of −0.39. The result for Item 13 shows that students are likely to agree that they would speak English a lot less in EMI classrooms. In addition to possible challenges caused by language use, the difference in class configuration and teaching style might also result in a lower estimate of oral output in classrooms. While courses delivered in English-speaking countries such as the United States might seek for active in-class participation and group discussion, most of the courses in Chinese universities are probably lecture-based and have a large class capacity. The result of Item 36 (logit scale 0.46) seems to corroborate with the prior analysis, as students seem to disagree more that professors are teaching exactly the same as they are in the United States. Analysis results for Factor 6 would definitely lead to more research opportunities to observe how EMI courses are designed and delivered in the Chinese context, as the analysis in this case study has presented multiple paths for result interpretation.

Concluding Remarks

As one of the categories of Rasch Models, MFRM has provided researchers with a practical option to process questionnaire data. Through transforming ordinal scales into interval scales, Rasch modeling preserves the property of “ranking orders” for questionnaire items. However, the final logit scale still enables researchers to conduct numerical and even parametric comparison across different groups of participants or items. In addition, MFRM captures the concept of “facet” and presents Wright maps that integrate all related facets (e.g., individual participants, participants’ background information, items) into a panoramic picture. Researchers are offered more opportunities to narrate the “story” behind scales, thus expanding abundant space for data interpretation.

The pilot study included in this chapter acts as an instructional case for unpacking the whole process of conducting MFRM analysis. The interpretation of logit scales, however, also elicits more questions regarding the EMI context in the Chinese mainland. First, the adapted JEMIAS does not reveal prominent differences across Chinese undergraduate students’ disciplinary background. The scale is measuring student participants’ most fundamental perceptions and attitudes toward EMI. Second, individual items related to specific factors, such as the relationship between EMI instruction and China’s international identity as well as EMI-related teaching practices, call for further investigation. A more granular analysis of Chinese students’ concerns, opinions, and attitudes toward EMI would hopefully lead to stronger instructional support and more efficient plans for designing relevant programs in the future.

8 Questionnaire Development and Analysis in EMI Research A Study of EMI Learner Motivation and Attitudes

Introduction

The emergence of English Medium Instruction (EMI) education attracts a group of language education scholars who adopt a wide range of research methodologies to investigate various EMI issues, including the effectiveness of EMI programs (e.g., Lo & Lo, Reference Lo and Lo2014) and perceptions, experience, and characteristics of different stakeholders (e.g., Farrell, Reference Farrell2020; Pun, Reference Pun2022). One popular research methodology, which is widely adopted in the current EMI literature, is questionnaire-based research (e.g., Macaro et al., Reference Macaro, Akincioglu and Han2019; Xie & Curle, Reference Xie and Curle2022). While questionnaire findings depict certain features of a large population (e.g., EMI student motivation), questionnaire-based EMI research may be doubted regarding its validity and reliability. Such critiques are due to researchers’ inadvertent negligence of important steps during questionnaire development, validation, and analysis procedures (Curle & Derakhshan, Reference Curle, Derakhshan, Pun and Curle2021). For example, many EMI researchers are not aware of the necessity of a pilot study, or the validity of the questionnaire is not well documented (Macaro et al., Reference Macaro and Akincioglu2018).

To offer some suggestions on how to conduct questionnaire-based EMI research, this chapter reports an empirical study on Chinese EMI students’ attitudes and motivation toward learning the content subject matter in English. With the EMI study as an example, the researchers intend to demonstrate specific steps in questionnaire development (Dörnyei, Reference Dörnyei2007) and methods of questionnaire validation and analysis in the pilot stage.

The Research Project

This research project is to examine and compare Chinese EMI students’ motivation and attitudes toward EMI education. The role of different demographic variables (i.e., discipline, gender, study level) in EMI learner motivation is also investigated. Methodologically, it demonstrates how quantitative approaches and methods (mainly the questionnaire study) can be adopted in EMI research and how the quantitative data can be analyzed with inferential statistical analyses. The motivations behind the research project are multifold. First, along with the emergence of EMI as a vibrant research field (Macaro, Reference Macaro2022), a mushrooming literature has explored EMI policymaking and curriculum design as well as EMI teachers’ classroom practice and continuing development (e.g., De Costa et al., Reference De Costa, Green-Eneix and Li2020; Pun & Thomas, Reference Pun and Thomas2020; Yuan et al., Reference Yuan, Li, Peng and Qiu2022). While such a line of inquiry sheds light on the effective implementation and reforms of EMI in higher education, it is equally crucial to consider students’ perspectives and experiences in diverse EMI programs. In particular, in view of the pivotal role of motivation in shaping students’ learning choices and subsequent engagement and efforts (Gardner, Reference Gardner2004), there is a need to investigate students’ motivation and attitudes as well as the influencing factors in EMI environments.

Second, a cluster of recent studies (e.g., Kim et al., Reference Kim, Kweon and Kim2017; Zhang & Pladevall-Ballester, Reference Zhang and Pladevall-Ballester2022) has revealed students’ positive attitudes toward EMI, their lack of self-confidence about English use, and their tendency to rely on first language (L1) in EMI classrooms. One notable limitation of these studies is their focus on students’ general attitudes and motivation (e.g., about course experience and language use) without probing the complexities involved in their perceptions shaped by various demographic factors, personal dispositions, and contextual forces. For instance, Jiang et al. (Reference Jiang, Zhang and May2019) investigated students’ motivation toward EMI in the Chinese context, and their research design and focus were primarily concerned with students’ language learning needs as the major determinant of students’ EMI motivation. In Macaro and Akincioglu’s (Reference Macaro and Akincioglu2018) survey study, students’ EMI motivation was examined in relation to different variables including gender, year group, and university type. Although the findings demonstrated the strong mediating effects of such factors on students’ choices to enter EMI programs, the study cannot fully capture students’ motivation influenced by intricate, dynamic sociocultural environments. Such a gap gives impetus to the present research project, which, informed by relevant motivational theories, looks into the complexities of students’ EMI motivation in Chinese higher education settings. Particularly, our study not only focuses on the linguistic aspect of EMI (e.g., students’ attitudes toward English and their language learning needs) but also treats EMI as a powerful form of educational experience, in which students’ multiple types of motivation (e.g., integrative and instrumental motivation) may emerge and evolve in their situated institutional and sociocultural settings.

Third, China presents an interesting and meaningful context for EMI research given its fervent pursuit of internationalization of higher education in the past decades. As reported by Jablonkai and Hou (Reference Jablonkai and Hou2023) in their recent review article on EMI in China, around 98 percent of Chinese universities have made active attempts to launch bilingual or EMI programs, while some top universities even offer over 200 EMI courses in a variety of disciplines to attract international students and enhance their level of internationalization. On the other hand, EMI is not without controversy in China. For instance, given the status of English as a Foreign Language (EFL) in China, some university teachers and students may show resistance toward EMI because it may hinder students’ comprehension and application of content knowledge. Some even hold the perception that English as a colonizing and imperialist language can pose threats to the local knowledge and culture in Chinese society (Yuan, Reference Yuan2020). Such beliefs speak to the complexities embedded in the provision of EMI, which not only is a matter of instruction language but also intertwines with diverse intuitional, social, and cultural issues. By exploring students’ motivation and attitudes toward EMI against such a complex backdrop, the project can bring relevant and practical insights to EMI teachers and university managers in both China and similar EFL settings (e.g., Japan and Korea).

Quantitative Research Design

To shed light on university students’ EMI motivations and attitudes in Chinese universities, the research team adopted a mixed methods approach, which involved a large-scale questionnaire survey and follow-up case study (e.g., interviews and observations). Given the scope of the book focusing on quantitative research methods in EMI research, this chapter presents and discusses the design and implementation of a questionnaire survey with practical implications and directions for future research.

We chose to conduct the questionnaire survey based on the following reasons. First, the questionnaire survey study is cost-effective (Dörnyei & Taguchi, Reference Dörnyei and Taguchi2009), and it enables the team to grasp a general understanding of motivation data from EMI students in a wide range of contexts and disciplines. Second, the researchers can conduct inferential statistical analyses to examine EMI motivation and attitudes and explore the role of demographic variables within the constructs. The researchers’ bias can be avoided. Third, the questionnaire survey also serves as the first stage of a larger EMI research project. Based on the findings, the researchers interviewed and observed some of the questionnaire participants to capture in-depth information about their motivational constructs (cf. the explanatory sequential design in Creswell, Reference Creswell2015).

Method in Detail

The Participants

In this study, the scale was administered to 541 EMI students (mean age: twenty-one years old) from three universities in China who had been learning English for more than six years. All three universities are top-tier universities in China (among the top thirty-six in the list of Double First Class Universities in China). They are located one each in northern China, middle China, and southern China because the researchers intended to avoid the potential influence of geographic variables (cf. Qiu & Xu, Reference Qiu and Xu2023) and depict a complete picture of Chinese EMI students’ attitudes and motivation. Among the participants, 290 were male and 251 were female. Also, 328 of them were from hard science disciplines (e.g., chemistry, mechanical engineering, computer science), while 213 were from soft science disciplines (e.g., philosophy, accounting, journalism) (Becher & Trowler, Reference Becher and Trowler2001). The population also consisted of 348 undergraduate students and 193 postgraduate students.

Questionnaire Adaptation

The questionnaire survey consisted of two sections: demographic information of participants and a scale on EMI student motivation and attitudes. Five question items were developed to collect the demographic information of the participants, including gender, age, the tier of the university they enrolled in, their study level (bachelor’s degree program, master’s degree program, or doctoral degree program), and discipline.

The scale on EMI motivation was developed based on the Attitude/Motivation Test Battery (AMTB) developed by Gardner (Reference Gardner2004), which is based on a socioeducational model in second language (L2) research. Underlying this model, Gardner and Lambert (Reference Gardner and Lambert1972) propose that learners’ attitudes toward the L2 and its speakers affect their motivation, which, as a result, imposes an impact on learner achievement. The questionnaire has been proved to be effective to measure L2 learners’ attitudes and motivation (AI-Hoorie & MacIntyre, Reference AI-Hoorie and MacIntyre2019; Gardner & MacIntyre, Reference Gardner and MacIntyre1991). Recently, a few studies also adapted the questionnaire into Content and Language Integrated Learning contexts (e.g., Isidro & Lasagabaster, Reference Isidro and Lasagabaster2022) and EMI contexts (e.g., Lei & Hu, Reference Lei and Hu2014).

Three dimensions of the questionnaire were adopted: (1) attitudes toward English language learning, (2) integrative motivation, and (3) instrumental motivation. Attitudes are related to “the individual’s reaction to anything associated with the immediate context in which the language is taught,” such as positive (favorable) or negative (unfavorable) considerations for L2 learning (Masgoret & Gardner, Reference Masgoret and Gardner2003, pp. 172–173). Integrative motivation refers to learners’ openness to integrate into the community of native speakers (e.g., to appreciate the culture of native speakers), while instrumental motivation is associated with the pragmatic driving force of learning (e.g., obtaining a better job). These concepts were adapted to the EMI context in this study. Students’ attitudes toward learning content knowledge in English and their integrative and instrumental orientation to EMI education were the foci.

In the 5-point Likert scale questionnaire, the researchers adapted the question items of Gardner (Reference Gardner2004) from the context of English as a Second/Foreign Language to learning the content subject in English. To revise the question items, they first did a literature search and read through academic works related to EMI and Gardner’s socioeducational model (e.g., Doiz et al., Reference Doiz, Lasagabaster and Sierra2014; Gardner, Reference Gardner2010). Following the literature search, meetings were held to discuss how to adapt the question items of Gardner (Reference Gardner2004) to the EMI context. For example, “Learning English is really great” was adapted as “Learning subject content in English is really great.” Four items were developed for each dimension, and participants were asked to select whether they agreed or disagreed with each item (5 = strongly agree, 4 = agree, 3 = neither agree nor disagree, 2 = disagree, 1 = strongly disagree). The question items were also translated into Chinese by the first author, and the translation was circulated among the research team for comments. One student assistant back-translated the Chinese items into English, which was compared with the original items. After revisions, the drafted questionnaire was sent to three experts in the fields of EMI and language education and two EMI teachers, who are familiar with questionnaire development, for feedback on questionnaire design, wording, and translation. Online meetings were held for researchers to clarify experts’ confusion, and the questionnaire was finalized for the pilot study.

The Pilot Test

The questionnaire was piloted with 230 EMI students from a university in China, who held prior experience of attending at least one EMI course. They were recruited by their English course tutors and voluntarily participated in this study. Their consent was sought, and the research team also informed them that the data collected would be used for research purposes only and they could complete the questionnaire online anonymously. Exploratory factor analysis (EFA) and the Cronbach’s alpha test were performed with the data collected. EFA is “one of a family of multivariate statistical methods that attempts to … identify the common factors that explain the order and structure among measured variables” (Watkins, Reference Watkins2018, p. 219). EFA measures the validity of a questionnaire, and the Cronbach’s alpha test measures the reliability of the scale. According to Hair et al. (Reference Hair, Anderson, Babin and Black2010), if the Cronbach’s alpha is larger than 0.9, the questionnaire demonstrates excellent reliability. Hence, the scale of this study (Cronbach’s alpha = 0.956) was extremely reliable. Based on EFA, the researchers excluded one question item from each dimension because each was loaded on two different dimensions. The scale was finalized with nine question items, with three items for each dimension.

Data Collection

To collect the questionnaire data for the main study, we first contacted EMI or English course teachers who we knew about and who were working at the three universities via email. And those who were willing to help with data collection invited their students to complete the questionnaire online or sought their colleagues’ help with the recruitment of participants. The research participants were expected to fulfill the following selection criteria. First, they should be full-time undergraduate or postgraduate students who had completed at least one EMI course or were taking EMI courses when the data were collected. Second, they should be learning English as their second/foreign language and speak Chinese as their first language. Before the data collection, the participants were briefed about the research project and informed of the confidentiality of their input. They were also informed that their input would not be sent to their teachers and thus would not affect their academic achievement. All participants completed the questionnaire online. We first collected the data of 561 students, but 20 were excluded after sorting out the data because their input was incomplete.

Data Analysis

The data analysis was conducted in three steps. The first one was to reconfirm the validity and reliability of the scale. Second, the correlations of the three dimensions were investigated. Third, the attitudes and motivation of EMI students with different demographic variables were examined.

Exploratory factor analysis and reliability analysis were performed with IBM SPSS Statistics software version 23 to investigate the validity and reliability of the scale. After validating the scale, we conducted a correlation analysis with SPSS software to explore the relationships between the three constructs. Independent t-tests were then conducted to explore the attitudes and motivation of EMI students with different demographic backgrounds, including gender (male versus female), level of study (undergraduate versus postgraduate), and disciplines (hard science versus soft science). The results are reported in the following section.

Data Reporting

Descriptive Statistics

Table 8.1 depicts the means and standard deviations for the eight question items. One question item was removed after EFA because of the cross-loadings on different question items. In general, the research participants held positive attitudes toward content subject learning in English (Mean = 3.52) and were motivated (Mean > 3.50). A significantly higher mean score was reported on instrumental motivation (Mean = 3.98) than on integrative motivation (Mean = 3.55), t = −12.284, p < 0.001, Cohen’s d = 0.44, which indicates that the students’ EMI motivation was more likely to be driven by pragmatic reasons (e.g., job hunting). According to Cohen (Reference Cohen1988), the thresholds for small, medium, and large effect sizes are 0.20, 0.50, and 0.80. Therefore, a small effect size is observed.

Table 8.1 Descriptive statistics of the scale

Item	Mean	Standard deviation
*Attitudes*	3.52	1.04
1. Learning subject content in English is really great.	3.54	1.15
2. I plan to learn as much subject content in English as possible.	3.65	1.07
3. I love learning subject content in English.	3.38	1.20
*Integrative orientation*	3.55	1.04
4. Studying subject content in English can be important to me because it will allow me to be more at ease with native speakers of English.	3.60	1.15
5. Studying subject content in English can be important for me because it will enable me to better understand and appreciate English culture.	3.61	1.10
6. Studying subject content in English can be important for me because I will be able to participate more freely in the activities of other cultural groups.	3.43	1.16
*Instrumental orientation*	3.98	0.92
7. Studying subject content in English can be important for me because I will need it for my future career.	4.02	0.94
8. Studying subject content in English can be important to me because I think it will someday be useful in getting a good job.	3.94	0.99

Scale Validation

The data were first input into SPSS version 23 software for EFA, which aimed at examining the validity of the scale. Eight question items relating to EMI attitudes and motivation were factor analyzed using principal components analysis with varimax rotation. Kaiser–Meyer–Olkin measure of sampling adequacy was 0.91, above the commonly recommended value of 0.6, and Bartlett’s test of sphericity was significant (χ² (28) = 3599.940, p < 0.001). Three factors were requested, based on the fact that these items were adapted from three constructs: attitudes toward learning content knowledge in English, integrative motivation of learning content knowledge in English, and instrumental motivation of learning content knowledge in English. After rotation, the first factor accounted for 31 percent of the variance, the second factor for 28 percent, and the third factor for 26 percent. Table 8.2 displays the items and factor loadings for the rotated factors, with loadings less than 0.50 omitted to improve clarity.

Table 8.2 Factor loadings from principal axis factor analysis with varimax rotation for a three-factor solution (N = 541)

Item	Factor 1	Factor 2	Factor 3	Communalities
*Attitudes*
1. Learning subject content in English is really great.	0.793			0.83
2. I plan to learn as much subject content in English as possible.	0.707			0.82
3. I love learning subject content in English.	0.770			0.86
*Integrative orientation*
4. Studying subject content in English can be important to me because it will allow me to be more at ease with native speakers of English.		0.765		0.81
5. Studying subject content in English can be important for me because it will enable me to better understand and appreciate English culture.		0.782		0.83
6. Studying subject content in English can be important for me because I will be able to participate more freely in the activities of other cultural groups.		0.824		0.88
*Instrumental orientation*
7. Studying subject content in English can be important for me because I will need it for my future career.			0.857	0.91
8. Studying subject content in English can be important to me because I think it will someday be useful in getting a good job.			0.865	0.91
Eigenvalues	2.495	2.272	2.083
% of variance	31.18	28.40	26.04

Note: Loadings < 0.50 are omitted.

The first factor, which seems to index attitudes toward learning subject content in English, had strong loadings on the first three items, with positive loadings. Item 1 had the highest loading, while Item 2 had the lowest. The second factor, which is related to the integrative motivation of learning the subject matter in English, had high loadings on Items 4–6. Item 6 had the highest loading, but the loading on Item 4 was slightly lower than on the others. The third factor, which seemed to be the instrumental motivation of learning subject content in English, loaded highly on the last two items. The loadings were similar. One question item related to instrumental motivation was excluded because it was loaded on all three factors, which was unsatisfactory. Therefore, generally speaking, the scale was effective to measure EMI students’ attitudes and motivation.

The reliability of the questionnaire was also measured. Cronbach’s alpha for the whole scale was 0.937, indicating excellent reliability. We also measured the reliability of different constructs. Satisfactorily, high reliability was found (attitudes = 0.894, integrative = 0.900, instrumental = 0.901). Thus, the results demonstrated that the results obtained from the scale can be replicated.

Correlation of Different Constructs

Pearson’s correlation analysis was conducted to explore the relations of the three constructs, and the results are reported in Table 8.3. There were strong positive correlations (larger than 0.50) among the three constructs, implying that enhancing one of the three aspects (e.g., attitudes) would lead to higher degrees of others (e.g., integrative motivation, instrumental motivation).

Table 8.3 Correlation analysis

	Attitudes	Integrative orientation	Instrumental orientation
Attitudes	1	0.797**	0.692**
Integrative orientation	0.797**	1	0.656**
Instrumental orientation	0.692**	0.656**	1

Note: ** means p < 0.001.

Demographic Information

Independent t-tests were conducted to compare the attitudes and motivation of EMI students of different genders, levels of study, and disciplines. Descriptive information on gender differences is reported in Table 8.4, and the inferential statistical analysis results suggested that female students held a higher degree of positive attitudes toward learning content knowledge in English than their male counterparts with a small effect size (t = −2.812, p = 0.005, d = 0.24). They also had higher levels of EMI integrative motivation than male students with a small effect size (t = −2.467, p = 0.014, d = 0.21). The same trend was observed for instrumental motivation with a small effect size (t = −3.017, p = 0.003, d = 0.26).

Table 8.4 Gender differences

	Gender	N	Mean	Standard deviation
Attitudes	Male	290	3.41	1.02
Attitudes	Female	251	3.66	1.04
Integrative motivation	Male	290	3.44	1.04
Integrative motivation	Female	251	3.66	1.03
Instrumental motivation	Male	290	3.87	0.91
Instrumental motivation	Female	251	4.11	0.91

In terms of the independent t-test results for undergraduate students and postgraduate students in EMI attitudes and motivation, no significant differences (p > 0.05) were found for the attitudes (p = 0.344), integrative motivation of EMI learning (p = 0.794), and instrumental motivation of EMI learning (p = 0.703). It seems that both undergraduate and postgraduate students held similar levels of positive attitudes and motivation toward learning the subject matter in English, as shown in Table 8.5.

Table 8.5 Differences of study levels

	Level	N	Mean	Standard deviation
Attitudes	Undergraduate	348	3.49	1.05
Attitudes	Postgraduate	193	3.58	1.01
Integrative motivation	Undergraduate	348	3.55	1.03
Integrative motivation	Postgraduate	193	3.53	1.06
Instrumental motivation	Undergraduate	348	3.99	0.89
Instrumental motivation	Postgraduate	193	3.96	0.96

However, disciplinary differences were spotted from the independent t-tests. Table 8.6 presents the means and standard deviations for different constructs and different disciplines. Soft science students held more positive attitudes toward learning content knowledge in English than their hard science counterparts with a small effect size (t = −1.970, p = 0.049, d = 0.17). They were also integratively and instrumentally more motivated than hard science students with small effect sizes (integrative orientation: t = −2.354, p = 0.019, d = 0.20; instrumental orientation: t = −3.051, p = 0.002, d = 0.27).

Table 8.6 Disciplinary differences

	Gender	N	Mean	Standard deviation
Attitudes	Hard	328	3.45	0.99
Attitudes	Soft	213	3.63	1.10
Integrative motivation	Hard	328	3.46	1.03
Integrative motivation	Soft	213	3.67	1.04
Instrumental motivation	Hard	328	3.88	0.89
Instrumental motivation	Soft	213	4.13	0.93

Summary

In this questionnaire-based study, we adapted a scale on EMI students’ learning attitudes and motivation and administered it to 541 students in three Chinese universities. With quantitative analyses, we validated the scale and found that the learners’ positive EMI attitudes were strongly correlated with high instrumental and integrative orientations of EMI motivation. The results also revealed gender differences and disciplinary differences in EMI attitudes and motivation.

The research findings contribute to EMI research with empirical evidence of EMI students’ attitudes toward learning content matter in English and two kinds of EMI motivation; these findings demonstrate the essentiality of considering learners’ demographic information (i.e., gender, disciplinary backgrounds) in order to develop insights into their EMI attitudes and motivational constructs. Methodologically, this study reveals that it is possible to adapt English as foreign/second language scales to EMI contexts, but the adapted scales should be piloted and the validity and reliability need to be measured before they can be formally employed in EMI research. Pedagogically, given the positive correlations among attitudes, integrative motivation, and instrumental motivation of learning content knowledge in English, enhancing EMI students’ integrative orientation of motivation may lead to higher levels of EMI attitudes and instrumental motivation and vice versa. The findings of this study also imply that there is a more urgent need to enhance the EMI attitudes and motivation of male students and those majoring in hard science disciplines because of their relatively lower attitude and motivation levels than female peers and soft science students.

Implications of Conducting EMI Questionnaire-Based Research

Questionnaire-based research is an appropriate research methodology for investigating various EMI research topics, including but not limited to EMI student motivation (Rose et al., Reference Rose, Curle, Aizawa and Thompson2020) and self-efficacy beliefs (Thompson et al., Reference Thompson, Aizawa, Curle and Rose2022). These questionnaire-based studies not only provide valid and reliable measurements for EMI scholars and language teaching practitioners but also depict a descriptive and generalizable picture of a large population. Correlation and causal relationships of different learner/teacher variables can also be examined (e.g., Soruç et al., Reference Soruç, Pawlak, Yuksel and Horzum2022).

While the significance of questionnaire-based EMI research is acknowledged (Curle & Derakhshan, Reference Curle, Derakhshan, Pun and Curle2021), its effectiveness could be optimized by strictly following established guidelines in questionnaire development and employment (e.g., Dörnyei, Reference Dörnyei2007). This process should ideally include the creation of an item pool, selection of appropriate question items, piloting of these items, and conducting item analysis before questionnaire validation. Administering the questionnaire to different student populations and performing inferential statistical analyses on the data (e.g., EFA, the reliability test) are also necessary. It should be reminded that EFA is typically suitable for validating questionnaires in their pilot stage, as demonstrated in the current study, while confirmatory factor analysis (CFA) can be further conducted when testifying the validity of a scale in the follow-up stage. In addition, various analytical methods can be adopted to examine questionnaire data and/or investigate the relations of different variables and components. These methods include but are not limited to parametric and non-parametric tests (e.g., t-test, ANOVA [analysis of variance], path analysis, structural equation modeling, and facet Rasch measurement), with researchers selecting the method that is most appropriate for their specific research questions.

The questionnaire survey design can be an integral part of the mixed methods research design in educational studies (Creswell & Clark, Reference Creswell and Clark2017). Three models of mixed methods are proposed. The first model (the explanatory design) is that questionnaire data are collected first, which inform what kinds of qualitative data are needed to supplement and explain the quantitative results. In Qiu and Fang’s (Reference Qiu and Fang2022) mixed methods research on EMI students’ perceptions of their native and nonnative English-speaking content subject teachers, the researchers first administered a questionnaire survey to 101 EMI students and 12 out of the 101 participants were interviewed on their perceptions of EMI teachers based on their input in the questionnaire survey. This is an example of the explanatory design where the interview findings reflected the reasons underlying the participants’ questionnaire input.

In the second model (the exploratory design), for example, interview data with a small group of EMI students can be collected, which inform the development of larger-scale questionnaires on a related topic. Although this method is relatively rare in EMI research, Yuan and Qiu’s (Reference Yuan, Qiu, Sah and Fang2023) design of their questionnaire on EMI teachers’ language attitudes followed this pattern. Interviews with three EMI teachers were conducted, which served as an exploratory pilot study. The three teachers’ attitudes toward L1 use in EMI classrooms also informed the development of the questionnaire survey, and the researchers integrated some of their views into the survey to explore whether the views were generally accepted by other EMI teachers in China.

The questionnaire data and the qualitative data (e.g., interview) may also be collected simultaneously (Creswell, Reference Creswell2015). In Baker and Hüttner’s (Reference Baker and Hüttner2019) comparative study of EMI programs in Europe and Asia, the researchers surveyed 121 international students from different EMI programs in different countries (Thailand, Austria, United Kingdom) and interviewed four EMI lecturers and eight EMI students. The survey results and interview findings supplemented each other.

To conclude, this chapter discusses how questionnaire-based research can be designed in the context of EMI and how questionnaire data can be quantitatively analyzed to explore EMI students’ attitudes and motivation and to reveal the different EMI attitudes and motivation degrees of students with different demographic information. It suggests that questionnaire-based studies are appropriate for EMI research, but their potential would be maximized if they could be supplemented by qualitative data.

9 Effects of EMI on Learners’ Linguistic Development A Corpus-Based Analysis

Introduction

It is widely assumed by English Medium Instruction (EMI) researchers that EMI is beneficial to learners’ language development (e.g., Kong & Wei, Reference Kong and Wei2019; Macaro et al., Reference Macaro, Curle, Pun, An and Dearden2018; Pun & Curle, Reference Pun, Curle, Pun and Curle2022; Yang, Reference Yang2014). The effectiveness of EMI on language learning has been a classic and extensively discussed topic, with various methods used to address it (see Lo & Lo, Reference Lo and Lo2014, for a research synthesis). Most studies to date have used questionnaire or language tests to measure learners’ learning outcomes and ascertain the effectiveness of EMI. Recently, a new research method has emerged in Content and Language Integrated Learning (CLIL) research to investigate the effects of CLIL on students’ language learning outcomes, namely corpus-based analysis. Compared to traditional methods, corpus-based analysis can unveil learners’ longitudinal linguistic development as instruction progresses, providing more direct and convincing evidence on the benefits of CLIL (Hafner & Wang, Reference Hafner and Wang2018; Lahuerta Martínez, Reference Lahuerta Martínez2017). For example, Bulté and Housen (Reference Bulté and Housen2019) investigated pupils’ linguistic development in a CLIL program over a period of nineteen months. Their study shows that learners’ written productions exhibit noticeable growth in syntactic complexity, such as longer production units and the use of more subordinated structures after the CLIL education. This research paradigm may also have great potential in EMI research, offering researchers an alternative perspective to explore the benefits of EMI. As such, this chapter aims to introduce the application of this quantitative research method (i.e., corpus-based analysis) in EMI research with three major tasks. First, it will summarize relevant literature exploring the effects of EMI on English learning in the second section. The third section will focus on how to use corpus-based analysis to conduct relevant studies, including corpus construction, linguistic analysis instruments, and statistical analyses. Lastly, the fourth section will present one example study that demonstrates the use of corpus-based analysis in EMI research. The example study aims to use corpus-based analysis to investigate learners’ longitudinal development of phraseological competence in an EMI course in the Chinese context.

Previous Studies on the Effects of EMI on Learners’ Linguistic Development

As pointed out by Pun and Curle (Reference Pun, Curle, Pun and Curle2022), the effect of EMI education on language learning is a pivotal research area within the field of EMI. Many researchers contend that due to the constant exposure to authentic language materials, learners’ language proficiency could be enhanced through EMI education (e.g., Aizawa, et al., Reference Aizawa, Rose, Thompson and Curle2020; Bolton & Botha, Reference Bolton and Botha2015; Jiang et al., Reference Jiang, Zhang and May2016; Marsh et al., Reference Marsh, Hau and Kong2000). Lo and Lo (Reference Lo and Lo2014) used several theories to explain the positive impact of EMI on language learning. To be more specific, according to the comprehensive input hypothesis, the provision of high-quality input is conducive to language learning. Meanwhile, many psychological theories assert that learners’ affective factors play a critical role in language learning, such as a higher level of language motivation. They further summarized that EMI can expose learners to substantial language input and enhance their language learning motivation, leading to successful learning outcomes. Nonetheless, the relevant discussion on the effectiveness of EMI education is overall “long on claims and short on empirical research” (Kong & Wei, Reference Kong and Wei2019, p. 44). Most existing empirical studies mainly use experimental designs to investigate the differences in the language outcomes between learners in EMI programs and those in non-EMI programs. In other words, the benefits of EMI programs are mainly established by comparing the achievements of learners in EMI and non-EMI programs (Lei & Hu, Reference Lei and Hu2014). For example, Lin and Morrison (Reference Lin and Morrison2010) compared the academic vocabulary knowledge of Hong Kong EMI and non-EMI learners. Participants’ lexical knowledge is measured by two aspects: receptive word knowledge and productive word knowledge. Their results indicated that learners from EMI programs outperformed those from Chinese Medium Instruction programs in both the vocabulary tests. Thus, they concluded that EMI is more beneficial to language learning. However, studies of this type are limited in the following two aspects. In the first place, the longitudinal development of learners’ language competence has been neglected. As a result, the relevant studies fail to inform us of the extent to which and how learners’ language competence has developed longitudinally in the EMI program. In the second place, language proficiency is normally measured by the scores of language tests in previous studies (e.g., Lei & Hu, Reference Lei and Hu2014; Lo & Murphy, Reference Lo and Murphy2010; Yuksel et al., Reference Yuksel, Soruç, Altay and Curle2023). Compared with measures of linguistic complexity in learners’ productions, test scores are more large-grained measures of language proficiency, which may not demonstrate the nuanced picture of learners’ language competence (Bulté & Housen, Reference Bulté and Housen2019). Therefore, more empirical data of learners’ longitudinal linguistic development are needed to better demonstrate the benefits of EMI education.

Recently, there emerged a new research trend in CLIL research, that is, using corpus-based techniques to analyze the longitudinal changes in linguistic complexity measures in learners’ written and spoken productions so as to demonstrate CLIL benefits. Linguistic complexity refers to the absolute sophistication of linguistic features in learners’ productions. Relevant measures can quantify the number of linguistic constructions in their productions, thus gauging their linguistic competence (Bulté & Housen, Reference Bulté and Housen2014; Lu, Reference Lu2011). Additionally, comparing complexity scores of students’ productions over different periods can measure the development of their linguistic competences with the progression of CLIL. Along similar veins, Jablonkai (Reference Jablonkai, Pun and Curle2021, p. 102) remarks that the analysis of lexical and syntactic complexity measures may unveil learners’ “gains in terms of discipline-specific language competence.” Table 9.1 presents the details of the relevant studies, including the context and participants of the study, linguistic complexity measures, statistical analyses, and major findings. The main objective of research of this line is to track CLIL learners’ longitudinal linguistic development, ultimately providing empirical evidence on CLIL benefits. Since EMI and CLIL share a similar pedagogical goal and an overall design (Wannagat, Reference Wannagat2008), it is assumed that the corpus research paradigm holds potential in research on EMI benefits. Therefore, as pointed out earlier, the subsequent section aims to elaborate on the application of corpus-based analysis in EMI research.

Table 9.1 Details of corpus-based studies on CLIL benefits

Studies	Context and participants	Linguistic complexity measures	Statistical analysis	Major findings
Bulté and Housen (Reference Bulté and Housen2019)	Secondary school CLIL programs in the Netherlands; five participants (aged 11 to 13) with Dutch as their first language (L1)	Three syntactic complexity measures, one lexical complexity measure, and one morphological complexity measure	Mixed effects or multilevel models	CLIL learners’ productions exhibit significant increases in syntactic, lexical, and morphological complexity measures
Gené-Gil et al. (Reference Gené-Gil, Juan-Garau and Salazar-Noguera2015)	High school CLIL programs in Spain; thirty participants (aged 13 to 16) with Spanish or Basque as their L1	Two syntactic complexity measures and one lexical complexity measure	Analysis of variance (ANOVA)	CLIL leaners’ written emails show significant increases in the scores of both syntactic and lexical complexity measures
Lázaro and García Mayo (Reference Lázaro and García Mayo2012)	Secondary school CLIL programs in Spain; fifteen participants (aged 13) with Spanish or Basque as their L1	Three morphosyntactic complexity measures	Not available	Learners’ morpho-syntactic knowledge increases significantly between the two data collection times
Roquet and Pérez-Vidal (Reference Roquet and Pérez-Vidal2017)	Secondary school CLIL programs in Spain; fifty participants (aged 13) with Catalan or Spanish as their L1	One lexical complexity measure and one syntactic complexity measure	Descriptive statistics	Students’ compositions progress in terms of syntactic and lexical complexity within this CLIL program

Furthermore, although several corpus-based studies have been conducted to examine the impact of CLIL on language learning, these studies are subject to three limitations. Firstly, the existing studies are mainly concerned with the longitudinal development of learners’ syntactic and lexical competence. However, researchers contend that linguistic complexity is a multidimensional construct that encompasses at least four components: syntactic, lexical, morphological, and phraseological complexity (Bulté & Housen, Reference Bulté and Housen2019; Lu, Reference Lu2012; Paquot, Reference Paquot2018). In other words, previous studies have neglected other linguistic complexity components, such as phraseological complexity, in addition to syntactic and lexical complexity. A more comprehensive understanding of learners’ longitudinal development of linguistic competence in the immersion programs need further exploration. Secondly, most of these studies have small sample sizes, such as five participants in Bulté and Housen (Reference Bulté and Housen2019) and fifteen in Lázaro and García Mayo (Reference Lázaro and García Mayo2012). Lastly, as for the results, virtually all the studies demonstrate that CLIL has a positive impact on learners’ linguistic development. That is, the complexity measures in learners’ productions show increasing tendencies as the instruction proceeds (see Table 9.1). However, the corresponding reasons for the growth of learners’ linguistic knowledge in CLIL programs remain underexplored. The existing studies are mostly descriptive, and few explanatory studies have been conducted to investigate the relevant factors accounting for CLIL benefits.

To summarize, more attention should be paid to the longitudinal development of leaners’ other linguistic competence, such as phraseological competence in language immersion programs. Data of larger sample sizes may also render the results more representative and generalizable. These limitations of CLIL research should also be avoided in the research of EMI benefits. Accordingly, we have carried out a study to investigate eighty-nine students’ longitudinal development of phraseological complexity measures during a semester-long EMI course in the Chinese context. Linguistic features of learners’ language input are also examined to show the possible effects of textbook input on learners’ linguistic gains in EMI education. This study will be reported in detail in the section titled “The Study.” The introduction of this study will hopefully demonstrate the potential of corpus-based analysis in understanding the extent and manner in which EMI benefits learners’ language development. Before the example study is presented, the following section will provide an overview of the methodology employed in relevant studies. This will offer target readers an understanding of how to conduct corpus-based analyses in this area.

Methodology of Corpus-Based Studies on EMI Effectiveness

This section is dedicated to the methodology of studies utilizing corpus-based analysis to investigate the effectiveness of EMI. It commences with a discussion of data collection, followed by an overview of the measures and instruments. Finally, it concludes with the statistical analyses used to reveal differences in linguistic complexity measures and demonstrate the longitudinal gains in linguistic competence. The methodology details outlined in this section may assist researchers and students in designing their own studies.

Data Collection

Corpus is a “collection of machine-readable authentic texts (including transcripts of spoken data) which is sampled to be representative of a particular language or language variety” (Gilquin, Reference Gilquin, Granger, Gilquin and Meunier2015, p. 9). To our knowledge, there are relatively few EMI learner corpora in existence despite the availability of numerous corpora for general and special purposes. Exceptions include Hong Kong Learner Corpus of Legal Academic Writing in English (Hafner & Wang, Reference Hafner and Wang2018) and Corpus of Chinese Academic Written and Spoken English (Chen et al., Reference Chen, Harrison, Weekly, Parviainen, Kaunisto and Pahta2019). In light of the limited number of EMI corpora, researchers may need to construct their own corpora depending on the goals of their research. As per the aforementioned definition of corpora, the written compositions or spoken productions by learners from EMI programs or courses could be constructed as a small-scale learner corpus. The size of a corpus is not strictly restricted and can be determined based on research aims and data access. When EMI researchers are constructing their corpora, they need to consider the following three major points. Firstly, as recognized by learner corpus researchers (Gilquin, Reference Gilquin, Granger, Gilquin and Meunier2015; Lu, Reference Lu2011), the metadata of corpora are extremely important in later research, such as learners’ language proficiency, their age, their gender, writing conditions (timed writing or untimed writing), and the genre of the task, to name a few. Thus, the metadata in relation to learners and writing tasks must be recorded. Secondly, there are two types of corpus design: cross-sectional and longitudinal. Cross-sectional corpora are data collected “from one period in time,” which is “a snapshot of learners’ knowledge of the target language at a particular moment” (Gilquin, Reference Gilquin, Granger, Gilquin and Meunier2015, p. 14). In comparison, longitudinal corpora are data collected “from several periods,” which are “a representation of the evolution of their knowledge through time” (Gilquin, Reference Gilquin, Granger, Gilquin and Meunier2015, p. 14). Thus, the longitudinal corpora may be more suitable for tracking the longitudinal linguistic development of EMI students. Lastly, as noted by substantial researchers (Bi, Reference Bi2020; Yoon & Polio, Reference Yoon and Polio2017), text type would influence linguistic features of learners’ written and spoken productions. Therefore, with regard to the longitudinal data, at each data collection point, the genre of the written prompts should be identical. Otherwise, the change of linguistic complexity measures will be blurred, in that the change may also be caused by the inconsistencies of text genres along different data collection points.

Measures and Instruments

Linguistic complexity is normally measured in four categories: syntax (syntactic complexity), lexis (lexical complexity), morphology (morphological complexity), and phraseology (phraseological complexity). Each linguistic complexity category targets a specific linguistic competence – that is, syntactic complexity measures learners’ knowledge of syntactic constructions; lexical complexity captures learners’ competence of single words; morphological complexity shows learners’ morphological knowledge; and phraseological complexity evaluates learners’ competence of multiword sequences. While syntactic and lexical complexities have been extensively studied, morphological and phraseological complexities have received less attention (Bulté & Housen, Reference Bulté and Housen2019; Paquot, Reference Paquot2018). Furthermore, within each complexity category, there are also different dimensions targeting the number of particular constructions and their corresponding sophistication. For example, syntactic complexity is usually conceptualized as four dimensions: length of production units, number of subordinate structures, number of coordinate structures, and degree of phrasal elaboration (Bulté & Housen, Reference Bulté and Housen2014; Lu, Reference Lu2011). However, researchers conceptualize each linguistic complexity category, particularly syntactic complexity, in various ways. As a result, some measures are used in only one or two studies. The inconsistency and unsustainability of complexity measures would make the results of previous studies incomparable. Thus, EMI researchers should employ widely used linguistic complexity measures to quantify students’ linguistic competence.

With the fast development of natural language processing (NLP) techniques and their growing application in corpora studies, a number of NLP tools have been invented by researchers to automatically calculate the scores of linguistic complexity measures. Some well-known automatic tools include the Second Language Syntactic Complexity Analyzer (L2SCA) (Lu, Reference Lu2010), Coh-Metrix (McNamara et al., Reference McNamara, Graesser, McCarthy and Cai2014), Lexical Complexity Analyzer (LCA) (Lu, Reference Lu2012), and the Tool for the Automatic Analysis of Lexical Sophistication (TAALES) (Kyle et al., Reference Kyle, Crossley and Berger2018). These tools simplify the coding of linguistic features in learners’ productions. Table 9.2 provides an overview of commonly used linguistic complexity measures and the corresponding instruments for analyzing them. It is important to note that, in addition to complexity, accuracy and fluency are also significant aspects for measuring learners’ linguistic development. However, due to the scope of this chapter, a detailed introduction of accuracy and fluency measures is not included.

Table 9.2 Overview of linguistic complexity measures and corresponding instruments

Linguistic complexity category	Dimensions	Measures	Instrument	Representative studies
Syntactic complexity	Length of production units	Mean length of clause	L2SCA	Lu (Reference Lu2011, Reference Lu2017)
		Mean length of T-unit
		Mean length of sentence
	Amount of subordination	Clauses per T-unit
	Amount of subordination	Dependent clauses per clause
	Amount of coordination	Coordinate phrases per clause
	Amount of coordination	T-units per sentence
	Degree of phrasal elaboration	Complex nominals per clause
	Degree of phrasal elaboration	Verb phrases per T-unit
Lexical complexity	Lexical density	Lexical density	LCA	Lu (Reference Lu2012)
	Lexical variety	Type-token ratio (TTR)
		Corrected TTR
		D measure
	Lexical sophistication	Lexical sophistication-1
	Lexical sophistication	Lexical sophistication-2
Morphological complexity	–	Morphological complexity index	–	Brezina and Pallotti (Reference Brezina and Pallotti2016)
Phraseological complexity	Phraseological sophistication	Mutual information score of phraseological units	Dependency parsing tools and in-house Python scripts	Jiang et al. (Reference Jiang, Bi, Xie and Liu2021); Paquot (Reference Paquot2018)
Phraseological complexity	Phraseological diversity	TTR of phraseological units	Dependency parsing tools and in-house Python scripts

Note: There are fourteen measures in L2SCA and twenty-six measures in LCA. Many of the measures in both L2SCA and LCA are redundant, as they assess the same construct. For instance, four measures in L2SCA assess the level of subordination. Thus, in order to save space, we have chosen to present some representative measures in case of redundancy.

Statistical Analysis

Both descriptive and inferential statistics have been utilized to illuminate on the differences in linguistic complexity measures along different data collection times. Descriptive statistics could overview the change patterns or trends of linguistic complexity measures with the proceeding of the EMI program. Inferential statistics may further demonstrate whether the differences have reached statistically significant levels. The specific statistical tests are determined by the number of data collection points. Normally, paired samples t-tests are applied to test the differences between two data collection points while repeated measures ANOVA are needed to determine the differences among three or more data collection points. In some studies, effect sizes are also presented to show the strength of the longitudinal progress (e.g., Van Mensel et al., Reference Van Mensel, Bulon, Hendrikx, Meunier and Van Goethem2020). It should be noted that in longitudinal research design, the data at different collection points are from the same group of learners, which violates the assumption of independence of scores. Therefore, paired samples t-tests rather than independent samples t-tests should be employed. It is a similar case for ANOVA. That is, repeated measures ANOVA rather than one-way ANOVA should be used.

Apart from these traditional statistics, recently, some more advanced statistical tests have also been used by learner corpus researchers to investigate individuals’ linguistic development process under the guidance of complexity theories. These studies are more concerned with the dynamic and nonlinear linguistic development trajectories. In order to achieve that goal, advanced statistical techniques such as min-max graphs and Monte Carlo Simulations have been applied to delve into the dynamic change patterns of linguistic complexity measures along different data collection points (Spoelman & Verspoor, Reference Spoelman and Verspoor2010). To date, it seems that such statistical tests have not been used by EMI researchers.

The Study

This section aims to introduce the example study for two reasons. Firstly, presenting this study will provide readers with a more profound comprehension of the potential of corpus-based analysis in EMI research. Secondly, since this study has addressed the limitations of previous research, it may offer theoretical implications for relevant studies. The example study intends to investigate eighty-nine students’ longitudinal development of phraseological competence in a four-month-long EMI course in an Eastern Chinese university. The corresponding reason for EMI benefits is also examined in this study. More specifically, this study chooses to elaborate on one important variable, that is, learners’ input to show the extent to which textbook input would influence learners’ linguistic competence in an EMI context. The following are the detailed research questions of this study:

1. How does learners’ phraseological competence change with the progression of EMI?
2. To what extent would textbook input influence learners’ linguistic performance?

The data of this study consist of students’ written responses to several short answer questions in an EMI course, whose name is English Novel Reading, along three data collection times. Ninety-five students participated in this study but eighty-nine of them completed all three assignments. Apart from taking the compulsory English as a Foreign Language classes, participants chose to enroll in this EMI course in order to boost their English proficiency and to gain more knowledge of British and American literary works. The eighty-nine students come from different science and humanities majors. Seventy-four percent of them are second-year university undergraduate students while the remaining 36 percent are first-year or third-year undergraduates. Based on the test scores of the College English Test (Band-4), they are of intermediate language proficiency. As can be seen from Table 9.3, our learner data comprise the written productions of these eighty-nine students, which amount to 32,417 words. To address the second research question, we also built a textbook corpus, which consists of the excerpts of novels for learners to read at each data collection point. These excerpts of original novels are used by instructors to teach, and students’ written productions are related to the excerpts.

Table 9.3 Details of the EMI learner corpus

	Date	Number of words	Corresponding textbook input
Data collection time 1	March 2021	8,991	An excerpt from Robinson Crusoe
Data collection time 2	April 2021	10,203	An excerpt from Gulliver’s Travels
Data collection time 3	May 2021	13,223	An excerpt from Tess of the D’Urbervilles

This study centers on learners’ phraseological competence for the following three reasons. In the first place, as argued earlier, phraseological complexity is understudied in EMI literature. In the second place, the results of previous studies indicate that compared with syntactic and lexical complexity measures, phraseological complexity measures are more correlated with learners’ language competence for intermediate-level and advanced-level learners (Paquot, Reference Paquot2018). In the third place, some studies suggest that the appropriate and idiomatic use of phraseological units such as collocations and n-grams pose challenges for second language (L2) learners (Jiang et al., Reference Jiang, Bi, Xie and Liu2021). In conclusion, investigations of the development of phraseological competence may show a more comprehensive picture of EMI’s effects on learners’ linguistic development. Since there are various types of phraseological units (collocations, bigrams, formulaic sequences, etc.), it is not feasible to include all of them in a single study. Therefore, this study focuses on the widely researched phraseological unit, namely n-grams (i.e., bigrams and trigrams). Bigrams refer to the contiguous sequences of two words and trigrams refer to contiguous units containing three words. Learners’ phraseological competence is conceptualized by the sophistication of bigrams and trigrams. To quantify this sophistication, this study adopts two approaches suggested by Paquot (Reference Paquot2018): association strength-based measures (such as MI-score and T-score) and the proportion of bigrams/trigrams in a reference corpus. Association strength-based measures aim to quantify the likelihood for the two individual words of a bigram to appear together in a reference corpus (Jiang et al., Reference Jiang, Bi, Xie and Liu2021; Paquot, Reference Paquot2018). The higher the scores, the more complex the n-grams. Therefore, regardless of the operationalization of phraseological complexity, a reference corpus is necessary to calculate the scores of phraseological complexity measures. The Corpus of Contemporary American English (COCA) has been frequently used as a reference corpus due to its large size and representativeness of contemporary English. Thus, COCA is used as a reference corpus in this study as well to compute the scores of phraseological complexity measures. Eight specific phraseological complexity measures are adopted in this study: bi_MI, bi_T, tri_MI, tri_T, bi_prop_30k, bi_prop_60k, tri_prop_30k, and tri_prop_60k (see Table 9.4). The instrument used to analyze these phraseological complexity measures is TAALES (Kyle et al., Reference Kyle, Crossley and Berger2018). One thing to note is that TAALES calculates the scores of phraseological complexity measures in reference to the different subcorpora of COCA, such as the COCA magazine corpus, the COCA news corpus, and the COCA academic corpus. Since our study is conducted in the context of an academic EMI course, we selected only the measures calculated in reference to the COCA academic corpus. Scores of the eight phraseological complexity measures in learners’ written productions along the three collection times will be compared by the use of repeated measures ANOVA to demonstrate the development of learners’ phraseological competence.

Table 9.4 Eight phraseological complexity measures used in this study

Dimension	Codes of measures	Description
Association strength	bi_MI	Mutual information of bigrams
	bi_T	T-score of bigrams
	tri_MI	Mutual information of trigrams
	tri_T	T-score of trigrams
Proportion of native-like n-grams	bi_prop_30k	Proportion of the most frequent 30,000 bigrams in the COCA academic corpus
	bi_prop_60k	Proportion of the most frequent 60,000 bigrams in the COCA academic corpus
	tri_prop_30k	Proportion of the most frequent 30,000 trigrams in the COCA academic corpus
	tri_prop_60k	Proportion of the most frequent 60,000 trigrams in the COCA academic corpus

Table 9.5 presents the descriptive statistics of the eight phraseological complexity measures in learners’ written productions along the three data collection times. The development trends of the eight measures through the three time points are presented in Figure 9.1. As is clear from Table 9.5 and Figure 9.1, four measures show clear ascending tendencies with the progression of EMI, that is, bi_MI, bi_T, tri_MI, and tri_prop_60k. There are no obvious development patterns observed for the following two measures: tri_T and bi_prop_30k. Surprisingly, two measures, that is, bi_prop_60k and tri_prop_30k, show a striking descending tendency as time proceeds. Several repeated measures ANOVA tests were also conducted to ascertain whether time or EMI education exerts significant effects on the eight phraseological complexity measures. The results indicate that there exist significant differences in six measures along the three data collection times: bi_MI (F = 13.571, p = 0.000^*), bi_T (F = 10.823, p = 0.000^*), tri_MI (F = 11.916, p = 0.000^*), bi_prop_60k (F = 8.985, p = 0.000^*), tri_prop_30k (F = 4.703, p = 0.01^*), and tri_prop_60k (F = 5.457, p = 0.005^*). No significant differences among the three times were observed for the other two measures: tri_T (F = 0.679, p = 0.412^*) and bi_prop_30k (F = 0.142, p = 0.868^*).

Table 9.5 Descriptive and inferential statistics of the eight phraseological complexity measures

Measures	Descriptive statistics			Inferential statistics
Measures	Time 1 Mean (SD)	Time 2 Mean (SD)	Time 3 Mean (SD)	F	p
bi_MI	1.435 (0.144)	1.452 (0.155)	1.533 (0.181)	13.571	0.000^*
bi_T	40.342 (9.293)	43.773 (10.805)	46.123 (9.944)	10.823	0.000^*
tri_MI	2.825 (0.395)	2.812 (0.499)	3.081 (0.585)	11.916	0.000^*
tri_T	16.042 (5.043)	17.033 (3.570)	16.584 (3.489)	0.679	0.412
bi_prop_30k	0.427 (0.069)	0.430 (0.049)	0.498 (0.070)	0.142	0.868
bi_prop_60k	0.498 (0.070)	0.487 (0.062)	0.463 (0.049)	8.985	0.000^*
tri_prop_30k	0.137 (0.041)	0.128 (0.046)	0.142 (0.045)	4.703	0.01^*
tri_prop_60k	0.137 (0.041)	0.150 (0.043)	0.172 (0.118)	5.457	0.005^*

Figure 9.1 Development trends of the eight phraseological complexity measures along three data collection times

In summary, the results presented here indicate that learners’ phraseological knowledge exhibits growth in EMI education, especially with regard to the association strength of bigrams and trigrams. Subsequently, we examined the distribution of high-frequency bigrams and trigrams in the textbooks and learners’ written productions. The similarities of phraseological units between the input and output enabled us to understand the effects of the input on EMI learners’ language use. Many high-frequency bigrams consist of two functional words such as of the, which may not be informative with regard to the input effects. Therefore, those bigrams that consist of two functional words were removed from analysis. Tables 9.6–9.8 list the most frequent twenty bigrams and trigrams in the textbook and learners’ written productions. It could be summarized from Tables 9.6–9.8 that leaners tend to use some bigrams and trigrams that appear in the textbook, showing that input would influence students’ language use. Also, as time proceeds, the effect is getting larger in that at data collection time 2 and time 3, more bigrams and trigrams in learners’ productions are the same as those of textbook input (see Tables 9.7 and 9.8). It is hypothesized that the influence of textbooks may allow learners to produce more complex phraseological units, leading to positive outcomes in their phraseological competence in English as a Foreign Language learning.

Table 9.6 Most frequent twenty bigrams and trigrams in the input and output at data collection time 1

Rank	Bigrams (learner data)	Bigrams (textbook data)	Trigrams (learner data)	Trigrams (textbook data)
1	The island	I found	If I were	Went to work
2	To survive	The ship	A knife to	A great deal
3	Desert island	A little	I were him	I came back
4	Twice as	On shore	Ink tree or	I was forced
5	A notch	A great	Of a square	When I had
6	Island and	The water	Able to build	Great deal of
7	Courage and	I went	Sides of a	I had brought
8	A knife	My raft	Three or four	If I were
9	Days and	I made	Ability to survive	The water I
10	The desert	The shore	Were him I	Sides of a
11	Notch twice	On board	Can live independently	The water and
12	Was able	To make	Some observation and	I had got
13	To build	Began to	Wine carpenter’s tool	And a little
14	Long every	The sea	Paper pen ink	A kind of
15	To make	Side of	Food domesticate poultry	To work with
16	Robinson can	The island	Live independently on	On the side
17	A desert	Was able	Carpenter’s tools guns	Was a great
18	Every seven	Side of	Wheat wine carpenter’s	The ship in
19	Seven days	The ground	Or four compasses	Man or beast
20	Food and	To build	Four compasses some	Was forced to

Note: Some n-grams are in bold to show that they are the same in the input and output.

Table 9.7 Most frequent twenty bigrams and trigrams in the input and output at data collection time 2

Rank	Bigrams (learner data)	Bigrams (textbook data)	Trigrams(learner data)	Trigrams (textbook data)
1	The king	My left	Back to Lilliput	My left hand
2	Of lilliput	The ground	A bow and	On each side
3	The emperor	My body	Bow and arrow	Half a mile
4	To lilliput	My face	To break the	Hundreds of the
5	My left	My head	Abandoned the ship	Of the habitants
6	Each other	The same	Low heel party	My left leg
7	My head	Want to	The water and	To my face
8	The high	Half a	In order to	From the ground
9	The whole	I felt	As soon as	The strings that
10	The small	I heard	The soldiers of	On my body
11	The water	The people	The low heel	Of the people
12	Divide into	I lay	Each other and	I heard a
13	Fire in	The small	The high heels	Two or three
14	Two countries	The ship	Was less than	Above a hundred
15	Want to	To get	Was a fire	On my left
16	The end	The strings	In the palace	The left side
17	High heels	The inhabitants	In the queen’s	Into the engine
18	King of	Each other	To my face	To sleep I
19	Six inches	Side of	The left side	Inches from the
20	Parties in	And made	To put out	To the right

Table 9.8 Most frequent twenty bigrams and trigrams in the input and output at data collection time 3

Rank	Bigrams (learner data)	Bigrams (textbook data)	Trigrams (learner data)	Trigrams (textbook data)
1	Was seduced	Her mother	She was seduced	Get up his
2	Her past	Your father	Was seduced by	Up his strength
3	Want to	His strength	Fell in love	End of the
4	In love	The whole	In love with	The end of
5	Wedding night	Door and	Born in a	As soon as
6	Seduced by	Want to	In a poor	The door and
7	A poor	Get up	Was born in	To the end
8	Fell in	The end	To her husband	To get up
9	Went to	The door	The end of	On to the
10	Her mother	The day	As a result	Half an hour
11	Her parents	The interior	To a rich	To the outhouse
12	The whole	Put on	Was arrested and	The spotted cow
13	The heroine	The outhouse	Night she confessed	Of the melody
14	A rich	The spotted	As soon as	Where is father
15	The day	Had passed	Confessed her past	All round there
16	Was born	Wished for	A rich old	And the song
17	Love with	So long	Arrested and hanged	They wished for
18	Put on	So many	Old women’s house	Upon the girl’s
19	Wished for	The village	At this time	Opened the door
20	This time	Alone with	To live with	If they wished

However, the analysis on input effects presented here is exploratory, and there are some limitations to this analysis. Firstly, the comparison includes only the most frequent twenty n-grams between the input and output, which may introduce bias and not fully capture the influences of the input on learners’ written productions. Secondly, the correspondence of phraseological units between the input and output is just one explicit way to demonstrate the effects of textbook input. Future studies may employ alternative methods to operationalize the textbook effects to triangulate the findings.

Concluding Remarks

This chapter outlines the application of corpus-based analysis in exploring the impact of EMI on learners’ linguistic development. It details the methodology of such analyses. To summarize, researchers need to consider the design of the corpus, the sustainability of measures, and the robustness of the statistical analyses. Additionally, an example study is presented to offer readers a clear understanding of the value of corpus-based analysis in EMI research. Future studies may compare the longitudinal linguistic gains of EMI learners with those of non-EMI learners to provide further empirical evidence of EMI benefits.

10 Path Analysis of Science Learning in Hong Kong’s EMI Secondary Schools

Introduction

Research in the field of language education has a long-standing history of utilizing structural equation modeling (SEM) to examine intricate relationships among variables related to language learners. It is suggested that models developed in SEM methodology “are usually, and should best be, based on existing or proposed theories that describe and explain phenomena under investigation” (Raykov & Marcoulides, Reference Raykov and Marcoulides2006, p. 6). As one of the SEM methods, path analysis intends to explain causal relations formulated within the model based on existing knowledge on what a researcher wants to study without considering latent variables (Collier, Reference Collier2020). Under the assumption that variables in a hypothesized model are linearly connected, path analysis can simultaneously analyze complex relationships between dependent variables and independent variables and reveal direct and indirect effects of an independent variable on a dependent variable (Jeon, Reference Jeon2015). Therefore, researchers widely use this technique for examining and developing theories of interest in their research domain.

Despite the enduring popularity of path analysis, there has been limited research in the context of English Medium Instruction (EMI) to illustrate established theories. Moreover, researchers have yet to incorporate statistical data to refine the theoretical models and better elucidate the causal relationships between various factors that potentially influence EMI students’ academic achievement. To fill this gap, this study aims to develop and analyze a well-fitted model that could account for contingent links between variables that directly and indirectly affect EMI students’ academic achievement in science. Drawing on survey data from eight EMI secondary schools in Hong Kong, the current study identified interplayed roles of students’ English proficiency, language use in the science classroom, self-perceived English difficulty in the science classroom, self-concept of science learning, and science achievement.

Theoretical Background on EMI

Learners’ English Proficiency and Academic Success

According to Cummins’s (Reference Cummins2000) threshold hypothesis, the advantageous aspects of bilingualism are unlikely to take effect until the learner has “attained a certain minimum or threshold level of competence in a second language” (p. 229). Consistent with Cummins’s hypothesis, some researchers have found that second language (L2) learners’ language proficiency plays an important role in their academic performance (e.g., Ardasheva, Reference Ardasheva2010; Johnson & Swain, Reference Johnson and Swain1997; Leaver & Stryker, Reference Leaver and Stryker1997; Mahon, Reference Mahon2006; Solórzano, Reference Solórzano2008).

In EMI, English language competence has often been found to be the main issue affecting learners’ content comprehension and classroom interactions (e.g., Kym & Kym, Reference Kym and Kym2014; Rose et al., Reference Rose, Curle, Aizawa and Thompson2019). Previous studies have revealed that Hong Kong secondary school students generally have insufficient English proficiency to handle the tasks assigned in EMI classrooms (Marsh et al., Reference Marsh, Hau and Kong2000; Yip et al., Reference Yip and Tsang2007). Similar problems have emerged in science subjects, in which teachers have to adjust their teaching methods due to students’ lack of English competence (Probyn, Reference Probyn2006, Reference Probyn2009). It is therefore reasonable to consider that the English proficiency of EMI students is an important factor in their science learning success.

Language Use in the EMI Classroom

English Medium Instruction classrooms can present various challenges in terms of classroom interaction, as highlighted by Lo and Macaro (Reference Lo and Macaro2012). In some instances, subject teachers in EMI classrooms have been observed to switch from English to their native language to enhance student engagement. While EMI promotes the use of English for teaching content subjects, additional language support may be necessary during interactions that are not progressing well, such as employing intermittent use of the first language (L1) (Xie & Curle, Reference Xie and Curle2020). Sibomana (Reference Sibomana2020) discovered that when EMI teachers switched from English to L1 in the classroom, more students tended to participate by raising their hands to answer questions.

Furthermore, Lin and Wu (Reference Lin and Wu2015) conducted a comprehensive analysis of classroom interactions in an EMI science classroom. The researchers observed that many science teachers utilized translanguaging practices by drawing upon their linguistic resources to engage with students in the classroom (Pun & Tai, Reference Pun and Tai2021; Tai & Li, Reference Tai and Li2020). To facilitate students’ comprehension of subject matter, teachers employed L1 as the everyday language for communication when teaching scientific knowledge, while using English for subject-specific language. Consequently, allowing for mixed language communication between teachers and students has been recognized as an effective approach for content learning in the EMI context (Wu & Lin, Reference Wu and Lin2019).

Impacts of Affective Variables on Learning

In previous studies, researchers have recognized the significant influence of learners’ psychological factors on their academic endeavors (Macaro, Reference Macaro2018; Marsh et al., Reference Marsh, Trautwein, Lüdtke, Köller and Baumert2005; Shen & Tam, Reference Shen and Tam2008). Within the realm of EMI research, there has been a notable focus on students’ beliefs, perceptions, and attitudes toward subject learning in schools (Macaro et al., Reference Macaro2018). Multiple studies (Hengsadeekul et al., Reference Hengsadeekul, Koul and Kaewkuekool2014; Marsh et al., Reference Marsh, Hau and Kong2000, Reference Marsh, Hau and Kong2002; Parker et al., Reference Parker, Marsh, Ciarrochi, Marshal and Abduljabbar2014; Roo et al., Reference Roo, Ardasheva, Newcomer and Magaña2018; Thompson et al., Reference Thompson, Aizawa, Curle and Rose2019; Yip & Tsang, Reference Yip and Tsang2007) have established strong correlations between learners’ perceptions and the quality of their learning experiences in school. The findings generally suggest that L2 learners with higher levels of English proficiency tend to exhibit greater enthusiasm for classroom participation and hold more positive perceptions of EMI, ultimately leading to better learning outcomes.

Regarding individual affect, the self-concept of L2 students plays a partial role in their ability to overcome challenges encountered in EMI settings and motivates them to employ various learning strategies (Lo & Lo, Reference Lo and Lo2014). Recent research has focused on the self-concept of EMI students, drawing on Marsh’s (1990) reciprocal effect model that posits a virtuous cycle between learners’ self-concept and academic achievement. Several studies have confirmed a strong association between students’ self-concept and their academic performance. Acknowledging the significance of affective variables in the existing literature, this study primarily considers two affective variables in constructing the research model: EMI students’ perception of English difficulty in the science classroom and their self-concept of science learning.

Taking into account five variables (EMI students’ English proficiency [EP], language for interaction in the science classroom [LISC], perceived English difficulty in the science classroom [PEDSC], self-concept of science learning [SCSL], and science achievement [SA]), this study initially establishes three hypothetical models aimed at explaining effective science learning in the EMI context. The researchers aim to address the following research questions:

1. Which model provides the best explanation for the structural relationships among the focal variables?
2. How can students’ science achievement be explained based on the chosen model?

Method

The data for this study were collected through a large-scale survey administered to senior secondary students attending EMI secondary schools in Hong Kong. The researchers carefully selected eight EMI schools that matched in terms of geographical location, average academic performance, curriculum design, and regional socioeconomic status. Prior to the recruitment process, the researchers sent invitation emails to the schools, requesting their participation in the study. All eight EMI secondary schools responded to the invitation and confirmed their willingness to cooperate.

In Hong Kong, secondary schools are categorized into three bands based on academic performance, with Band 1 representing above-average schools where students maintain high academic performance and are eligible to learn content subjects in English. In this study, all participating schools were from the Band 1 category, located in close proximity to each other, and shared similar demographic features.

Participants

Initially, 600 questionnaires were distributed to science students studying at the participating EMI schools. A total of 400 completed questionnaires were collected, resulting in a response rate of 66.7 percent. To ensure statistical power and minimize the impact of missing values, survey forms with missing responses for any of the focal variables were excluded from the analysis. Therefore, the response data from 356 students were used for the final analysis, with 152 males, 203 females, and 1 missing gender value. Among the respondents, 162 students were from Grade 10 (45.5 percent) and 194 students were from Grade 11 (54.5 percent).

Measurement Scale

The survey instrument included items adapted from existing measurement tools commonly used in EMI research. The questionnaire covered various indicators that could potentially influence science achievement, such as English proficiency, self-concept (Yip et al., Reference Yip and Tsang2007), classroom language use (Evans, Reference Evans2000), and language difficulties encountered in EMI classrooms (Tatzl, Reference Tatzl2011). All constructs were measured using a 5-point Likert scale. The survey items were phrased in plain language to enhance students’ comprehension.

To establish the validity of the questionnaire, five experienced science teachers from the participating schools were invited to review, provide feedback, and evaluate all survey items in terms of face validity. Following several rounds of discussion, the content of the items was finalized. The content validity index was calculated to be 0.86, indicating an acceptable score. A pilot test was then conducted with 116 Grade 10 students from one of the participating schools. These students were not included in the main study. The results of the pilot test indicated that participants had no difficulties comprehending the survey items. However, during the survey administration, some students reported that they could not recall the exact scores from their previous science and English exams. To address this issue and ensure data accuracy, teachers were asked to double-check the grades reported by the students. In cases where inconsistencies between student and teacher reports arose, the information provided by the teachers was primarily relied upon. The reliability of the survey items was assessed by calculating the internal consistency of the focal variables (see Table 10.1).Footnote ¹

Table 10.1 Research variables

Variable(Cronbach’s alpha)	No. of items	Cronbach’s alpha
SA	Participants were asked to report their examination outcomes from the previous year in biology, chemistry, and physics subjects.	–
EP	Participants were asked to report their English grades on the previous year’s school examination, providing separate answers for speaking, reading, writing, and listening.	–
LISC	13	0.92
SCSL	7	0.78
PEDSC	17	0.93

Data Analysis

This study employed a three-phase data analysis approach. Firstly, a correlation analysis was conducted to examine the relationships between the variables LISC, EP, SA, PEDSC, and SCSL. This analysis was performed using Statistical Package for the Social Sciences (SPSS) 21.0 software, providing a foundation for model development and aligning with existing theories.

Based on the results of the correlation analysis and existing theories, three hypothesized models were constructed to capture the potential relationships between LISC, EP, PEDSC, and SCSL, and their impact on SA (refer to Figure 10.1). Model fit indices were then employed to compare and select the best-fitting model among the three alternatives.

Figure 10.1 Hypothesized models

Note: *English proficiency→EP; MOI (medium of instruction) interaction→LISC; English difficulty→PEDSC; self-concept→SCSL; science achievement→SA.

In the third phase, a path analysis using SEM was conducted for the selected model. Analysis of Moment Structures (AMOS) 21.0 software was utilized for this analysis. The path analysis allowed for an examination of the total, direct, and indirect effects of the independent variables on learners’ science achievement. The bootstrapping technique was employed to identify these effects accurately.

By employing these three phases of data analysis, this study aimed to explore the relationships among the variables and determine the factors that significantly influence students’ science achievement. The combination of correlation analysis, model development, and path analysis using SEM provides a robust approach to investigating these relationships and uncovering the underlying mechanisms.

Results

Correlations between Variables

Table 10.2 presents the bivariate correlations among the focal five variables. The data analysis revealed several significant findings. Firstly, there was a significant positive correlation between LISC and EP (r = 0.116, p < 0.05). This indicates that higher levels of language for interaction in the science classroom were associated with higher levels of English proficiency.

Table 10.2 Correlation matrix

	LISC	EP	SA	PEDSC	SCSL
LISC	1
EP	0.116^*	1
SA	0.025	0.576^**	1
PEDSC	−0.099	−0.262^**	−0.081	1
SCSL	−0.017	0.128^*	0.319^**	−0.085	1

Note: *p < 0.05, **p < 0.01.

Secondly, participants’ SA demonstrated significant correlations with EP (r = 0.576, p < 0.01) and SCSL (r = 0.319, p < 0.01). These findings suggest that higher levels of English proficiency and self-concept on science learning were positively related to science achievement. Lastly, PEDSC showed a statistically significant negative correlation with EP (r = −0.262, p < 0.01). The correlations presented in Table 10.2 provide insights into the relationships among the focal variables and highlight the factors that may influence students’ science achievement.

Effect Strengths in the Path Diagram

Based on existing theories on EMI and the results of the correlation analysis, three hypothesized models were developed. Figure 10.1 illustrates these models:

Model 1: In this model, students’ SA is directly and indirectly influenced by the other variables. Both LISC and SCSL mediate the effects of EP and PEDSC on students’ science achievement.
Model 2: This model considers both affective variables (SCSL and PEDSC) as mediating variables that directly impact students’ SA. In this model, EP and LISC have direct and indirect effects on students’ SA, but LISC does not have a direct effect on PEDSC.
Model 3: Building upon Model 2, Model 3 includes an additional direct effect of LISC on PEDSC. This addition is based on the possibility that students’ LISC, which is significantly correlated with EP, may have an influence on PEDSC as it reflects the amount of English exposure in the science classroom.

These three models were developed to explore and understand the potential relationships between the focal variables and their effects on students’ science achievement. The models consider different pathways and mediating variables to examine the complex interactions among EP, LISC, PEDSC, SCSL, and SA in the context of EMI. Further analysis and evaluation of these models will help determine which model best explains the structural relationships between the variables and how students’ science achievement can be explained based on the chosen model.

Based on the model fit indices presented in Table 10.3, it was found that among the three hypothesized models, Model 3 provided the most appropriate fit in capturing the causal connections between the variables. The fit indices for Model 3 indicated a good fit: χ²(1) = 1.132, p > 0.05, Root Mean Square Error of Approximation (RMSEA) = 0.019, Tucker-Lewis Index (TLI) = 0.994, Adjusted Goodness of Fit Index (AGFI) = 0.981, and Comparative Fit Index (CFI) = 0.999.

Table 10.3 Fit indices for the research model

	χ²(p)	df	RMSEA	TLI	AGFI	CFI
Model 1	1.749 (.186)	1	0.045	0.965	0.972	0.996
Model 2	4.147 (.126)	2	0.054	0.949	0.967	0.990
Model 3	1.132 (.287)	1	0.019	0.994	0.981	0.999

χ² measures the goodness of fit based on the statistical significance of the data. However, it is important to note that χ² is sensitive to sample size. Therefore, it is recommended to consider other fit indices when analyzing SEM results. In this study, fit indices such as RMSEA, TLI, AGFI, and CFI were utilized, as they are relatively stable across different sample sizes. A desirable fit is typically indicated by an RMSEA value of 0.05 or less. Values greater than 0.95 for TLI, AGFI, and CFI suggest a reasonably good fit (see Blunch, Reference Blunch2013; Kline, Reference Kline2005). In the case of Model 3, the fit indices met these criteria, indicating that the model provides a satisfactory fit to the data. These findings support the use of Model 3 to further analyze the relationships between the variables and understand students’ science achievement in the context of EMI.

The results of the path analysis revealed several significant findings (Table 10.4). Firstly, participants’ EP and their self-perceptions, including SCSL and PEDSC, had statistically significant positive effects on their SA. The standardized coefficients were as follows: βEP → SA = 0.57 (p < 0.001), βSCSL→SA = 0.25 (p < 0.05), and βPEDSC→SA = 0.09 (p < 0.001). These findings suggest that students with higher English proficiency and more positive self-concept, and more challenges in using English in the science classroom, tend to have better science achievement.

Table 10.4 Maximum likelihood parameter estimates for the research model

Model path	Unstandardized coefficient (SE)	Standardized coefficient
LISC→PEDSC	−0.078 (.057)	−0.070
LISC→SCSL	−0.030 (.049)	0.033
LISC→SA	−0.042 (.063)	−0.028
SCSL→SA	0.415 (.068)***	0.253
PEDSC→SA	0.119 (.058)*	0.087
EP → PEDSC	−0.180 (.037)***	−0.254
EP → SCSL	0.078 (.031)*	0.132
EP → SA	0.553 (.042)***	0.569

Note: ^*p < 0.05, ^***p < 0.001.

Furthermore, when examining the effect of participants’ EP, it was found that it had a statistically significant positive path coefficient for SCSL (βEP → SCSL = 0.13) and a significantly negative influence on PEDSC (βEP → PEDSC = −0.25). This means that higher English proficiency was associated with more positive self-concept in the science classroom and lower perceived English difficulty. Overall, these results highlight the importance of English proficiency and self-perceptions in shaping students’ science achievement in the EMI context (see Figure 10.2).

Figure 10.2 Results of path analysis (standardized coefficients)

Note: *English proficiency→EP; MOI interaction→LISC; English difficulty→PEDSC; self-concept→SCSL; science achievement→SA.

Table 10.5 shows the decomposition of the standardized effects of research Model 3. The standardized total influence of each variable on participants’ science achievement was measured by evaluating its indirect and direct effects. For instance, the total effect of participants’ English proficiency on science achievement was calculated as follows: direct effect (0.569) + indirect effect mediated by students’ self-perceptions (0.011) = 0.58. The results showed that participants’ English competence was the factor with the strongest effect on science learning outcomes in EMI secondary schools. Students’ self-concept and their perceived level of English difficulty in the science classroom also positively affected their achievement in science subjects.

Table 10.5 Decomposition of standardized effects for the model

		LISC	SCSL	PEDSC	EP
SA	Direct effect	−0.028	0.253	0.087	0.569
	Indirect effect	−0.014	−	−	0.011
	Total effect	−0.042	0.253	0.087	0.580

Note: Numbers in bold indicate statistical significance.

Additionally, participants’ SCSL had a positive direct effect on SA with a value of 0.253. This suggests that positive self-concepts in science learning contribute to better science achievement. Moreover, participants’ PEDSC also showed a positive direct effect on SA, although its magnitude was relatively smaller (0.087). This indicates that students’ perceived language difficulty in understanding English in the science classroom contributes to better science achievement.

Discussion

The findings from path analysis, based on statistical data from eight EMI secondary schools in Hong Kong, provide insights into the relationships between English proficiency, language for interaction in the science classroom, self-perceived English difficulties, science learning self-concept, and academic achievement in science for secondary students.

At the individual level, English proficiency emerged as the strongest predictor of students’ science achievement. This finding aligns with previous research in the EMI context, highlighting the positive association between English competence and academic success (e.g., Lo & Lo, Reference Lo and Lo2014; Macaro, Reference Macaro2018; Rose et al., Reference Rose, Curle, Aizawa and Thompson2019). As studying content subjects in English requires an understanding and application of knowledge, a lack of English proficiency can hinder students’ ability to absorb and utilize knowledge of the subject matter (e.g., Beckett & Li, Reference Beckett and Li2012; Doiz et al., Reference Doiz, Lasagabaster and Sierra2013). The results underscore the importance of developing students’ academic English proficiency to enhance their learning of scientific knowledge in the EMI setting.

In contrast, the path analysis itself did not find a significant impact of communication mode in the science classroom on students’ scientific knowledge attainment. While previous studies have emphasized the effectiveness of mixed language use in promoting comprehension and learning (Wu & Lin, Reference Wu and Lin2019; Xie & Curle, Reference Xie and Curle2020), the findings suggest that language output types may not necessarily reflect students’ internal learning processes.

Regarding affective variables, students with a stronger science learning self-concept demonstrated higher science achievement. Additionally, contrary to expectations (Marsh et al., Reference Marsh, Trautwein, Lüdtke, Köller and Baumert2005; Shen & Tam, Reference Shen and Tam2008), students’ perceived level of English difficulty in science learning was positively associated with scientific knowledge acquisition. It is possible that the challenge of learning science subjects in English serves as a motivating factor that drives students to invest more effort. Students may recognize the importance of studying content subjects in English and be inspired to devote time and energy to science disciplines (Lasagabaster, Reference Lasagabaster2016). Although the current data cannot provide precise explanation for this phenomenon, it becomes evident that negative perceptions of EMI should not always be considered obstacles to effective learning of subject content.

Furthermore, the study revealed that both science learning self-concept and perceived English difficulty partially mediate the effect of English proficiency on science achievement. This underscores the importance of not only fostering students’ English skills but also paying attention to their psychological well-being and providing an effective learning environment.

Conclusion

The current study contributes to the EMI literature by examining the complex relationships between variables that can impact students’ science learning outcomes. The empirical evidence obtained in this research supports previous findings while also providing new insights into the roles of each independent variable in the EMI secondary education context in Hong Kong. Based on the findings from path analysis, the following conclusions are drawn:

Consistent with previous findings, English proficiency remains a potent factor for effective learning of science subjects in the EMI context. To ensure good learning of science subjects, EMI secondary schools should make efforts to promote students’ English skills.
The type of language used in the science classroom may influence students’ understanding of subject content, but it does not guarantee the quality of their actual acquisition of scientific knowledge. Despite the widespread support for the translanguaging approach in EMI context, empirical evidence about its effectiveness in learning is lacking (Cummins, Reference Cummins, Juvonen and Källkvist2021). Good learning of science subject knowledge may extend beyond language use in the EMI classroom.
While previous EMI research has often highlighted the relationship between students’ positive attitudes toward EMI and academic achievement, the results in this study reveal that students’ perceived difficulty with English does not necessarily lead to negative science learning outcomes among EMI students. Although the data from the current study do not provide further explanations about the phenomenon, it is worth delving into the motivational mechanisms under the surface, which are still underexplored in the EMI research area.

While the study provides a holistic understanding of the relationships between the variables of interest, there are several limitations that should be addressed in future research. Firstly, self-report surveys can be subject to trustworthiness issues, and additional research using different methods is needed to enhance the credibility of the findings. Secondly, the study includes a small number of EMI secondary schools in Hong Kong, limiting the generalizability of the results. Future research should involve a larger sample of local EMI schools to validate the current findings. Lastly, the study focused on a limited set of variables in the path analysis. To gain a deeper understanding of the mechanisms promoting EMI students’ science achievement, future studies should explore more comprehensive models with a broader range of variables.

11 Questionnaire Survey in Researching EMI

Introduction: What Is a Questionnaire Survey?

As one of the most popular methods in social science, survey research seeks to understand a population as a whole through investigating a sample of individuals within this group. It tends to follow the postpositivist paradigm, grounded in the theoretical assumption that researchers can integrate certain pieces of knowledge conveyed by participants with other pieces of knowledge to generate new understandings of a phenomenon. In the field of English Medium Instruction (EMI), survey has been used extensively in a wide variety of studies (e.g., Drljača Margić & Vodopija-Krstanović, Reference Drljača Margić and Vodopija-Krstanović2018; Graham et al., Reference Graham, Eslami and Hillman2021; Jiang et al., Reference Jiang, Zhang and May2019; Kim & Thompson, Reference Kim and Thompson2022; Macaro et al., Reference Macaro, Tian and Chu2020; Qiu & Fang, Reference Qiu and Fang2022; Xie & Curle, Reference Xie and Curle2022).

Survey data can be gathered through structured interviews or self-administered questionnaires; it is the latter that is at the forefront of the discussion in this chapter. A questionnaire survey refers to “any written instrument that presents participants with a series of questions or statements to which they should react either by selecting from existing possibilities or writing out their answers” (Brown, Reference Brown2001, p. 6). According to the definition, participants are required to provide answers to prompts that take shape in either statements or questions. The popularity of questionnaire surveys in EMI research can be partly attributed to the fact that it is reasonably straightforward to construct and is capable of collecting vast quantities of information readily available for analysis. This is especially true when the questionnaire surveys are administered to a diverse group of participants, for example, a mix of EMI lecturers and students, or EMI students from different sociocultural backgrounds. In this case, data collection can be completed within hours. Considerable investment of time and effort can be saved when using online questionnaire surveys, for researchers just need to send emails and regular reminders to participants. Researchers are free from entering data one by one; rather, such numerical data can be downloaded and transformed to various formats. Compared to the classic pen-and-pencil questionnaire surveys, which require printing, mailing, and possibly dispatching by researchers themselves, a good return can be expected on the cost when online survey software packages and services are used.

Apart from enhancing efficiency with regard to researcher time, effort, and cost, one contributing factor to the widespread use of questionnaire survey in EMI studies is its versatility that facilitates researchers’ measurement of abstract constructs. These abstracts are assumed to exist and can be difficult to be measured otherwise. A brief glance at research outlets (e.g., academic journals, conference proceedings, theses, dissertations) reporting empirical evidence of EMI reveals a diverse application of questionnaire survey. Researchers use questionnaire survey to collect data on learner beliefs, learner motivation, learner characteristics, parents’ perspectives toward EMI implementation, learner expectations, learner attitudes, learners’ EMI learning experience, instructor needs, learners’ self-efficacy, learner anxiety, feedback practices, learner and instructor satisfaction, instructors’ challenges and coping strategies, just to mention a few. These constructs cannot be obtained through direct observation, and the understanding of them is a preliminary step toward designing a targeted intervention to improve the quality of EMI programs.

How to Use Questionnaire Survey in EMI Research?

Selecting a Sample Frame

During the initial planning of a questionnaire survey, the researcher must decide on his or her population of interest. Questionnaire survey research seeks to gather as much information as possible about a certain population, such as EMI instructors of nonlanguage subjects and international students from non-English-speaking countries. Given the huge expense of collecting census data, it is much more feasible to survey a subset of the total population that can be reasonably obtained, which is generally referred to as a sample. For example, understanding the relationship between perceived teacher support and language-associated challenges experienced by postgraduate students enrolled in EMI courses nationwide may entail surveying students in certain domestic universities with the expectation that the results are likely to extrapolate to the wider population. Hence, a crucial step in conducting questionnaire survey research is selecting a sample frame (also known as survey frame), defined as “the listing of all units in the population from which the sample will be selected” (Bryman, Reference Bryman2012, p. 187). This is because the authenticity, characteristics, and representativeness of the sample frame have a marked influence on the generalizability of EMI research findings to the larger population.

Two basic sampling techniques are commonly used in questionnaire survey research, namely, probability and nonprobability sampling. Probability sampling is a technique in which a sample is randomly selected; in this case, each research unit (e.g., student, lecturer) has a known and nonzero chance of being included in the study (Bryman, Reference Bryman2012). The four main types of probability sampling are simple random sampling, stratified random sampling, cluster sampling, and systematic sampling.

Simple random sampling. As the most straightforward form of probability sampling, the simple random sample aims for a truly representative sample where each unit of the population has an equal chance of selection (Bryman, Reference Bryman2012). With simple random sampling, researchers assign numbers to individuals within the target population and then randomly select from these numbers either by hand or using a random number generator. In the end, the numbers being chosen represent the individuals included in the sample.

Example: simple random sampling

Imagine that we plan to investigate how lecturers at a particular university perceive EMI undergraduate programs and have access to the list of 800 full-time lecturers that engage in these programs. We decide that we need a sample of 150 lecturers for our questionnaire survey study. After assigning a number to each lecturer in the list, we may produce a simple random sample of 150 participants with the use of a simple random sample. For example, if the first automatically generated number is 17, lecturer #17 on our list will be included in the sample. We may continue this process until 150 numbers are generated by the program.

Stratified random sampling. Stratified random sampling is a variety of sampling where a more extensive population group is divided into mutually exclusive subgroups based on certain criteria (e.g., gender, education background, career) and a sample is drawn from each subgroup. It ensures that each subgroup is properly and adequately represented in the sample. This is particularly useful when researchers are studying a population with diverse characteristics.

Example: stratified random sampling

When investigating 1,000 third-year business majors’ foreign language anxiety in EMI classrooms at a particular university, we might want our sample of 100 to adequately reflect the proportional representation of the different departments within the business school. Say that there are five departments in the business school; we would have five subgroups, with the numbers in each subgroup being one-tenth of the sum for each department. If there are 100 third-year students in the Department of Accountancy, 10 students from this department are expected to be included in the sample with the use of the sampling fraction of one in ten.

Systematic sampling. Systematic sampling is a technique where sample members are selected at a regular interval from a random starting point. The sampling interval is calculated as the result of the division of the population size and the sample size. If the population is arranged in a random order (e.g., alphabetically), systematic sampling can help generate a representative sample that can be utilized to draw some conclusions about a large and diverse population.

Example: systematic sampling

Imagine that we plan to explore non-English majors’ attitudes toward the use of English as a medium of instruction at a particular university. In this study, 6,000 non-English majors in this university make up the target population. When using systematic sampling for the selection of a random group of 600 people from this population, we need to place all the potential participants on a list in alphabetical order and choose a starting point from the first ten names on the list. Since the sampling interval equals 6000/600 = 10, every tenth participant is included in the sample.

Cluster sampling. Cluster sampling involves dividing the target population into individual clusters and randomly choosing participants from the clusters to form the sample. Given that the quality and representativeness of clusters influence the validity of findings, researchers need to ensure that each cluster mirrors the diversity of or at least share a roughly similar set of characteristics with the target population.

Example: cluster sampling

Imagine that we are interested in English language proficiency of first-year EMI undergraduates in a particular city. Sampling from a complete list of all first-year EMI undergraduates across the city can be a daunting task; simple random or systematic sampling technique may generate a widely dispersed sample. In this case, one way of obtaining a citywide representative sample might be to first sample universities and then target students from each university included in the study. We would need to adopt a probability sampling method at each stage of sampling. For example, we may assign a number to each university and sample twenty universities from the whole population of universities using a random number generator. After generating twenty clusters, we would randomly select a sample of students from within each cluster.

As is shown here, probability sampling highlights random selection for samples representative of the entire population, facilitating strong statistical inferences about this target population and generalizability of the findings. Nevertheless, probability sampling may come with challenges such as time and cost considerations, the issue of nonresponse, and heterogeneity of the target population (Bryman, Reference Bryman2012). When it is hard to individually identify the population parameters or gain access to a random, representative sample, researchers may refer to nonprobability sampling, which is a branch of sample selection not based on the canons of probability sampling (Bryman, Reference Bryman2012). The purpose of using nonprobability sampling is not to accurately represent all individuals of a wider population but to understand the perspectives of a targeted set of individuals on the basis of their availability, geographic location, knowledge, experience, or other characteristics. Among the various types of nonprobability sampling methods, convenience sampling is frequently adopted in questionnaire survey-based research in EMI (e.g., Graham et al., Reference Graham, Eslami and Hillman2021; Macaro et al., Reference Macaro, Tian and Chu2020; Xie & Curle, Reference Xie and Curle2022). Convenience sampling, as its name suggests, is a sampling technique where researchers survey participants who are conveniently available.

Example: convenience sampling

Imagine that two researchers who teach corporate finance using English as the medium of instruction at a university plan to investigate students’ belief about first language use in EMI classrooms. They may survey several classes of students, all of whom are taking corporate finance as a compulsory course in the major of finance. The chances are that most students in their class may give their consent to participation in the questionnaire study, boosting a response rate. This study may yield interesting and useful information about these students, but it may be difficult for the research findings to be generalized to the rest of the population (all EMI students).

Designing the Questionnaire Survey

As Dörnyei and Taguchi (Reference Dörnyei and Taguchi2010) point out, a professional questionnaire contains a concise title, introduction, instructions for participants, and a “thank you” message. Clear introductory statements can help boost response rates and improve the quality of data. The introduction of a questionnaire survey makes clear the researchers’ names, the governing institution where the researchers are based, the research goals and objectives, how the data collected from the questionnaire will be processed, the voluntary nature of the research, and the extent to which confidentiality of records will be maintained. When the questionnaire is delivered, the aforementioned information can also be included in the cover letter accompanying the questionnaire. Cover letters, although not the main component of the questionnaire survey, often function as the first point of contact that potential participants have with researchers. They therefore have a significant impact on the former’s decisions about whether to engage with the study. For example, a lengthy cover letter loaded with technical terms may fail to captivate the potential participants. On the contrary, a concise and well-organized introduction inclusive of the required information is likely to improve the response rate.

Writing items are the most important yet time-consuming part of questionnaire design. Items listed in a professional questionnaire survey can be close- and open-ended. Close-ended items provide a stimulus (either questions or statements) for participants to read and to choose their responses from a list of predefined options. These possible responses can be dichotomous (e.g., yes/no, agree/disagree, true/false) or multiple choice (e.g., Likert scales, rating scales, checklist type, rank order). Given their structured nature, participants can interpret and answer them quickly and easily. On the contrary, open-ended items are unstructured and require participants to answer in open-text format, with no restrictions on participants’ choices of responses. Possible responses can either be brief and specific (e.g., in response to “what is the medium of instruction in your high school?”) or long and in-depth (e.g., in response to “what are the challenges of using English in science instruction?”). Questionnaire items, either open- or close-ended, are used to gather both objective and subjective data. Objective data include background information on participants, such as age, gender, major, language proficiency level, language of instruction, and native language. Subjective data include beliefs, attitudes, and perspectives of participants, such as students’ perceived linguistic gains in EMI, students’ communicative needs in EMI classrooms, and the influence of sociocultural background on EMI instructors’ teaching behaviors, to name a few.

It is important to test the questionnaire survey prior to data collection. Piloting enables researchers to identify issues that would prevent the items from gathering the information they are designed to measure. A pilot study can be conducted to (1) generate fixed-choice answers using open-ended questions if the main research is primarily based on close-ended questions; (2) check the phrasing and wording of the questionnaire items and instructions; (3) arrange questionnaire items in the proper order; (4) find out whether additional and specifying items are needed; and (5) identify items that elicit almost the same responses from participants and thus do not constitute a variable (Bryman, Reference Bryman2012). Pilot testing should be conducted with a sample representative of the target population of the main study under survey conditions (e.g., similar sociocultural background and geographical location). During the pilot study, participants are encouraged to complete the questionnaire survey as well as to offer their feedback and suggestions on how to improve the research design.

Distributing the Questionnaire Survey

The questionnaire survey is distributed after being piloted and revised. The most frequently used means nowadays is to distribute the questionnaire survey as a link created via integrated survey tools. The key advantage of technologically mediated methods is that data can be viewed in real time, reported in various forms (e.g., graphs, figures, tables), shared with others, and exported for further statistical analysis. Various web-based distribution methods such as email and Quick Response code (QR code) can be used according to the research questions, sample size, and resources. Email is arguably the most popular channel of questionnaire distribution, with the potential participants clicking on a hyperlink and then being forwarded to the survey forms hosted on a website. An email survey can be accessed by smartphones, tablets, and laptops whenever and wherever possible. Emails provide researchers with an opportunity to use multimedia elements to broaden the appeal of the questionnaire survey. With the use of emails, researchers need to send reminder emails in case participants miss the survey link enclosed in the first email. Researchers may also choose to create a QR code from the survey link when they seek to gather point-in-time responses. A QR code is a type of machine-readable barcode that stores information as a set of pixels in a square grid. Participants are able to access the survey link after scanning the QR codes. These codes can be created quickly with QR code generators, printed out at one go, and used repeatedly during research.

Case Study

Background

The COVID-19 pandemic declared in early 2020 has resulted in far-reaching changes worldwide. Given the fast-spreading and contagious nature of the pandemic, governments around the globe introduced precautionary measures in the short and long terms, such as travel restrictions, social distancing, school closures and the wearing of face masks. In the world of higher education (HE), institutions responded with a shift and urgent transition from face-to-face teaching to online instruction. In 2020, for example, the Ministry of Education of China launched an emergency policy initiative called “Suspending Classes without Stopping Learning,” which aims to “integrate national and local school teaching resources, provide rich, diverse, selectable, high-quality online resources for all students across the country, and support teachers’ online teaching and children’s online learning” (Ministry of Education of the People’s Republic of China, 2020a). By August 2020, approximately 22.59 million university students had been enrolled in online modes of instruction, with 1.1 million courses offered by 1.08 million teachers (Ministry of Education of the People’s Republic of China, 2020b). All face-to-face universities were forced to move to emergency remote teaching (ERT), which required a shift in teaching methods and resources for a smooth transition to distance education.

In accordance with a fixed schedule, the original course during the ERT is delivered by instructors via an online teaching platform. In this way, instructors have the responsibility of staying on schedule and tracking students’ progress. The course can be a combination of lectures, individual assignments, and collaborative activities. Student–instructor and student–student interactions are mediated through discussion forums, live chats, lecture notes, and audio and video conferencing. Toward the end of the course, assessment is conducted online to collect information about students’ academic progress and to make judgments about students’ level of achievement during the course.

Recent studies of ERT in HE during the pandemic have reported several challenges encountered by university students, including network malfunction, lack of digital readiness, the need for self-discipline, a sense of isolation, temptation for procrastination, students’ physical and mental health, and difficulty of navigating the relationship between theory and practice (see, e.g., Al-Nofaie, Reference Al-Nofaie2020; Holzer et al., Reference Holzer, Lüftenegger, Korlat, Pelikan, Salmela-Aro, Spiel and Schober2021; Jiang et al., Reference Jiang, Islam, Gu and Spector2021; Yu & Huang, Reference Yu, Huang, Pun, Curle and Yuksel2022). In view of these challenges brought by considerable academic and social changes in the crisis, designers and instructors of online courses are under great pressure to develop and evaluate strategies for getting university students engaged in online learning. Student engagement refers to “the extent to which students actively engage by thinking, talking, and interacting with the content of a course, the other students in the course, and the instructor” (Dixson, Reference Dixson2015). As a key predictor of student achievement, student engagement is closely associated with student satisfaction, learning persistence, and academic performance (Handelsman et al., Reference Handelsman, Briggs, Sullivan and Towler2005; Meyer, Reference Meyer2014). Therefore, increasing students’ engagement remains essential for student success and retention during the crisis and for adequate preparation and understanding of instructors regarding the unique needs for distance education. This chapter aims to incorporate collaborative writing activities into online teaching and to evaluate the acceptability and usefulness of these activities in promoting university students’ online engagement through their self-perceptions.

Student Engagement and Online Learning

The role of student engagement in HE has received increased attention. In HE contexts, studies of student engagement in learning often focus on what students are doing and its effect on students’ academic performance (Carini et al., Reference Carini, Kuh and Klein2006). Student engagement can be represented in “the development of critical thinking skills, higher grades and a general embracing of learning by taking responsibility and actions to achieve intrinsically motivated goals” (O′ Shea et al., Reference O′ Shea, Stone and Delahunty2015, p. 43). However, central to the discussion is what engagement means and what it is that university students are engaging with. This is of particular relevance since learning is often assumed to take place as the level of engagement demonstrated by students rises (Beldarrain, Reference Beldarrain2006; O′ Shea et al., Reference O′ Shea, Stone and Delahunty2015). Prior studies have noted that instructors are also required to engage in teaching so that reciprocal benefits can be substantial in both teaching and learning (Dyment et al., Reference Dyment, Downing and Budd2013; Pittaway, Reference Pittaway2012). When it comes to online contexts, however, engagement is manifested in different forms because of insufficient face-to-face interaction and the ways in which teaching and learning activities are supported by technology.

Drawing on an online questionnaire, Swan et al. (Reference Swan, Shea, Fredericksen, Pickett, Pelz and Maher2000) identify three factors that exert influence upon the success of asynchronous online learning, namely, design of the course interface, quality interaction with course instructors, and active discussion with classmates. Likewise, Anderson (Reference Anderson, Moore and Anderson2003) sees interaction as key to student engagement and it should be nurtured in online teaching and learning. Interactions that promote student engagement include learner–learner, learner–instructor, and learner–content interactions (Moore, Reference Moore, Harry, John and Keegan1993). They are useful in kindling students’ interest in distance education and in driving them toward the best possible educational outcomes (Chen et al., Reference Chen, Gonyea and Kuh2008). Previous studies such as those outlined earlier have established the importance of online student engagement for student success (Anderson, Reference Anderson, Moore and Anderson2003; Dennen et al., Reference Dennen, Aubteen Darabi and Smith2007; Robinson & Hullinger, Reference Robinson and Hullinger2008; Swan et al., Reference Swan, Shea, Fredericksen, Pickett, Pelz and Maher2000). Creating an online learning environment that is cohesive and interactive (Gaytan & McEwen, Reference Gaytan and McEwen2007) can help address students’ feelings of isolation (Lewis & Abdul-Hamid, Reference Lewis and Abdul-Hamid2006; Madeline et al., Reference Madeline, Ricky, Tracy, Roberts and Emily2005; Yu & Huang, Reference Yu, Huang, Pun, Curle and Yuksel2022), as it enables students to connect with teachers, students, and course content (Young, Reference Young2006).

The scholarly attention given to student engagement gives rise to strong theoretical foundations and useful models. According to Kuh (Reference Kuh2003, p. 25), student engagement involves “the time and energy students devote to educationally sound activities inside and outside of the classroom, and the policies and practices that institutions use to induce students to take part in these activities.” This conception serves as a basis for the development of the National Survey of Student Engagement (NSSE). Created to incorporate the language of effective pedagogical practices into efforts aimed at improving collegiate quality, the NSSE measures five dimensions of engagement: level of academic challenge, active and collaborative learning, student interaction with faculty members, enriching educational experience, and a supportive campus environment (Robinson & Hullinger, Reference Robinson and Hullinger2008). The NSSE looks into students’ collegiate experience as a whole, both inside and outside the classroom. Other instruments that attempt to measure student engagement pay more attention to the classroom context. For example, Handelsman et al. (Reference Handelsman, Briggs, Sullivan and Towler2005) developed a measure of in-class student engagement. Their study revealed four distinct dimensions of student engagement: skills engagement (general learning strategies used by students for intrinsic and extrinsic rewards); emotional engagement (emotional involvement with course material); participation/interaction engagement (class participation and communication with instructors and other students); and performance engagement (levels of classroom performance) (Handelsman et al., Reference Handelsman, Briggs, Sullivan and Towler2005).

Drawing on these previous incarnations of student engagement in traditional classroom settings, Dixson (Reference Dixson2015) sees online student engagement as involving students (1) devoting time and energy to learning teaching materials and acquiring skills, (2) interacting with instructors and peers in a meaningful way (enough so that other people become “real”), and (3) becoming emotionally involved with their studies (i.e., getting excited about an idea, enjoying the learning and/or interaction). In this way, online engagement consists of students’ attitudes, viewpoints, and learning behaviors as well as communication with others (see Figure 11.1). Student engagement in online courses is therefore concerned with students devoting time, energy, ideas, efforts, and, to some degree, emotions to their studies. Following this perspective, Dixson (Reference Dixson2015) develops and validates the Online Student Engagement Scale (OSE), which measures how students behave in learning as well as how they perceive their learning process and the connections they are making with the course content, the teacher, and fellow students in the light of skills, participation, performance, and emotion.

Figure 11.1 Components of online student engagement

Methods

Due to growing trends toward academic internationalization and the overwhelming dominance of English as a lingua franca in academia, Chinese universities are caught up in a rush to adopt EMI (Yu & Pun, Reference Curle, Derakhshan, Pun and Curle2021). In 2001, the Ministry of Education of China issued several guidelines aimed at promoting bilingual teaching, formulating the goal that within three years, 5–10 percent of content area courses should be delivered in a foreign language, mainly English (Ministry of Education of the People’s Republic of China, 2001). During the past decade, EMI has evolved in HE from being a bilingual teaching experience in economically developed areas to being widely used on a national scale (Jiang et al., Reference Jiang, Zhang and May2019). With EMI becoming an essential part of the ongoing HE internationalization, Chinese universities are offering specialized undergraduate courses in English to increase students’ global competitiveness and to attract overseas students. While benefiting from the native-like English-speaking environment, university students enrolled in EMI programs are often troubled by language and academic problems (Skyrme, Reference Skyrme2007), which may influence their in-class engagement. The online learning environment further adds to the difficulties that university students encounter in EMI settings, and one of the major challenges faced by instructors is how to activate students’ academic engagement during ERT.

Participants were recruited from an undergraduate-level EMI course entitled “scientific writing” offered in the spring semester in 2021 by a public research-intensive university in mainland China. The course is a compulsory one covering academic writing skills required in medical research. Forty-six second-year medical students (eighteen females, twenty-eight males) taking the course participated in this pre–post intervention study. The instructor taught academic writing for various academic contexts (research papers, conference presentations, technical reports) through video conference-delivered lectures. During this one-semester course, students were randomly assigned to twelve groups to collaboratively complete a review paper via an online platform called Tencent Docs. Tencent Docs is an online document platform designed for multiperson collaboration, offering a word processor and spreadsheet editor like Microsoft Word and Excel with functions of editing, viewing track changes, sharing, translating, and downloading. At the beginning of the course, the instructor provided five recommended topics and paper structures for students’ reference. Student groups gathered information, searched medical literature using different databases, located relevant literature, and collaboratively worked with their group members online to complete the group assignment. Students were free to negotiate writing timelines and progress within the group. At the end of the course, each group was required to submit a full-fledged paper and to make an oral presentation online.

The two course instructors are specialized in academic writing education and have at least five years of EMI teaching experience. They have been working closely with practicing doctors in developing and refining the course content and in-class activities.

With the help of the course instructors, the nineteen-item OSE (Dixson, Reference Dixson2015) was distributed to participants to measure their perceived engagement in this online course (see Appendix). This self-rating questionnaire used a 5-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree), where students reported how well each learning behavior, thought, or emotion was characteristic of them or their behavior before and at the end of the course (Dixson, Reference Dixson2015). In the questionnaire (Cronbach’s alpha = 0.92), statements included descriptions of thoughts and feelings such as “finding ways to make the course material relevant to my life” and “finding ways to make the course interesting to me”; those of skills such as “looking over class notes between getting online to make sure I understand the material” and “taking good notes over readings, PowerPoints, or video lectures”; those of participation such as “helping fellow students” and “posting in the discussion forum regularly”; and those of performance items such as “doing well on the tests/quizzes” and “getting a good grade.”

After data collection, statistical analyses were performed using Statistical Package for the Social Sciences (SPSS) 22 and self-contrast methodology. The means and standard deviations were used to describe the total scores and single scores for each item listed on the questionnaire. The paired t-test was used to compare pre- and post-activity scores. A significant difference was considered to exist if the p-value was less than 0.05 (two-tailed).

Findings

From March to June 2021, ninety-two questionnaires were made available online and were distributed. All participants in the EMI course finished the collaborative writing assignment and participated in this survey research. They completed the questionnaire prior to and at the end of the course.

At the beginning of the course, many students considered the collaborative writing assignment as a way of releasing pressure during ERT. The true aim was to help students cope with ERT, where the challenges of insufficient interaction and lack of emotional support have to be overcome (Yu & Huang, Reference Yu, Huang, Pun, Curle and Yuksel2022), and reducing stress was just a natural part of this process. Students were encouraged to communicate with other course participants and make full use of computer-supported collaborative learning and then apply these in other online courses.

The questionnaire was completed in full by all forty-six out of forty-six students in this EMI course. The results reveal that the majority of the participating students had significant improvements (p < 0.05) in online engagement in terms of skills, emotion, and participation (Table 11.1). Ratings for each item before and after the collaborative writing assignment were analyzed using the paired t-test, and the results were significantly different. No significant difference was observed between the implementation of online collaborative writing and students’ performance in tests and quizzes (p > 0.05).

Table 11.1 Differences between pre- and post-activity self-assessment (n = 46)

		Mean	SD	95 percent confidence interval	p
Skills
1	Making sure to study on a regular basis	1.063	1.216	0.624 ~ 1.501	<0.001
2	Staying up on the readings	1.063	1.014	0.697 ~ 1.428	<0.001
3	Looking over class notes between getting online to make sure I understand the material	1.094	1.228	0.651 ~ 1.536	<0.001
4	Being organized	0.875	1.04	0.500 ~ 1.250	<0.001
5	Taking good notes over readings, PowerPoints, or video lectures	0.844	1.273	0.385 ~ 1.303	0.001
6	Listening/reading carefully	0.938	1.162	0.518 ~ 1.357	<0.001
Emotion
7	Putting forth effort	1.125	1.1	0.728 ~ 1.522	<0.001
8	Finding ways to make the course material relevant to my life	1.75	1.164	1.330 ~ 2.170	<0.001
9	Applying course material to my life	0.688	1.256	0.235 ~ 1.140	0.004
10	Finding ways to make the course interesting to me	1.656	1.285	1.193 ~ 2.120	<0.001
11	Really desiring to learn the material	0.5	1.218	0.608 ~ 0.939	0.027
Participation
12	Having fun in online chats, in discussions, or via email with the instructor or other students	0.5	0.803	0.210 ~ 0.790	0.001
13	Participating actively in small-group discussion forums	0.938	1.105	0.539 ~ 1.336	<0.001
14	Helping fellow students	1.094	1.51	0.549 ~ 1.638	<0.001
15	Engaging in conversations online (chat, discussions, email)	0.531	1.391	0.030 ~ 1.033	0.039
16	Posting in the discussion forum regularly	0.781	1.008	0.418 ~ 1.145	<0.001
17	Getting to know other students in the class	1.438	1.268	0.980 ~ 1.895	<0.001
Performance
18	Getting a good grade	0.344	1.31	−0.129 ~ −0.816	0.148
19	Doing well on the tests/quizzes	0.281	1.468	0.255 ~ 0.817	0.293

Discussion

In the HE sector, the outbreak of COVID-19 had given rise to a considerable disruption in the teaching modality. To control the spread of this rapidly growing and highly infectious disease, HE institutions had moved education online. This shift to ERT meant that a great number of university students could keep learning, albeit situated in a context of social injustice, inequity, and the digital divide (Bozkurt et al., Reference Bozkurt, Jung, Xiao, Vladimirschi, Schuwer, Egorov, Lambert, Al-Freih, Pete, Olcott, Rodes, Aranciaga, Bali, Alvarez, Roberts, Pazurek, Raffaghelli, Panagiotou, de Coëtlogon, Shahadu, Brown, Asino, Tumwesige, Ramirez Reyes, Barrios Ipenza, Ossiannilsson, Bond, Belhamel, Irvine, Sharma, Adam, Janssen, Sklyarova, Olcott, Ambrosino, Lazou, Mocquet, Mano and Paskevicius2020). Instructors were required to adapt schedules to ERT environments, usually with minimal if any prior experience. Despite instructors’ and students’ varying degrees of preparedness, this crisis “has been, and will continue to be, both a massive challenge and a learning experience for the global education community” (Schleicher, Reference Schleicher2020, p. 7). This study introduced in detail the teaching methods for collaborative writing and systematically evaluated these techniques for improving student engagement in online courses.

The results yielded a positive association between students’ practice of collaborative writing and self-reported emotions in ERT. This resonates with the notion that the measures to control the spread of the coronavirus have an influence on students’ psychological well-being, such as social isolation and loneliness arising from the disruption in their everyday routines and interpersonal communication (Labrague et al., Reference Labrague, De Los Santos and Falguera2021). The prolonged social distancing measures made it hard for students to satisfy one of their basic needs – the need to associate with others (Bavel et al., Reference Bavel, Baicker, Boggio, Capraro, Cichocka, Cikara, Crockett, Crum, Douglas, Druckman, Drury, Dube, Ellemers, Finkel, Fowler, Gelfand, Han, Haslam, Jetten, Kitayama, Mobbs, Napper, Packer, Pennycook, Peters, Petty, Rand, Reicher, Schnall, Shariff, Skitka, Smith, Sunstein, Tabri, Tucker, Linden, Lange, Weeden, Wohl, Zaki, Zion and Willer2020; Yu & Huang, Reference Yu, Huang, Pun, Curle and Yuksel2022). In view of the months-long closure of many HE institutions, students have fewer opportunities for peer-to-peer communication on campus and may feel less connected with their friends (Elmer et al., Reference Elmer, Mepham and Stadtfeld2020), which may consequently affect the way they engage in online courses. The findings generated from this study highlight the significance of relational factors, such as peer support during collaborative writing, for students’ mental health and their engagement in distance education during the time of the crisis. Collaborative writing helps enhance university students’ perception of social support from and interaction with peers free of violations against the ongoing anti-epidemic measures, for instance, by encouraging students to work with peers on an academic project via social media. Collaborative writing afforded students the opportunity to exchange ideas and feedback with each other.

A comparison of the scores shows that collaborative writing activities are effective in promoting students’ skills engagement. Such skills include (1) making sure to study on a regular basis; (2) staying up on the readings; (3) looking over class notes between getting online to make sure I understand the material; (4) being organized; (5) taking good notes over readings, PowerPoints, or video lectures; and (6) listening/reading carefully. As is shown in previous research, the success of online learning is positively conditioned by active learners who are in charge of their own learning process (Hiltz & Shea, Reference Hiltz, Shea, Hiltz and Goldman2005). In taking charge of their studies, students are implementing self-regulated learning, which refers to “self-generated thoughts, feelings, and actions that are planned and cyclically adapted to the attainment of personal goals” (Zimmerman, Reference Zimmerman, Boekaerts, Pintrich and Zeidner2000, p. 14). Self-regulated learning is gaining increasing significance in online education, as distance learning sets a great demand on students’ agency and ownership of their own learning (Çebi & Güyer, Reference Çebi and Güyer2020). A systematic review of online student engagement during ERT reveals that the levels of self-regulated learning in the online contexts were higher than or similar to those of traditional face-to-face courses (Salas‐Pilco et al., Reference Salas‐Pilco, Yang and Zhang2022). The results of this study indicate that collaborative writing activities are important in helping students enhance their self-regulation in the EMI course. In a rather similar line, Su et al. (Reference Su, Zheng, Liang and Tsai2018) point out the key role of online collaboration in fostering students’ self-regulation. It might be argued that collaborative writing, as a technology-mediated type of assignment, creates a convenient and user-friendly environment for students to apply self-regulated strategies, such as goal orientation, self-monitoring, and self-assessment in order to heighten self-regulation. According to Boykin et al. (Reference Boykin, Evmenova, Regan and Mastropieri2019), students with varying knowledge levels and needs are able to collaborate effectively and efficiently on the writing tasks and promote their self-regulation during the learning process. In addition, online collaborative writing is student-centered in nature and allows the students to take charge of and make plans for the academic activities, which could further help them actively engage in their studies.

The findings of this study demonstrate that students’ participation/interaction engagement is positively associated with online collaborative writing activities. The six aspects in focus are (1) having fun in online chats, in discussions, or via email with the instructor or other students; (2) participating actively in small-group discussion forums; (3) helping fellow students; (4) engaging in conversations online (chat, discussions, email); (5) posting in the discussion forum regularly; and (6) getting to know other students in the class. When producing a jointly written text, student participants are following active and collaborative learning, which refers to students’ efforts of contributing to class discussions, working with other students, and engaging in other class events (Kuh, Reference Kuh2001). The results support previous research into this area, which views collaboration as critical to facilitating learning in promoting learning in the online classroom (Conrad & Donaldson, Reference Conrad and Donaldson2011; Robinson & Hullinger, Reference Robinson and Hullinger2008). The online classroom has often been perceived as a learning community, where learners are expected to devote collaborative efforts to their studies (Barker, Reference Barker and Cothran2002; Benbunan-Fich et al., Reference Benbunan-Fich, Hiltz, Harasim, Hiltz and Goldman2005; Dede, Reference Dede and Hanna2000). The results of the study confirm that collaborative writing promotes online learners’ active engagement in online learning through thinking, talking, and interacting with the other students. This corroborates the ideas of Palloff and Pratt (Reference Palloff and Pratt2001) that collaborative assignments empower and engage students to the extent that this affects students’ subsequent learning behaviors. In completing a collaborative project, group exchanges can solicit different perspectives and encourage idea sharing, which are generally considered as effective learning techniques (Benbunan-Fich et al., Reference Benbunan-Fich, Hiltz, Harasim, Hiltz and Goldman2005; Conrad & Donaldson, Reference Conrad and Donaldson2011; Robinson & Hullinger, Reference Robinson and Hullinger2008). As a core part of collaborative learning, learning with peers enables students to freely pose their own inquiry in respective study groups (Kemery, Reference Kemery and Aggarwal2000).

The results indicate that students’ performance in online courses appears not to have been affected by collaborative writing. Students’ attitudes toward their grades, tests, and quizzes seem to remain the same, as the quantitative results were not statistically significant. A few reasons might be plausible to explain these results. Firstly, although students were required to collaboratively work on the writing tasks following a tight schedule, to students this was a course designed for scientific writing and the assignment merely constituted a small part of the total grade, so they tended to focus more on their own essays for better marks, not the quality of collaborative writing. Apart from this, technical vocabulary and word spelling were not explicitly taught in class, but their importance was emphasized by the instructor in class. Hence, participating students tended to use online editing services with which they could check for spelling and grammatical errors and to consult online dictionaries for definitions, sample sentences, and synonyms. To a certain extent, the use of online proofreading tools limits peer-tutoring during the collaborative activities. Secondly, group work may not lead to productive outcomes associated with some language skills. Shehadeh (Reference Shehadeh2011) shows that compared with independent learning, collaboration may be less helpful in the acquisition of some straightforward skills such as spelling. This may also apply to students’ basic language knowledge, vocabulary being an example (Li & Mak, Reference Li and Mak2022). As is shown here, however, collaboratively working on a writing project may promote university students’ understanding of pragmatics, for instance, word usage in writing. Moreover, students’ priorities may differ in certain language aspects in their collaboration with peers. This is in line with Kessler et al. (Reference Kessler, Bikowski and Boggs2012), who point out that students appear to neglect mistakes that they regard as less important in jointly producing a text, despite the fact that they could make the corrections. However, these conclusions are only suggestive, given the small-scale nature of the study and the limited number of examinations the students attended. The findings do suggest that the effects of collaboration on the students’ in-class performance online need to be investigated further with a larger sample size and more forms of assessment.

This chapter reports the experiences using online collaborative writing for engaging students in the EMI field. Several limitations of this case study should be noted. Firstly, it included a limited number of learners and the results were based on a single EMI course. Therefore, no systematic conclusions can be made. Future research is needed to include more subjects and participants and to examine a broader range of disciplines and interventions. In addition, detailed insights into the student collaborative setting are required for a better understanding of the learning process. Future research may seek to analyze video data of real-time online interaction (e.g., instructor–student interaction, student–student interaction). Even though the data collected with questionnaires shed light on students’ perception of their engagement in online classes, the generalizability of these findings is limited. Moreover, the small number of participants may cause a certain bias. It is suggested that different settings of EMI be investigated in future studies to provide more food for thought on the creation of an engaging online learning environment.

As Hughes (Reference Hughes2007) points out, while the technology-mediated education is “more welcoming of diversity” (p. 710) and has the potential for students’ flexible engagement, there remains the possibility of disengaging. With the current educational disruption resulting from the COVID-19 pandemic fueling a great deal of investments in and the use of instructional technologies designed for ERT (Asare et al., Reference Asare, Yap, Truong and Sarpong2021), online learning is likely to become more feasible for university students, in consideration of safety, cost, time, and distance. Further studies carried out at a multi-institutional or national level for an understanding of the nature of student engagement in the online context are therefore recommended. Qualitative analysis that foregrounds students’ and instructors’ perspectives may provide more insights into the myriad factors that affect student engagement in online courses and may also serve as a foundation for interventions aimed at nurturing and maximizing student engagement.

Concluding Remarks

Given their potential in gathering large amounts of data in a short time with minimum researcher involvement, questionnaire surveys have gained immense popularity in EMI research. Bearing the goal of constructing or testing a theory in mind, data collected from questionnaire surveys can be used to explain and interpret trends or determine how two or more variables may be related, for example the relationship between EMI teaching practices and students’ learning motivation. Also, they can be used to make a comparison between different groups of research participants, such as support needs of EMI instructors across different countries, or to investigate an individual participant or a particular group of participants over a period of time, such as students’ self-report of attitudes toward EMI during years of study. Moreover, questionnaire-based research is effective in measuring the influence of several independent variables on the dependent variable, for example, how EMI-related language ideologies, academic skills support provision, and language use anxiety may impact student performance. Despite their extensive application in EMI research, questionnaire surveys have become somewhat “problematizing,” with a range of disparate strands to account for. Questionnaire development and enactment can be complicated, can be difficult to adapt in agreement with research aims, and is likely to require applications of questionnaire theory, since data accuracy is positively conditioned by adequate and well-documented validity and reliability properties (Curle & Derakhshan, Reference Curle, Derakhshan, Pun and Curle2021). Such challenges are complicated by the dynamic and evolving nature of EMI in the current era characterized by the globalization and marketization of HE, yet they offer exciting new perspectives and opportunities for EMI researchers and practitioners.

12 The Use of MANOVA in Analyzing the Effects of Gender on Perceived Difficulties in Speaking and Writing in a Hong Kong EMI University

Project Overview and Research Context

Research has increasingly focused on the teaching of academic subjects such as science, medicine, and engineering in English Medium Instruction (EMI) universities (Macaro et al., Reference Macaro, Curle, Pun, An and Dearden2017). EMI broadly refers to the use of “the English language to teach academic subjects (other than English itself) in countries where the first language of the majority of the population is not English” (Macaro & Akincioglu, Reference Macaro and Akincioglu2018, p. 256). Governments and educational institutions have implemented EMI policies aimed at promoting the integration of content and language learning (Dafouz et al., Reference Dafouz, Camacho and Urquia2013) in response to the “Englishisation” of higher education globally (Galloway et al., Reference Galloway, Numajiri and Rees2020, p.395). EMI is increasingly applied in tertiary education, and thus understanding and addressing the requirements of students in terms of their learning experiences and the linguistic challenges they face are important to fully address the varying levels of English proficiency in EMI classrooms (Galloway et al., Reference Galloway, Numajiri and Rees2020; Kamaşak et al., Reference Kamaşak, Sahan and Rose2021).

This project focuses on the dimension of gender in language experiences in EMI university settings. Recent EMI studies have explored the relationship between the motivation to learn English and gender (Lasagabaster, Reference Lasagabaster2016) and gender differences in the self-reported challenges of using academic English (Pun & Jin, Reference Pun and Jin2021). Female students have been found to be more motivated in English as a Foreign Language (EFL) settings (Meece et al., Reference Meece, Glienke and Burg2006). However, Lasagabaster (Reference Lasagabaster2016) argued that the EMI setting can dilute the gender-related differences in language learning motivation, as “the use of the language to learn content is equally prone to motivate male and female students,” and that the EMI approach can support “both male and female students to form vivid images of themselves as authentic users and speakers of the language” (Lasagabaster, Reference Lasagabaster2016, p. 328). Lasagabaster’s (Reference Lasagabaster2016) study indicates that in language learning in EMI university contexts, gender-related differences in levels of motivation tend to disappear. Pun and Jin’s (Reference Pun and Jin2021) study involved seventy-three randomly recruited science students from two EMI universities in Hong Kong, in which both females and males reported English language difficulties in academic speaking and writing. However, male students demonstrated better peer communication in English than females in a science classroom setting, at a statistically significant level. Pun and Jin (Reference Pun and Jin2021) suggested that females were likely to feel more stressed or shy when actively presenting their opinions in front of people, rather than any actual gender difference in English-speaking ability. Thus, in this project we focus on male and female students’ perceptions rather than gender differences in language motivation or proficiency.

Although the literature has recently addressed the connection between gender and students’ perceptions of English language use in EMI universities, further research is required. Students’ perceptions are socially and culturally situated and based on gender-related social experiences. In countries such as Japan, more females join EMI programs than males, as internationalism and English are considered to provide them with a degree of freedom in a male-centered society (Bradford et al., Reference Bradford, Shimauchi and Brown2017). However, females in other countries, such as Finland, were found to be less confident in their English skills than males (Bukve, Reference Bukve2020). Chou (Reference Chou2018, p. 615) analyzed studies published in recent decades and found that students often feel discouraged from actively participating in class, due to the “anxiety resulting from a lack of confidence in speaking and a fear of losing face in public.”

Students’ experiences are likely to significantly influence their feelings regarding EMI. Many students express that they feel anger, due to “not understanding the content knowledge, failing in EMI courses and losing concentration” (Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018, pp. 272–273). However, successful EMI courses and teaching strategies can also lead to feelings of hope and enjoyment (Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018). The personal characteristics of learners, including gender, can potentially affect their experiences and consequently how they perceive the challenges of using academic languages. Although some studies have assessed the types of challenges that students in EMI universities face, the relationship between these challenges and students’ characteristics remains unclear. The contribution of this project is to explore the effects of gender on students’ perceptions of the language difficulties they encounter.

In terms of the methodology applied, questionnaire surveys are useful tools for measuring students’ characteristics, including gender, and the challenges they perceive when studying in EMI universities (Jiang et al., Reference Jiang, Zhang and May2016; Tatzl & Messnarz, Reference Tatzl and Messnarz2013). For example, a recent study investigated the linguistic challenges faced by 498 undergraduate students in an EMI university in Turkey using a questionnaire survey (Kamaşak et al., Reference Kamaşak, Sahan and Rose2021). Second- and fourth-year undergraduate students were found to perceive reading English as more challenging than first-year students, but interestingly male students found understanding EMI instruction more challenging than female students. Pun and Jin (Reference Pun and Jin2021) used a similar method in their study of seventy-three undergraduate students in two EMI universities in Hong Kong, and through a questionnaire survey they identified gender differences in the EMI experience, particularly in terms of communication. Chu et al. (Reference Chou2018, p. 9) also used a questionnaire in their study of 278 undergraduate students in Taiwan and proposed that the “majority of local students’ command of English may not be compatible” with that of their international peers, in terms of “verbally expressing themselves fluently.” They found that local students whose first language was not English were reluctant to use English in class participation and social interaction, to “prevent themselves from making mistakes or embarrassing themselves” (Chu et al., Reference Chou2018, p. 9). Tsui and Ngo (Reference Tsui and Ngo2017, p. 69) used a questionnaire survey to explore perceptions of EMI instructions among 606 undergraduate students in Hong Kong and found that many were worried that the instructions could diminish their “academic results, motivation to learn, learning atmosphere and in-class discussion.” Thus, questionnaire surveys can provide valuable insights into the various aspects of students’ language learning experiences in EMI settings. However, additional statistical measures are required to further explore the effect of gender on students’ perceptions.

Questionnaire surveys can be combined with statistical techniques such as analysis of variance (ANOVA) or multivariate ANOVA (MANOVA) to compare group differences, in terms of gender and years of study, in various language-related challenges (see examples from Kamaşak et al., Reference Kamaşak, Sahan and Rose2021). For example, Turhan and Kırkgöz (Reference Turhan and Kırkgöz2018) combined a questionnaire with MANOVA to reveal differences in the motivations of students toward EMI instructions. The MANOVA test was used to investigate whether the year of study was a determinant of the motivation of students, and although they found that the year did not influence the motivation levels of the participants, final-year university students were the most highly motivated toward EMI (Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018). Similarly, Chen and Kraklow (Reference Chen and Kraklow2014) applied MANOVA and found significant group differences in terms of both intrinsic motivation and English learning engagement between EMI and non-EMI students. In addition, Kamaşak et al. (Reference Kamaşak, Sahan and Rose2021) used ANOVA and MANOVA to explore group differences in language-related challenges, including students’ gender, field of study, year of study, first language background, prior EMI experience, and language proficiency test score. Their study revealed significant differences, with male students struggling more with listening, social sciences students facing greater challenges with reading and writing, second- and fourth-year students having greater reading-related challenges, Turkish students encountering more linguistic challenges than international students, and students with no prior EMI experience finding more challenges with writing, reading, speaking, and listening. Informed by these studies, we apply a one-way MANOVA as an additional statistical method to investigate group differences in our research. This enables us to examine the effects of gender on the perceived writing and speaking challenges of students in a Hong Kong EMI university. Achievements in different elements of English language skills are shown to be correlated (Stotsky, Reference Stotsky1983). In this chapter, we explore the mean differences between groups for a combination of perceived language challenges.

What Is MANOVA?

Multivariate ANOVA is a technique for measuring the linear effects of categorial variables on multiple dependent variables (D’Amico et al., Reference D’Amico, Neilands and Zambarano2001; Keselman et al. Reference Keselman, Huberty, Lix, Olejnik, Cribbie, Donahue and Levin1988). It differs from ANOVA (introduced in Chapter 2 of the book), which involves only one dependent variable. MANOVA is essentially an expanded form of ANOVA, and the main distinction between an ANOVA test and a MANOVA test is that an ANOVA test considers only one dependent variable (the measure that an experimenter is evaluating to see if it changes when other variables change), whereas a MANOVA test includes two or more dependent measures (Emerson, Reference Emerson2018). Moreover, in MANOVA, the independent variable is categorical while the dependent variable should be an interval or scale. Figure 12.1 illustrates the concepts that guide researchers to select different family members of ANOVA. In our case, gender is the independent variable, while writing and speaking challenges are the two dependent variables. Gender (male/female) is a categorical variable, while writing and speaking challenges are measured on an interval/scale. As no control variables are included in this study and we have one independent variable, a one-way MANOVA is most appropriate.

Figure 12.1 Considerations in using MANOVA/MANCOVA

Use of MANOVA in EMI Research in the University Context

Of late, MANOVA has increasingly been applied in research into EMI in universities. We searched for three phrases in the Google Scholar search engine: “MANOVA,” “university students,” and “English Medium Instruction.” Table 12.1 shows some of the recent studies that have used MANOVA in the context of EMI universities. Several were conducted in Taiwan (Chen & Kraklow, Reference Chen and Kraklow2014; Chou, Reference Chou2018), Turkey (Kamaşak et al., Reference Kamaşak, Sahan and Rose2021; Soruç et al., Reference Soruç, Altay, Curle and Yuksel2021; Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018), and Japan (Aizawa et al., Reference Aizawa, Rose, Thompson and Curle2020). Statistical Package for Social Sciences (SPSS) and R were the two most common software packages used in these studies. Most used several rounds of one-way MANOVAs to study the effects of independent variables on more than one dependent variable. Other future studies may extend to two-way or three-way MANOVA, two-way multivariate analysis of covariance (MANCOVA) or three-way MANCOVA. Unlike MANOVA, MANCOVA includes control variables. Table 12.1 reports research into the effects of different types of EMI instruction on the posttest academic achievements and motivation of university students, when controlling for pretest achievement. MANCOVA has two main advantages over MANOVA: (1) it can explain within-group variance and (2) confounding factors can be controlled for (George & Mallery, Reference George and Mallery2019).

Table 12.1 Review of the use of MANOVA in recent research in an EMI university context

Studies	Countries/region	Software used	Types of MANOVA	Independent variable(s)	Dependent variable(s)	Key results/findings
Chen and Kraklow (Reference Chen and Kraklow2014)	Taiwan	SPSS version 20.0	One-way MANOVA	Academic programs	1. Motivation 2. Engagement	Students in EMI programs had statistically significantly higher motivation and engagement than those who were in non-EMI programs.
Chou (Reference Chou2018)	Taiwan	SPSS version 23.0	One-way MANOVA	EMI context	Set A: speaking anxiety 1. Speech anxiety 2. A lack of confidence in speaking 3. Negative feelings in EFL context 4. Fear that mistakes are corrected by teachers 5. Fear of failing the EFL course Set B: speaking strategy use 1. Rehearsal strategies 2. Paraphrasing 3. Borrowing	Students in a full EMI context showed a significant difference in the five combined dependent variables of speaking anxiety from those in a partial EMI context. Students in a full EMI context showed a significant difference in the three combined dependent variables of speaking strategy use from those in a partial EMI context.
Kamaşak et al. (Reference Kamaşak, Sahan and Rose2021)	Turkey	N/A	One-way MANOVA	Several rounds of one-way MANOVA were conducted 1. Field of study 2. Year of study 3. First language background 4. Prior EMI experience	1. Writing challenges 2. Speaking challenges 3. Listening challenges 4. Reading challenges	Compared to engineering students, students studying social sciences found reading and writing English more challenging. Students in their second year and fourth year of study found reading more challenging than first-year freshmen. Students whose first language was Turkish showed a significant difference in all four components of language-related challenges from those whose first language was not Turkish. Differences were found in prior EMI experience on all four components of language-related challenges.
Turhan and Kirkgöz (Reference Turhan and Kırkgöz2018)	Turkey	SPSS version 23.0	One-way MANOVA	Year of study	General aspects of motivation toward EMI 1. Integrative motivation 2. Instrumental motivation 3. Motivation at the learner level 4. Motivation at the situation level Specific aspects for motivation toward EMI 1. Cognitive aspect 2. Affective aspect 3. Conative aspect	No significant differences in the general aspects of motivation toward EMI were found across year groups. Significant differences in a specific affective aspect of motivation toward EMI was found across year groups.
Aizawa et al. (Reference Aizawa, Rose, Thompson and Curle2020)	Japan	R software	One-way MANOVA	Common European Framework of Reference for Languages (CEFR) levels	1. Perceived ease of academic writing 2. Perceived ease of academic listening 3. Perceived ease of reading 4. Perceived ease of speaking	All of the four academic-related language challenges of writing, speaking, listening, and reading were significantly different across all CEFR levels.
Soruç et al. (Reference Soruç, Altay, Curle and Yuksel2021)	Turkey	R software	One-way MANOVA	CEFR levels	1. Perceived ease of academic writing 2. Perceived ease of academic listening 3. Perceived ease of reading 4. Perceived ease of speaking	Four language-related challenges differed significantly across the CEFR levels of university students of both international relations and electronic engineering.
Chu et al. (Reference Chu, Lee and Obrien2017)	Taiwan	R software	One-way MANOVA	Student type (international vs. local)	Ten language-related factors 1. Taiwanese students’ susceptibility to foreign cultures 2. School management responsiveness 3. Students’ ability to use English to communicate in class 4. Cross-cultural diversity within the college 5. Students’ perception of the internationalization of the college 6. Students’ barriers to social interaction 7. Language of communication used by the university 8. Students’ perception of campus facilities 9. Variety of food available on campus 10. Students’ perception of course content (p. 6).	A significant effect of student type on all ten factors was found. Four factors differed significantly among student types, namely, Taiwanese students’ susceptibility to foreign cultures, students’ ability to use English to communicate in class, students’ perception of course content, and school management responsiveness.

A wider range of independent variables than dependent variables was included in these studies. The independent variables can be classified into three types: learners’ characteristics (Chu et al., Reference Chu, Lee and Obrien2017; Kamaşak et al., Reference Kamaşak, Sahan and Rose2021; Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018); the context of study (Chen & Kraklow, Reference Chen and Kraklow2014; Chou, Reference Chou2018); and English language level (Aizawa et al., Reference Aizawa, Rose, Thompson and Curle2020; Soruç et al., Reference Soruç, Altay, Curle and Yuksel2021). Perceived ease, anxiety, and challenges of language use were the dependent variables most often studied (Aizawa et al., Reference Aizawa, Rose, Thompson and Curle2020; Chou, Reference Chou2018; Kamaşak et al., Reference Kamaşak, Sahan and Rose2021; Soruç et al., Reference Soruç, Altay, Curle and Yuksel2021). Affective aspects, such as motivation and engagement, were also investigated as dependent variables (Chen & Kraklow, Reference Chen and Kraklow2014; Turhan & Kırkgöz, Reference Turhan and Kırkgöz2018). However, MANOVA is seldom used to examine the effects of different types of EMI instruction on students’ learning outcomes and language-related challenges.

The findings of studies that applied MANOVA as the statistical technique reveal various patterns. For example, similarities can be identified in two studies focusing on Japan and Turkey (Aizawa et al., Reference Aizawa, Rose, Thompson and Curle2020; Soruç et al., Reference Soruç, Altay, Curle and Yuksel2021), as language-related challenges were found to differ across year groups in the EMI university settings of both. Researchers can compare the results across EMI studies of how groups differ in terms of the joint effects of the dependent variables. They can also compare results from different national contexts, as in the example of Japan and Turkey. We jointly examined the studies of Chen and Kraklow (Reference Chen and Kraklow2014) and Turhan and Kırkgöz (Reference Turhan and Kırkgöz2018) and found that both indicate that the year of study does not predict the motivation toward EMI but that EMI programs motivate and engage students in EMI instruction. Thus, the application of MANOVA enables the findings across studies to be combined.

Research Question and Hypothesis Formulation

The research question we address in this study is: How does gender predict language-related challenges in terms of writing and speaking?

Research Design and Methodology

A questionnaire survey was designed to measure demographic information and students’ perceived speaking and writing challenges in an EMI university. We recruited seventy-five university students from two Hong Kong universities who were studying science subjects such as engineering or natural science. Of these, twenty-four were female and fifty-one were male. A majority of students (n = 53) were in their first undergraduate year and the remainder (n = 22) were in their second year or above. A majority of the students (n = 69) reported that their first language was Cantonese, and the remainder reported Mandarin as their first language. Our preliminary results were reported in Pun and Jin (Reference Pun and Jin2021), while in this book chapter we aim to illustrate how we can incorporate one-way MANOVA into our study of the effects of gender on perceived writing and speaking challenges.

The scales of the questionnaire survey were adapted from the literature. The survey comprised 149 items, of which seven sought the respondents’ background information and the remaining 142 were self-assessment questions on the respondents’ experiences in EMI universities (Pun & Jin, Reference Pun and Jin2021). The questionnaire was reviewed by several education researchers and science lecturers (Pun & Jin, Reference Pun and Jin2021) and was piloted with forty-eight undergraduates (Pun & Jin, Reference Pun and Jin2021).

When conducting one-way MANOVAs, researchers commonly identify one categorial independent variable and two scale dependent variables. The aim of a one-way MANOVA is to identify the effects of the independent variable on both dependent variables, and the two dependent variables are conceptually related to each other (Pallant, Reference Pallant2020). In our case, the independent variable is gender, while the two dependent variables are perceived writing difficulty and perceived speaking difficulty. As mentioned in the previous section, these dependent variables are conceptually related (Cleland & Pickering, Reference Cleland and Pickering2006). Cronbach’s alpha was calculated to validate the reliability of the scale of perceived writing and speaking challenges, and the values are given in Table 12.2.

Table 12.2 Perceived writing and speaking difficulties

Components	Cronbach’s alpha	Example items
Perceived writing difficulty (five items)	0.832	How easy or difficult do you find planning long written assignments?
Perceived speaking difficulty (seven items)	0.835	How easy or difficult do you find participating in class discussions?

Taber (Reference Taber2018) suggested that a scale with a Cronbach’s alpha of 0.7 is considered reliable. Thus, the scales in Table 12.2 can reliably reflect perceived writing and speaking difficulties.

Assumptions of MANOVA

University students’ responses to the items were coded on a scale of 1 (very easy) to 5 (very difficult). An average score was computed for each dependent variable, with a higher score representing a higher level of perceived difficulty. A one-way MANOVA is appropriate because we are interested in how gender affects students’ perceived writing and speaking challenges. Table 12.3 provides a summary of the means and standard deviations of perceived speaking and writing difficulties for university students.

Table 12.3 Descriptive statistics of perceived speaking and writing difficulties of university students

	Gender	Mean	SD	n
Perceived speaking difficulties	Female	3.0417	0.43265	24
	Male	2.7115	0.55143	51
	Total	2.8171	0.53636	75
Perceived writing difficulties	Female	3.0250	0.52523	24
	Male	2.9137	0.72361	51
	Total	2.9493	0.66503	75

Pallant (Reference Pallant2020) suggested that researchers must ensure various assumptions are met before conducting a one-way MANOVA:

1. The sample size should be large and higher than the required number of cases per cell.
2. Univariate and multivariate normality should be ensured. Multivariate normality can be indicated by Mahalanobis distances.
3. Univariate and multivariate outliers should be checked.
4. Linear relationships between dependent variables should be ensured.
5. Each pair of dependent variables should have moderate correlations.
6. Homogeneity should exist in variance and covariance matrices.

In the following subsections, we describe the tests we conducted to ensure that these assumptions were met.

Sample Size

A large sample is important because it reduces the risk of violating other assumptions, such as univariate and multivariate normality. The smallest number of cases in each cell is two in our study, as there are two dependent variables. Therefore, we have a total of four cells (two levels of independent variable and two dependent variables). As the descriptive statistics reveal, the sample size is much greater than the required number of cases.

Univariate and Multivariate Outliers and Normality

The normality of the groups can be assessed in terms of their perceived writing and speaking challenges using Shapiro–Wilk tests (Pallant, Reference Pallant2020). The significance levels are 0.179 for females and 0.319 for males regarding perceived speaking challenges and 0.140 for females and 0.576 for males regarding perceived writing challenges. All of the significance values are greater than 0.05, indicating that the dependent variables are normally distributed across males and females. For univariate outliers, we checked the boxplot for outliers in terms of the perceived writing and speaking challenges, as represented by a circle or an asterisk at the end.

In Figure 12.2a, “case 46” is circled for the perceived writing challenges in the male group; in Figure 12.2b, “case 8” (female) and “case 40” (male) are both circled. These indicate three outliers. We must then consider whether these cases should be excluded or whether the variables should be transformed. Alternatively, we can decide whether to exclude these three cases after examining the multivariate outliers, as indicated in the following discussion.

Figure 12.2 Boxplots of (a) perceived writing challenges and (b) perceived speaking challenges

Multivariate outliers can be identified by calculating the Mahalanobis distance, which is a measure of the distance between a case and the centroid of the other cases (Tabachnick and Fidell, Reference Tabachnick and Fidell2013). Tabachnick and Fidell (Reference Tabachnick and Fidell2013) noted that the critical values at α = 0.001 vary according to the number of dependent variables. These values are typically listed in a chi-square table and can be read off according to the number of dependent variables. As we have two dependent variables in this study, the critical value is 13.82. The maximum value of the Mahalanobis distance obtained from IBM SPSS version 28 is 11.5, which is lower than the critical value. This may suggest an absence of multivariate outliers, but if there is only one case with a score higher than the critical value, then we should consider whether to include or exclude that case in the statistical analysis by assessing whether the difference between the critical value and the individual score is large (Pallant, Reference Pallant2020). The presence of multiple cases of multivariate outliers may require their removal from the dataset.

Linearity

A linearity test was conducted to assess whether a linear relationship exists between the two dependent variables of perceived speaking challenges and writing challenges. The scatter plot in Figure 12.3 shows that the relationship is linear, with an R² value of 0.211. As indicated in Table 12.4, the significance level in the column of deviation from linearity is greater than 0.05, confirming that the relationship between these two variables is linear.

Figure 12.3 Plot showing the relationship between perceived speaking and writing challenges

Table 12.4 ANOVA of linearity

			Sum of squares	df	Mean square	F	p
Speaking_Ability * Writing_Ability	Between groups	(Combined)	6.936	13	0.534	2.268	0.016
		Linearity	4.499	1	4.499	19.120	<0.001
		Deviation from linearity	2.438	12	0.203	0.863	0.587
	Within groups		14.352	61	0.235
	Total		21.288	74

Multicollinearity and Singularity

The dependent variables in MANOVA should be moderately correlated. French et al. (Reference French, Macedo, Poulsen, Waterson and Yu2008) argued that if the dependent variables are highly correlated, insufficient variance will remain. Too low a correlation, however, indicates that running MANOVA may be unnecessary, because a univariate analysis would suffice to identify the effects of the independent variable on each dependent variable. We ran a Pearson’s correlation between our two dependent variables of perceived writing and perceived speaking challenges. The Pearson’s correlation coefficient of r_p = 0.460 (p < 0.05) indicates that the two dependent variables are moderately correlated, and thus running MANOVA may yield meaningful results. However, a correlation in the range of 0.80 to 0.90 would be a concern (Pallant, Reference Pallant2020).

Levene’s Test of Equality of Error Variances

An important assumption of MANOVA is that the error variances should be equal. Levene’s test of equality of error variances in Table 12.5 indicates that no significant value was lower than 0.05. Therefore, equal error variances were assumed.

Table 12.5 Levene’s test of equality of error variances

		Levene statistic	Sig.
Speaking_Ability	Based on mean	2.019	0.160
	Based on median	2.335	0.131
	Based on median and with adjusted df	2.335	0.131
	Based on trimmed mean	1.960	0.166
Writing_Ability	Based on mean	2.596	0.111
	Based on median	2.422	0.124
	Based on median and with adjusted df	2.422	0.124
Based on trimmed mean	2.527	1	73

This tests the null hypothesis that the error variance of the dependent variable is equal across groups.

Homogeneity of Variance–Covariance Matrices

Box’s test of equality of covariance matrices reflects the homogeneity of variance–covariance matrices (Pallant, Reference Pallant2020). If the value of significance is larger than 0.001, the assumption of homogeneity of these matrices has not been violated (Pallant, Reference Pallant2020). The significance value in our study is 0.185, and thus the assumption of homogeneity of variance–covariance matrices is valid.

Multivariate Tests

The one-way MANOVA results are presented in Table 12.6. The four statistics that can be selected are Pillai’s trace, Wilk’s lambda, Hotelling’s trace, and Roy’s largest root. Tabachnick and Fidell (Reference Tabachnick and Fidell2013) suggested that if the data violate any of the MANOVA assumptions such as a large sample size, researchers can consider using Pillai’s Trace. Otherwise, Wilks’ Lambda should be generally used (Tabachnick and Fidell, Reference Tabachnick and Fidell2013). In the second “Gender” column the Wilks’ lambda row is in bold. Wilks’ lambda is consulted when conducting a one-way MANOVA (Pallant, Reference Pallant2020), and a significance level lower than 0.05 indicates a significant difference between the gender groups, while a level higher than 0.05 indicates no significant difference. Table 12.6 shows that the significance (“Sig.”) level is 0.037, and thus lower than 0.05. We can conclude from this that the students’ overall perceptions of language challenges were affected by their gender.

Table 12.6 Multivariate tests

Effect		Value	F	Hypothesis df	Error df	Sig.	Partial eta squared
Intercept	Pillai’s Trace	0.969	1143.926	2.000	72.000	<0.001	0.969
	Wilks’ Lambda	0.031	1143.926	2.000	72.000	<0.001	0.969
	Hotelling’s trace	31.776	1143.926	2.000	72.000	<0.001	0.969
	Roy’s largest root	31.776	1143.926	2.000	72.000	<0.001	0.969
Gender	Pillai’s Trace	0.087	3.444	2.000	72.000	0.037	0.087
	Wilks’ Lambda	0.913	3.444	2.000	72.000	0.037	0.087
	Hotelling’s trace	0.096	3.444	2.000	72.000	0.037	0.087
	Roy’s largest root	0.096	3.444	2.000	72.000	0.037	0.087

Tests of Between-Subjects Effects

Tests of between-subjects effects can provide researchers with a more detailed view of each dependent variable, as shown in Table 12.7. The perceived speaking and writing challenges are the dependent variables in this study. This test includes separate analyses of each dependent variable. To reduce the risk of a Type I error (claiming significant relationships when in fact there are none), we applied a stricter Bonferroni adjustment (Pallant, Reference Pallant2020). We divided our original alpha level of 0.05 by 2, thus obtaining a new alpha level of 0.025. Our results can be considered as significant only if the significance level falls below 0.025. As indicated by the bold text in Table 12.7, perceived speaking challenges have a significance level of 0.012, while perceived writing challenges have a significance level of 0.503. This indicates that gender has a statistically significant effect on perceived speaking challenges (F (2, 75) = 6.658; p < 0.025; partial η² = 0.084) but not on perceived writing challenges (F (2, 75) = 0.454; p = 0.503; partial η² = 0.006). The partial eta squared of the effects of gender on perceived speaking challenges is 0.084, which indicates a moderate effect size.

Table 12.7 Tests of between-subjects effects

Source	Dependent variable	Type III sum of squares	df	Mean square	F	Sig.	Partial eta squared
Corrected model	Speaking	1.779	1	1.779	6.658	0.012	0.084
Corrected model	Writing	0.202	1	0.202	0.454	0.503	0.006
Intercept	Speaking	540.172	1	540.172	2021.253	<0.001	0.965
Intercept	Writing	575.581	1	575.581	1291.835	<0.001	0.947
Gender	Speaking	1.779	1	1.779	6.658	0.012	0.084
Gender	Writing	0.202	1	0.202	0.454	0.503	0.006
Error	Speaking	19.509	73	0.267
Error	Writing	32.525	73	0.446
Total	Speaking	616.510	75
Total	Writing	685.120	75
Corrected total	Speaking	21.288	74
Corrected total	Writing	32.727	74

Multiple Comparisons

Table 12.8 shows the pairwise comparisons between females and males for the perceived speaking and writing challenges. The mean scores are significantly different for males and females for perceived speaking challenges (p < 0.05) but not for perceived writing challenges (p = 0.503). The mean score for perceived speaking challenges is 0.333 higher for females than for males. Although the mean of perceived writing challenges for females is higher than the mean of perceived writing challenges for males by 0.111, the difference is not significant.

Table 12.8 Pairwise comparisons

Dependent variable	(I) Gender	(J) Gender	Mean difference (I–J)	Std. error	Sig.	95 percent confidence interval for difference
Dependent variable	(I) Gender	(J) Gender	Mean difference (I–J)	Std. error	Sig.	Lower bound	Upper bound
Speaking	Female	Male	0.330*	0.128	0.012	0.075	0.585
Speaking	Male	Female	−0.330*	0.128	0.012	−0.585	−.075
Writing	Female	Male	0.111	0.165	0.503	−0.218	0.441
Writing	Male	Female	−0.111	0.165	0.503	−0.441	0.218

Note: p < 0.05.

Writing up the Results

The following paragraph indicates how we wrote up the results according to the format suggested by Pallant (Reference Pallant2020):

A one-way MANOVA was conducted to investigate the effects of gender on perceived language challenges. Two dependent variables were studied with this technique: perceived writing challenges and perceived speaking challenges. The independent variable was gender. Several assumptions were met before conducting MANOVA: univariate and multivariate normality; linearity; outliers; multicollinearity; and homogeneity of variance–covariance matrices. The results show that males differed significantly from females on the combined dependent variables, F (2, 75) = 3.444, p = 0.037, Wilks’ lambda = 0.913, partial η² = 0.087. If we analyze the dependent variable separately, with reference to the Bonferroni adjusted alpha level of 0.025, we only find a significant difference between males and females on perceived speaking challenges, F (1, 75) = 6.658, p = 0.012, partial η² = 0.084. A closer examination of the mean score shows that females (M = 3.042, SD = 0.34) reported a higher perceived challenges in speaking English than males (M = 2.71, SD = 0.55).

The following components should be included when writing up the results: (1) the F-value; (2) the p-value; (3) the partial eta squared; and (4) Wilks’ lambda. These give a good indication of whether the independent variable has a significant effect on the interaction between the two dependent variables.

Discussion of the Use of MANOVA in EMI Research Studies

One major challenge of MANOVA is that many complex assumptions need to be met. EMI researchers may therefore ask why we cannot simply perform a series of ANOVAs to test the effect of an independent variable on several dependent variables. MANOVA has the advantage of reducing the likelihood of committing a Type I error, which involves claiming that relationships are significant when in fact there are no significant relationships. MANOVA is also more powerful than ANOVA, in the sense that researchers can examine the effects of an independent variable on a linear combination of outcome variables (Keselman et al., Reference Keselman, Huberty, Lix, Olejnik, Cribbie, Donahue and Levin1988). The two outcome variables should be moderately correlated. In this study, we are interested in how gender affects the linear combination of perceived writing and speaking challenges. The two challenges are conceptually related (Cleland & Pickering, Reference Cleland and Pickering2006) as they are both types of perceived linguistic challenges.

In addition to examining the outputs of multivariate tests, researchers should also examine the outputs of the test of between-subjects effects and pairwise comparisons. Although we have a combined effect on the dependent variables in our study, there is no statistically significant effect on perceived writing challenges. Thus, this approach enables a finer examination of the effects of gender on each dependent variable. Further insights can be drawn from post hoc pairwise comparisons.

In summary, a one-way MANOVA can be extremely useful when examining the effects of a categorical variable on various dependent variables, including the perceived challenges, motivation, or language competence of EMI university students. The discussions in this chapter suggest that future research into the use of MANOVA for analyzing results from questionnaire surveys is required. In addition, MANCOVA should be considered if variables are included as covariates.

13 Using a Longitudinal Quantitative Design to Investigate Student Transition from Secondary School to EMI Higher Education

Introduction

In recent years, the learning experience of students transitioning from first language (L1)-medium secondary schools to English Medium Instruction (EMI) tertiary education has attracted growing scholarly attention (e.g., Evans & Morrison, Reference Evans and Morrison2011, Reference Evans and Morrison2016; Lin & Morrison, Reference Lin and Morrison2010; Zhou & Rose, Reference Zhou, McKinley, Rose and Xu2021; Zhou & Thompson, Reference Zhou, Thompson and Zhou2023). Students might experience tremendous difficulties in their first experiences listening to EMI academic lectures, and their perceptions of difficulties may change during the first semester – a critical transition period reported to shape students’ perceptions, beliefs, and behaviors (Ding & Stapleton, Reference Ding and Stapleton2016; Zhou et al., Reference Zhou, Thompson and Zhou2023). Although the importance of transitional issues has been well recognized in EMI research (Macaro et al., Reference Macaro, Curle, Pun, An and Dearden2018), very few studies have used a longitudinal quantitative design to provide an overview of changes in students’ learning patterns during transition.

In this chapter, a case study will be presented that draws on a longitudinal quantitative design to investigate students’ perceptions of lecture listening difficulties during their first semester transitioning into an EMI university in China. Ortega and Iberri-Shea (Reference Ortega and Iberri-Shea2005) elucidated four criteria that qualify a study as longitudinal: first, the duration of the study; second, the utilization of multiwave data collection; third, the conceptual emphasis on capturing change through design; and fourth, the emphasis on creating antecedent–consequent relationships through continuous monitoring of the phenomenon within its environment, instead of relying on experimental controls or comparisons. Echoing this definition of a longitudinal design, the study collected data from the same group of students on their perceptions of lecture listening difficulties across multiple time points (i.e., the beginning, halfway, and the end) during the transition semester and analyzed patterns of change through statistical tests. Such a design can broaden the methodological scope of extant EMI research that investigates students’ learning experience and may be especially applicable to situations where changes in perceptions or behaviors are likely to occur such as the transition period.

The rationale underpinning the longitudinal quantitative design of the study stems from two underresearched issues in extant EMI literature. First, although academic challenges in EMI settings have been well documented in the past two decades (e.g., Aizawa & Rose, Reference Aizawa and Rose2020; Evans & Morrison, Reference Evans and Morrison2016; Hellekjær, Reference Hellekjær2010; Kim & Yoon, Reference Kim and Yoon2018; Kırkgöz, Reference Kırkgöz2005; Zhou et al., Reference Zhou and Rose2021), studies adopting a longitudinal design are scarce. Data have usually been collected through cross-sectional designs to illustrate learning difficulties at a particular time point for a selected cohort of students, for example, first-year undergraduate students (e.g., Aizawa & Rose, Reference Aizawa and Rose2020; Evans & Morrison, Reference Evans and Morrison2016), undergraduate students from mixed year levels (e.g., Kim & Yoon, Reference Kim and Yoon2018; Kırkgöz, Reference Kırkgöz2005), or a combination of undergraduate and postgraduate students (e.g., Hellekjær, Reference Hellekjær2010; Zhou et al., Reference Zhou and Rose2021). As such, these studies can only offer a one-shot description of learning issues at a fixed time point, whereas little is known about whether and how students’ perceived difficulties evolve as they adapt to studying via EMI. Especially for students who just arrive at an EMI university, the usually adopted one-off surveys are unable to capture the dynamics and complexities of their transitional learning experience (see Evans & Morrison, Reference Evans and Morrison2011).

Another issue relatively underexplored in EMI research is how students entering with different English proficiency levels may vary in their transitional experience. It has been found that students from EMI and L1-medium secondary schools reported different challenges when listening to EMI lectures at a tertiary level of education (e.g., Aizawa & Rose, Reference Aizawa and Rose2020; Evans & Morrison, Reference Evans and Morrison2016). Since students coming from EMI secondary schools might be more proficient in English than their peers from L1-medium schools, their English proficiency upon entry might be a covariate (or confounding variable) that explains the differences in the perceptions of challenges identified in these studies. Hence, to disentangle the respective trajectories of students’ lecture listening associated with their level of English proficiency during transition, the study aims to answer the following questions:

1. What are the difficulties experienced by students in listening to lectures upon entering an EMI university?
2. Are there changes in students’ perceptions of listening difficulties during their first semester transitioning into the university?
3. To what extent is students’ English listening proficiency associated with their longitudinal patterns of change in listening difficulties during transition?

In what follows, the chapter will introduce a longitudinal quantitative study that aims to answer these research questions. The overall design of the study will first be outlined, followed by a detailed description of procedures for implementing the study. Important issues such as sampling, the design of measures, and the selection of statistical tests will be discussed, detailing the considerations and rationale behind each methodological decision. Key dilemmas such as handling attrition and missing data will be illustrated, while sharing suggestions for addressing these problems . The chapter will conclude with the methodological implications of the study for future EMI research and offer recommendations informed by the methodological limitations of the study. This case study focuses on the methodological design of the study, and findings are reported separately in Zhou and Rose (Reference Zhou and Rose2024).

A Case Study on EMI Longitudinal Quantitative Research

Research Design and Context

The study adopts a longitudinal quantitative design, collecting data at the beginning (Week 2, Time point 1/T1), halfway (Week 8, Time point 2/T2), and the end (Week 13, Time point 3/T3) of students’ first semester after entry into an EMI transnational university in southeast China. A transnational university was selected because it involves a higher percentage of English use in classroom instruction than other types of EMI programs (McKinley et al., Reference McKinley, Rose and Zhou2021; Rose et al., Reference Rose, McKinley, Xu and Zhou2019), which might result in more language-related difficulties for local students. This type of university is also commonly known as 中外合作办学 (zhong wai he zuo ban xue) (De Costa et al., Reference De Costa, Green-Eneix and Li2021), which is jointly run by a Chinese and a foreign institution in terms of management but the programs are mainly offered to Chinese students (State Council, 2003). As one of the first joint-venture EMI universities in China, the university also has a more established curriculum compared to other emerging institutions, thus attracting and intaking a large number of new undergraduate students annually. The large student population made it easier for recruiting a relatively large sample for this study.

Upon admission into the university, all students are required to take the Oxford Online Placement Test (OOPT) and are streamed into three bands of English for Academic Purposes (EAP) classes (high level, standard, foundation) based on the test results. In addition, students in the first academic year can sign up for two to three major-specific EMI courses. The EMI lectures, which usually take place in auditoriums for 200+ students, are organized by topic and delivered by academic staff from different academic faculties from a variety of countries. This study sampled from three EMI courses of business, communications, and linguistics, which belong to the faculties of business and humanities and social sciences. These three courses were chosen because they tend to involve lengthy verbal explanations of the teachers that focus on conceptual understanding, where students may encounter more language-associated difficulties than in subjects heavily dependent on scientific and mathematical formula.

Sampling

Representativeness of the Sample

In quantitative research, an important issue to consider regarding sampling is its representativeness of the target population. Whereas probability sampling such as random sampling might be ideal in ensuring the representativeness, it is sometimes less practical in applied linguistics research due to contextual limitations for data collection (Dörnyei, Reference Dörnyei2007). Following a complete collection sampling strategy (Cohen et al., Reference Cohen, Manion and Morrison2017), all first-year students enrolled in the three EMI courses were invited to participate in the study. Since this strategy is a nonprobability sampling strategy, it is important to examine to what extent the sample represents the target population. Key demographic variables were compared between the sample (n = 412) and the population (N = 1,473), that is, all Chinese first-year students enrolled in 2019 majoring in business and humanities and social sciences. As Table 13.1 summarizes, the sample and the population share similar ratios in key demographic variables, such as gender, major, and EAP class level, indicating good representativeness of the sample.

Table 13.1 Comparison between sample and population on key demographic variables

Variable		Sample (n = 412)		Population (N = 1,473)
		Count	%	Count	%
Gender	Male	99	24	296	20
	Female	313	76	1,177	80
Major	Business	267	65	1,057	72
	Humanities and social sciences	145	35	416	28
EAP class level	High level	65	16	157	11
	Standard	294	71	1,077	73
	Foundation	53	13	239	16

Dealing with Attrition in Longitudinal Sampling

Another important consideration for the sampling of longitudinal quantitative research is the issue of attrition, which may result in a shrinkage of the initial sample. Regarding this challenge, the first step taken was to control for the dropout rate and encourage participation over time, including multiple measures outlined as follows:

The distribution of the survey at all time points was facilitated by the course coordinators both in class and via emails after class.
At T1, the researcher promised to send a personal customized term report to students individually, requiring complete survey data at all time points.
The researcher traced the responses every time after survey and sent emails to the missing participants to encourage further participation.
Those students who completed all three surveys were invited to participate in a lucky draw as an incentive for their responses.

The measures seemed to be effective: From T1 to T2, the attrition rate was 16.5 percent, and from T2 to T3 it was a further 6.8 percent, totaling a 23.3 percent of attrition rate throughout the duration of data collection. This led to a final dataset of 316 students, who completed the survey at all three time points.

Another key challenge to address during data collection and analysis was handling the missing data resulting from attrition. According to Graham (Reference Graham2009), data can be “missing completely at random” (MCAR), “missing at random” (MAR), or “missing not at random” (MNAR). MCAR refers to the scenario where missing cases can be considered as a random subset of the complete sample, where each case has the same probability of being missing, and any pattern that exists in the complete sample can be consistently estimated from the available sample. Compared to MCAR, MAR is a more reasonable assumption for real-life data, which allows the probability of a case being missing to depend on the predictor variables but not on the outcome variable or unobserved predictor variables. Both MCAR and MAR, as Graham (Reference Graham2009) notes, yield unbiased parameter estimates in the statistical analysis, which means that the impact of these two types of attrition is not substantial. However, one needs to be careful with the case of MNAR. MNAR refers to the assumption that the probability of a case being missing may vary depending on both observed and unobserved variables. This can result in biased parameter estimates if data are systematically missing with a pattern that is not accounted for in the analysis. For example, if students with more listening difficulties were less likely to complete the survey, then the survey results would be biased toward students with fewer listening difficulties and therefore not representative of the full population. However, as Cox et al. (Reference Cox, McIntosh, Reason and Terenzini2014) point out, there is no straightforward method to confirm whether the data are missing at random. Key demographic variables were compared between the missing and remaining sample through a Pearson’s chi-square test. As can be seen from the test results in Table 13.2, the missing sample (n = 96) was not significantly different from the remaining sample in gender, major, EAP class level, or study/living abroad experience. This indicates that the composition of the remaining sample (n = 316) was similar to the original sample at T1 (n = 412) demographically.

Table 13.2 Pearson’s chi-square test on demographics for the remaining and missing sample

Variable	Group	Remaining sample (n = 316)		Missing sample (n = 96)		χ²	df	p
		Count	Expected	Count	Expected
Gender	Female	246	240.1	67	72.9	2.62	1	0.106
	Male	70	75.9	29	23.1
Age	18	248	248.5	76	75.5	0.04	2	0.981
	19	52	51.4	15	15.6
	Other	16	16.1	5	4.9
Major	Business	209	204.8	58	62.2	1.06	1	0.304
	Humanities and social sciences	107	111.2	38	33.8
EAP class level	High level	50	49.9	15	15.1	0.70	2	0.704
	Standard	223	225.5	71	68.5
	Foundation	43	40.7	10	12.3
Study/living abroad experience	Never	275	273.0	81	83.0	0.44	1	0.507
Study/living abroad experience	Yes (<6 months)	41	43.0	15	13.0

Note: Fisher’s exact test is interpreted for p-value when the expected count is below 5.

Finally, within the dataset collected at the same time point, missing data still existed where some participants selectively responded to only some subsections of the survey. In this regard, the decision to apply listwise deletion instead of pairwise deletion was exercised in running statistical tests using SPSS. This decision was made based on the following considerations. First, the final sample (n = 316) was large enough to support all statistical analyses, even when cases including missing responses to certain variables were deleted. Second, compared to pairwise deletion, the listwise deletion approach may result in less biased statistics derived from a variance–covariance matrix, for example, the regression coefficients (Cox et al., Reference Cox, McIntosh, Reason and Terenzini2014). Third, as Allison (Reference Allison2002, p.7) argues, “whenever the probability of missing data on a particular independent variable depends on the value of that variable (and not the dependent variable), listwise deletion may do better than maximum likelihood or multiple imputation.” In consideration of these aspects, listwise deletion was considered a suitable method for handling missing data in the study.

Designing Measures

Measures used in the study include (1) students’ listening scores in the OOPT as a measure of their English listening proficiency upon entry and (2) the “Difficulties in listening to EMI lectures questionnaire.”

Using OOPT Score as a Measure of English Listening Proficiency

The use of the test score of OOPT as a measure of students’ English listening proficiency was supported by several reasons. First, as a placement test compulsory for all newly enrolled first-year students, the score offers a unitary measure of English proficiency for all participants. Second, the test includes an independent section on “listening,” in addition to a more general measure on the “use of English,” making it possible to delineate a more skill-specific proficiency score. The test scores are also benchmarked to the Common European Framework of Reference for Languages levels, enabling an easy comparison to other studies conducted in different EMI settings worldwide. Third, the test has been piloted with thousands of test takers and is continually validated (see Oxford University Press, 2020). It has also been already used in previous EMI research (Aguilar & Muñoz, Reference Aguilar and Muñoz2014). Finally, the score serves as a criterion for streaming students into different EAP classes in the university where the data were collected. Thus, the results of the study can easily translate into practical pedagogical suggestions to inform language support in the participating university.

Designing the “Difficulties in Listening to EMI Lectures Questionnaire”

The “Difficulties in listening to EMI lectures questionnaire” was designed through an initial examination of literature and a pilot to improve the face and content validity, followed by a principal component analysis (PCA) to check construct validity and a test of reliability. In the first step of questionnaire development, three main issues were considered, namely questionnaire item writing, the scale of data, and questionnaire formatting. After that, the questionnaire was piloted and checked for construct validity and reliability.

Questionnaire item writing. Based on Cohen et al.’s (Reference Cohen, Manion and Morrison2017) suggestions on questionnaire design, the development of questionnaire items in this study observed two principles: (1) The items should be clear on the purpose to measure the target construct to answer the research questions and (2) the items should exhaustively cover the aspects of the target construct to be measured. As such, a thorough review of literature was conducted to conceptualize and define the construct that the questionnaire intended to measure, while synthesizing items from different studies to ensure the measurement comprehensiveness of the questionnaire. The pool of items brought together two strands of research, namely, studies on listening difficulties in EMI university contexts (e.g., Evans & Morrison, Reference Evans and Morrison2016; Hellekjær, Reference Hellekjær2010) and studies on second language (L2) listening difficulties (e.g., Chen, Reference Chen2013; Goh, Reference Goh2000). The compiled questionnaire thus included items both exploring general macro-level EMI listening challenges (e.g., “I find it difficult to follow teachers’ line of thought”) and the micro-level issues aligned with L2 listening cognitive stages as outlined in Anderson’s (Reference Anderson2015) listening model (e.g., “Sometimes I do not recognize words I know while listening to EMI lectures”). The goal of combining items from the two strands of research was to comprehensively capture students’ perceptions of EMI lecture listening problems to improve the content validity of the questionnaire.

Deciding the scale of data. A typical scale for investigating the degree and intensity of perception is the Likert scale (Cohen et al., Reference Cohen, Manion and Morrison2017). The questionnaire adopted a 6-point Likert scale, ranging from “strongly agree” (6) to “strongly disagree” (1). A decision was made to exclude a middle point “neither agree nor disagree” because, as some researchers have warned (e.g., Krosnick & Presser, Reference Krosnick, Presser, Marsden and Wright2010), such a middle point may not genuinely reflect what the participants perceive but might be chosen for reasons such as self-protection, ambivalence in understanding, or simply little care. In China, given the cultural influence of Confucianism, students may opt for the middle point as a “safe choice” for self-protection. Hence, the middle point was removed from the scale to promote students’ conscious decision in choosing a more appropriate descriptor for their perceptions.

Questionnaire formatting and piloting. After developing the items and deciding the rating scales, the questionnaire was drafted according to Cohen et al.’s (Reference Cohen, Manion and Morrison2017) guidelines to adjust the length, formatting, and layout. To improve the face validity of the questionnaires, four experts in applied linguists specializing in EMI and L2 listening were invited to check the questionnaire items and offer suggestions for revision. After that, the questionnaire was piloted with 122 first-year students from the same EMI university, among which five were invited to attend a follow-up interview. During the interviews, students were asked to provide feedback on three aspects: (1) the length and layout of the questionnaire; (2) the wording clarity of questionnaire items and their interpretation of the items; and (3) the sequence of items on the questionnaire. The questionnaire was then revised based on the feedback of the pilot and was ready for distribution for the main data collection.

Checking for construct validity and reliability. After the pilot, the questionnaire was checked for its construct validity using PCA with a sample of 412 students at T1. PCA was chosen because the goal of the analysis was to reduce the number of items so that a maximum amount of variance could be explained by fewer items (Field, Reference Field2013). Prior to the analysis, results from the Bartlett’s test of sphericity (p < 0.001) were checked to ensure that the questionnaire items were correlated to a certain degree so that they could be summarized with a few factors. Kaiser–Meyer–Olkin test results fell between 0.8 and 0.9, which, according to Hutcheson and Sofroniou (Reference Hutcheson and Sofroniou1999), suggests that the sample was satisfactorily adequate for conducting factor analysis. In addition, the communalities table was checked to ensure that no item had a communality value less than 0.40, indicating that a sufficiently high proportion of the variance of each item in the questionnaire was explained by the extracted factor(s). When running the analysis, main statistical issues considered included the extraction of factors and the rotation method. For factor extraction, the scree plot and the eigenvalues of factors served as main indicators for the number of factors to be retained. Kaiser (Reference Kaiser1960) suggested extracting factors with an eigenvalue above one. On top of that, Cattell (Reference Cattell1966) argued that the number of points above the inflexion point of the curve on the scree plot suggests how many factors should be extracted. Based on these two criteria, one factor was extracted that accounted for 54.2 percent of the total variances. Subsequently, an orthogonal rotation (varimax) method was chosen to increase factor loadings for a small number of items so that important items for the factor could be identified (Field, Reference Field2013). The final questionnaire included twelve items, and Cronbach’s alpha was calculated as 0.92, indicating that the questionnaire was highly reliable.

Data Collection and Analysis

The finalized questionnaire was distributed in Week 2 (T1), Week 8 (T2), and Week 13 (T3) through the online survey platform “Qualtrics.” Students were asked to fill in the questionnaires immediately after EMI lectures to enhance the accuracy and ease for recalling the listening difficulties they experienced (Cohen & Macaro, Reference Cohen and Macaro2007). After collecting data at T3, the datasets from all three time points were imported into SPSS 25.0 directly from “Qualtrics” to avoid inaccurate data entry. The three datasets were then merged into a longitudinal set of data inclusive of 316 students who completed the survey at all three time points. The repeated responses of each student were matched via a unique ID number. Before data analysis, all personally identifiable information was removed and the dataset was entirely anonymized to strictly observe the ethical regulations of the university.

When the longitudinal data were ready for analysis, statistical tests were selected to answer each research question. For Research Question 1, descriptive statistics such as the mean, standard deviation, and minimum and maximum values were calculated to outline students’ perceptions of lecture listening difficulties at the outset of the semester. Research Question 2 explores the longitudinal pattern of change in students’ listening difficulties. On top of that, Research Question 3 further investigates whether this pattern of change varied according to students’ English listening proficiency level upon entry. A two-level multilevel modeling analysis was conducted, holding student as the grouping variable. Prior to the analysis, the assumptions for multilevel modeling were checked including linearity, normality of random effects (normality of residuals at level 1), normality of the full model (normality of residuals at level 2), and homoscedasticity. In Model 1, variance in the listening difficulties (outcome variable) was tested with time as the level 1 predictor variable. In Model 2, variation in the observed rate/slope of change was tested by using individuals’ English listening proficiency as the level 2 predictor variable.

In a review of L2 longitudinal quantitative research, Barkaoui (Reference Barkaoui2014) summarizes a variety of statistical tests to capture change over time, such as repeated measures univariate and multivariate analysis of variance, multilevel modeling, and latent growth curve modeling. In this study, a multilevel modeling analysis was chosen because, first, the study aims to examine average change patterns across different time points instead of intra-individual variations, hence excluding statistical tests such as latent growth curve modeling (Collins, Reference Collins2006). Second, where a between-subjects factor is incorporated in the model, analysis methods such as mixed ANOVA require the factor to be categorical instead of continuous data such as the listening proficiency score in this study. In fact, Barkaoui (Reference Barkaoui2014) has pointed out that when observations gathered on an individual over time are nested within the person, multilevel modeling can be a useful approach to pursue. Hence, the use of multilevel modeling allows for illustrating an overall trend of change in listening difficulties to identify any significant main effect of time. Meanwhile, discrepancies in the change patterns among students entering with different listening proficiency scores could also be revealed by examining the interaction effect between time and proficiency.

Conclusion: Methodological Implications and Suggestions

The chapter introduces a longitudinal quantitative study design that investigates change in students’ perceptions of lecture listening difficulties during the critical transition into EMI higher education. It outlines the key procedures for designing and implementing the study, including decisions about research context and sampling, the development of measures, and the selection of appropriate statistical tests for data analysis. Approaches for reducing attrition and handling missing data in longitudinal quantitative research were described, and the strategies for mitigating these problems were also shared. The longitudinal quantitative design of the study could be applied to future EMI research that aims to (1) explore the longitudinal change in students’ learning analytics such as perceptional, behavioral, and motivational changes and (2) examine how the shape of change over time is associated with individual difference variables.

Some methodological limitations of the study should be noted, which could inform the design of future longitudinal quantitative research in EMI contexts. First, due to the outbreak of COVID-19 at the end of 2019, the study only collected data from the first semester via three time points. Future research can extend the time frame and collect data from more time points to capture the change over a longer duration such as an academic year. Second, the study only sampled from disciplines of social sciences and humanities (commonly regarded as “soft” subjects), which limits the capacity of the generalization of the findings. It would be interesting for future research to also sample from “hard” subjects (e.g., physics, engineering, statistics) that depend much more on mathematical formulaic expressions or graphical illustrations, because students’ perceptions of lecture listening difficulties could vary substantially from those reported in this study. Finally, in longitudinal designs where the same measures are repetitively administered, the “practice effect” should be cautioned against (Barkaoui, Reference Barkaoui2014). In other words, the observed changes might be initiated by increased familiarity with the measurement instead of authentic changes in the constructs being measured. Future research with a longer duration can consider lengthening the time interval between different time points of data collection to reduce participants’ perceived familiarity with the research instrument to lower chances of the “practice effect.”

14 A Corpus-Based Multidimensional Analysis of EMI University Classroom Discourse The Case of Singapore

Introduction

Globally, the use of English as the medium of instruction in various educational contexts has grown exponentially in the last decade. The majority of studies in the field concentrate on issues relating to the implementation of English Medium Instruction (EMI) programs, teachers’ and students’ beliefs, challenges, and first language (L1) use (Curle et al., Reference Curle, Jablonkai, Mittelmeier, Sahan, Veitch and Galloway2020; Ismailov et al., Reference Ismailov, Chiu, Dearden, Yamamoto and Djalilova2021; Jablonkai & Hou, Reference Jablonkai and Hou2021; Macaro et al., Reference Macaro, Curle, Pun, An and Dearden2018). However, there has been very little research into the actual language used in EMI classrooms to date. The few existing studies in this area have highlighted a number of relevant language-related issues, for example, identifying language and discourse functions in EMI classroom discourse, determining learning outcomes in terms of discipline-specific communicative competence, and investigating context and discipline-specific language use (Dalton-Puffer & Smit, Reference Dalton-Puffer and Smit2013; Macaro, Reference Macaro2019). The analysis of variation in EMI language use can reveal contextual and disciplinary differences in information and discourse organization that might be the result of a pedagogical approach or academic culture. Such refined understanding of linguistic features and variation can then be translated into pedagogical practices or can inform the support systems of EMI contexts. In her most recent work, Jablonkai (Reference Jablonkai, Pun and Curle2021) argues that a corpus-based approach is well placed to focus on such aspects of language use in EMI. This chapter demonstrates how a specific corpus-based analytical framework, Biber’s (Reference Biber1988) multidimensional (MD, henceforth) analysis, can be applied to explore variation in language use in an EMI context.

Singapore as an EMI Environment

English is the medium of instruction at the National University of Singapore, and it is one of the nation’s four official languages (English, Mandarin, Malay, and Tamil). The promotion of English as first the working language and later as a first language was “to ensure that Singapore would be plugged into the global economy” (Iswaran, Reference Iswaran2009, as cited in Hanington & Renandya, Reference Hanington, Renandya, Park and Spolsky2017). English as an official language entails that it is one among the three others that can be used in all official situations and especially in government-related contexts and service fronts. The choice of language is decided upon by the user(s) in context. To adopt English as a first language reflects the choice that parents may make within the family if they are effectively bilingual in English and one other language, typically one of the official languages mentioned earlier. At some point, English was promoted within the family and was also chosen as a medium of instruction as it was the lingua franca that could channel the Singaporean workforce into the global economic stage with its Western-centric technologies.

At the National University of Singapore, English language proficiency levels represented across the local and international undergraduate population (including students from China, India, Indonesia, and the European countries whose first language is not English) range from IELTS 6 to 9. In a general sense, local students tend to represent the 7–9 strand while international students are more likely to be within the 6–7 strand. This score range is roughly equivalent to a B2 and above in the Common European Framework of Reference for Languages. This indicates that proficiency levels correlate with an effective command of the language despite some inaccuracies, inappropriate usage, and misunderstandings accompanied by the ability to use and understand a fairly complex language, particularly in familiar situations to the students, where they have full operational command of the language (https://takeielts.britishcouncil.org/teach-ielts/test-information/ielts-scores-explained). By the end of a freshman’s second semester at the university, most students would have read an academic communication module (forty-eight-hour module) that prepares them with basic academic reading and writing competencies.

Another important aspect specific to an EMI environment, and in Singapore as well, is teachers’ proficiency levels in English. Due to space limitations, no further discussion about that is given here, yet it is important to mention that there is a range of local and foreign talent in the teacher group represented in the Singapore EMI corpus (SEMIC) used for this study.

The Conceptual Framework to Investigate Variation in Language Use

Sociolinguistic studies on the relationship between language use and social context (Hymes, Reference Hymes and Giglioli1972) inspired scholars interested in language variation and register studies (e.g., see Biber and Finegan, Reference Biber and Finegan1994). The motivation for such studies is rooted in the theory that language forms have communicative functions and, therefore, are used differently in diverse social contexts depending on the communicative goals, hence fulfilling particular functions in that context (Biber, Reference Biber1988).

While many studies have described these differences using qualitative measures, Biber (Reference Biber1988) developed new methodologies affording large-scale investigations of patterns of language use with quantitative, corpus linguistic, approaches resulting from sizable and principled collections of naturally occurring texts (a corpus) (Csomay & Crawford, Reference Csomay and Crawford2024). He built a representative sample of language use from a variety of spoken and written contexts, where context equates a “speech situation” (Biber, Reference Biber1988, p. 28) whether in the spoken or written mode. The first phase of Biber’s analytical procedures involves a detailed situational analysis, and the second phase includes linguistic analyses based on quantitative linguistic measures, followed by a detailed qualitative analysis interpreting the relationships between context and language use as the broadly defined dimensions of linguistic variation are determined.

Applying this analytical framework, studies reflect three types of register (functional) analyses. The first type includes a number of speech situations from different, broadly defined, registers (e.g., news, academic prose, fiction, conversation) and describes language variation across these registers. For example, the language of university classrooms is compared to that of face-to-face-conversation and academic prose (Biber et al., Reference Biber, Conrad, Reppen, Byrd and Helt2002; Csomay, Reference Csomay2006). The second type examines variation within only one register. For example, Biber et al. (Reference Biber, Conrad, Reppen, Byrd, Helt, Clark, Cortes, Csomay and Urzua2004) and Biber (Reference Biber2006) examine the university as a stand-alone register, where a variety of speech situations within that one context was collected, such as textbooks (including many disciplines), dialogues at service encounters, and university classrooms (including texts from a variety of disciplines, levels of instruction, and class sizes). Finally, the third type, instead of including a variety of speech situations in a particular context, looks at language variation within one single speech situation in a given context, such as university classrooms within the academic context. Studies in this area describe the linguistic characteristics of that one speech situation in itself and also look for relationships between language variation and discourse structure and how different disciplines structure their discourse differently (Csomay, Reference Csomay2005). In each type of study, the underlying analytical framework is the same, a factorial, Multidimensional (MD) analysis, and the goal is to describe variation based on the co-occurring patterns of a large number of linguistic features. This chapter exemplifies the third type of studies where we look at university classrooms alone and describe language variation through their linguistic characteristics. As such, this type of MD analysis is well suited to identify features of language use in EMI classroom discourse in specific academic contexts.

MD Analysis

Conceptually, the primary goal of an MD analysis within register studies is to describe linguistic variation in a register based on a large set of co-occurring linguistic features using multivariate factor analytical techniques. There are two types of MD analyses: (1) a novel MD analysis, which involves running a new factor analysis to help identify dimensions of variation that had not been identified before (see Csomay, Reference Csomay2021, for university classroom discourse), and (2) an additive MD analysis (Berber Sardinha et al., Reference Berber Sardinha, Veirano Pinto, Mayer, Zuppardi, Kauffmann, Berber 258Sardinha and Veirano Pinto2019), whereby the dimensions of variation had already been identified prior to the undergoing study. Typically, additive MDs are applied to see how the new context (i.e., the one we want to describe) adds to the understanding of the linguistic makeup of that register as a whole. For example, Csomay and Wu (Reference Csomay and Wu2020) carried out an additive MD to look for variation in the discourse structure of university classroom discourse in two geographically different contexts. In this chapter, we focus on how an additive MD is performed by way of examining SEMIC. For the steps to carry out a new factorial analysis, please see Chapter 6.

Existing Dimensions of Linguistic Variation in University Class Sessions

Applying an MD analytical framework to North American university classroom discourse, Csomay (Reference Csomay2021, Reference Csomay2022) identified four dimensions of linguistic variation in classroom conduct: (1) context-oriented versus topic-focused instruction; (2) scaffolded narrative type of instruction with personalized framing versus abstract information; (3) expository reporting versus concrete (nonabstract) information; and (4) stance. Each factor, interpreted as a dimension, has a set of linguistic features that systematically co-occur (Appendix A).

The text sample in Extract 1 shows a classroom segment where some of the co-occurring linguistic features on the positive side of Dimension 1 (underlined in Extract 1), context-oriented instruction, are present (e.g., present tense, demonstrative pronouns, second person pronouns, and contractions; see the full list in Appendix A). These features are typically found in class sessions where the instructor makes references to the immediate physical context, evokes a hypothetical situation, or explains the material with the help of visuals present in the classroom.

Extract 1 (Dimension 1 positive features):

Teacher: Kind of. alright. for example, this vector, this vector is one. here the maximum element for this vector is one, no problem, so the infinity norm is one. How about this. this one. why is it the infinity norm for this vector is one. because this one is may be point 8 and 1. so if you take the maximum value of the maximum elements in the vector, yes, it’s one. so indeed, if you look at all these vectors, one of the elements is one and the second element would be less than one. alright, so that’s why you put all the vector that has infinity norm equals to one, it would be a square, it would be a square. now the matrix norm, matrix norm, umm you also have four different matrix norm, but their definition is quite different right. so you see, the one norm for matrix is defined as the maximum column sum, so you you sum up all the columns and you pick the one that has the maximum value. so maximum column sum. and for infinity norm, it’s the maximum row sum. ok, if you have four rows, you will sum, you will take the sum of each row and you compare and you see which one is the the maximum value and that one is the infinity norm. sometimes you can get confused with these two right. one norm and infinity norm, so this is how I remember it. I want to share it with you.

The co-occurring features on the negative side of Dimension 1 (in bold in Extract 2), including nouns, prepositions, and attributive adjectives, appear in classroom sessions where the instructor exposes the students to information, reports to the students about the topic of the class, or quotes a text. In Extract 2, the instructor reads out a sentence or two from a text at hand.

Extract 2 (Dimension 1 negative features):

Teacher: So, I’ll just read this part here. ok? “I propose an intermediate space through which we can understand the transnationality of national cinema. Somewhere between Morris’ yielding to complete transnationalism in her belief in the subjectivity of audiences in different spaces, thereby subscribing to a belief in a definite presence of cultural heritage in modern cinema, and Ashish’s more pessimistic negation of any cultural heritage in cinema due to dominance of transnational profit interests” alright? and then she raises umm the the the uhh the genre or the type of films that have been popularized by Stephen Chow, alright? uhh that that is bordering on some kind of mockery or satire in a in a particular Hong Kong sense, alright? and then she ends with this paragraph. “We are hence not confined to either the complete freedom of Morris’ focus on audience subjectivity or the restriction Ashish’s belief in a complete demise of a cinema of heritage. It is in the opening up itself to the audience subjectivity, through its unabashed integration of commercial interests that heritage is preserved in a form of nostalgia.” ok? so she’s arguing for an agency, right?

The text sample in Extract 3 illustrates how the instructor uses dialogic techniques to draw out the students’ interpretation of the concept that is part of the topic for that particular class. In these types of classes, typically, both instructors and students use linguistic features such as “you know,” “I mean,” that-deletion particularly with mental verbs such as think, emphatics like really, and progressive aspects (in italics in Extract 3). Instructors in these classes use dialogic pedagogical approaches to engage students and help them understand the material.

Extract 3 (Dimension 2 positive features):

Student: Yeah, I think Student 6 is saying that we now clearly want to evolve on like kind of self-identity because it’s you know like in some ways I do not think any of us really knows, who we are. It invokes a lot of self-reflection, and often like

Teacher: Ok uhh huh, but you have had twenty years to figure it out, right?

Student: Because it’s not just a finite process.

Teacher: No? it’s not a finite process?

Student: It’s it’s it’s I mean it’s not a finite process and you keep discovering it.

Teacher: What is not a finite process? answering the question or your sense of identity? is it your sense that you

Student: Both as well.

Teacher: Both.

Student: Both as well ya.

Teacher: So if you say that it’s not finite, so your thinking is that identity is not really, uhh like it’s not a finished product for you?

Student: It’s it’s not immutable ya.

Teacher: Huh, is this a common understanding? like you feel like your identity? how how how not how immutable is it? like do you feel you are a different person every day? every week? every minute?

Student: I think it depends on what I’m doing.

Teacher: It depends on what you are doing?

Student: Ya.

Teacher: Ah, explain.

Cluster Analysis

Conceptually, cluster analyses are exploratory statistical techniques that classify variables into groups based on some measure(s). There are various types of cluster analyses, for example, componential analysis and K-means clustering techniques, each having a different goal and method of classification. For this study, we used K-means clustering techniques on SEMIC. Applied to linguistic data, K-means classifies texts with similar linguistic characteristics into groups that are often called “text-types” (Biber, Reference Biber1995), where the basis for classifying the texts is the dimension scores identified through the MD analysis and, in our example, with four dimensions as described earlier. The classification, or the grouping, is governed by distance measure calculations that the statistical program performs. With K-means clustering, the number of groups are predetermined by the researcher but the number of clusters could be changed until they are interpretable (hence, exploratory statistics). For example, with four dimension scores given to each text, if we first run a three-cluster solution but we find no interpretable results (e.g., none of the groups would have high average scores on at least one or two of the dimensions), we can run a new cluster analysis with two- or four-cluster solutions with no penalty. Unlike other types of tests, such as the t-test, where you cannot repeat runs, with K-means clustering we can rerun solutions as often as we want and until we find interpretable results. For classroom discourse in the North American context, five clusters were identified based on the four dimensions already described (Csomay & Crawford, Reference Csomay and Crawford2022).

Methodology

The steps of an additive MD analysis and cluster analysis are summarized here. In what follows, each step of the analysis is discussed in detail.

1. Build a corpus.
1. a. Record class sessions.
2. b. Transcribe recorded sessions based on preestablished transcribing conventions and save them as text files.
2. Provide a situational analysis of the context.
3. Prepare files for linguistic analysis.
1. a. Run a grammatical tagger on all texts (in this case, the Biber tagger).
2. b. Correct tags for a few grammatical features using the Fixtagger interactive program.
4. Calculate dimension scores.
1. a. Calculate a normed count and a Z-score for each linguistic feature on each of the existing dimensions of variation previously identified (in this case, Csomay, Reference Csomay2021).
2. b. Calculate dimension scores for each class session.
3. c. Describe texts grouped around extratextual features (e.g., discipline, level of instruction).
5. Identify clusters.
1. a. Run K-means clustering program on a software called Statistical Package for the Social Sciences (SPSS).
2. b. Name the clusters (text types; in this case, lecture types).
6. Report on patterns.
1. a. Describe the texts in your corpus based on the clusters and look for patterns based on aspects of the situation (e.g., discipline, level of instruction).

Build a Corpus

In our EMI context (Singapore) a corpus of thirty-one class sessions of about 500,000 words with a total of eight disciplinary areas was compiled at the National University of Singapore. The corpus contains transcriptions of entire class sessions (see the section “Prepare Files for Linguistic Analyses”) of both monologic lectures and seminar-style interactive teaching. Table 14.1 shows the number of words and the number of text files in each discipline.

Table 14.1 Distribution of files in the disciplinary areas in SEMIC

	# of words	# of files
Humanities	110,170	8
Social sciences	62,770	3
Natural sciences	59,110	5
Engineering	122,990	8
Interdisciplinary studies	25,550	2
Computer science	18,980	2
Law	23,200	1
Business	43,920	2
Total	466,690	31

Provide a Situational Analysis of the Context

The first step as we carry out an MD analysis (whether novel or additive) is to describe the context in which the language phenomena of our interest occurs. Biber and Conrad’s (Reference Biber and Conrad2009, pp. 40–47) framework for situational analysis of registers was adopted and modified to describe university classrooms as the speech situation. Following our analysis is a summary of the characteristics in Table 14.2.

Table 14.2 Summary of situational analysis of university lectures

Contextual aspects for situational analysis		University classrooms
Participants	Speaker (addressor), listener (addressee)	Single person (instructor – speaker) lecturing to a group of people (students – listeners); two or more people talking to each other (teacher–student or student–student)
Relations among participants	Knowledge of each other, social status, and social relations among participants	Teacher (with an established career) has higher level of social status than students; in student–student interaction social status is equal
Channel	Mode, medium of communication	Spoken mode
Production circumstances	With or without time for revision	May be planned prior to delivery with available time for revision in the immediate context
Setting	Delivered in shared or unshared physical space and at a private or public place	Shared physical and mental space between speakers and listeners at a public place
Communicative purposes	E.g., expose knowledge, inform, persuade, interact	To inform and to disseminate new information; to facilitate learning new information or recycling known information
Pedagogical purposes	E.g., explain new concepts, solicit response, recycle old information
Topic		Varies greatly by discipline, by individual lectures, and within each lecture
EMI context	ELF status, level of instruction, discipline	ELF status of participants could vary for both participants (instructor and students); typically three levels of instruction (lower- and upper-division undergraduate, graduate); disciplines and subdisciplines vary

Situational Analysis of University Classrooms in the Singapore EMI Context

The primary participants in all university class sessions,Footnote ¹ including EMI Singapore, are the instructor, typically a single person (speaker, addressor) lecturing, and the students (addressees), a variable-sized group of people listening. This is the “default” setup between the participants in this setting; however, either participant is permitted to take up the role of a speaker or a listener, affording other configurations as well. For example, when instructors, as the dominant speaker with control in this setting, allow students to ask questions, the speaker–listener roles change. Alternatively, when it is pair or group work, students take up both the speaker’s and the listener’s roles.

As for participant relations, the instructor always has more knowledge than the students. As accomplished professionals versus novice apprentices, instructors also have a higher level of social status than students. However, in student–student interactions, the level of topic knowledge and social status between the addressor and addressee approximate one another. Knowledge of each other could change through a period of time as the participants may not know each other at first, but through the course of the semester and beyond, they can be come acquainted with each other.

In our corpus of EMI Singapore, all class sessions were given in the spoken mode. They have been (audio) recorded, which then was transcribed for the linguistic analysis.

Instructors typically plan their class sessions prior to delivery in the form of an outline, for example, PowerPoint, or full texts but the actual language produced in an academic lecture is primarily spontaneous. Previous research has shown that monologic lectures are linguistically different from dialogic, interactive class sessions (Csomay, Reference Csomay, Reppen, Fitzmaurice and Biber2002), with potentially more spontaneity in language production for the latter as in face-to-face conversation. SEMIC contains both monologic and interactive class sessions.

University class sessions are in the public domain. The speech event takes place in the present, where participants share the same physical space. The corpus used in this study does not contain asynchronous or synchronous online classes as they were recorded during pre-COVID-19 times.

The central communicative purpose of university class sessions is to inform and to propagate new information to the listener. Each class session then can have its own pedagogical purposes, which could include, but would not be limited to “explaining new concepts, soliciting known information, facilitating the incorporation of new information with old, summarizing recently presented information, or recycling known information” (Csomay & Wu, Reference Csomay and Wu2020, p. 137). In fact, more than one pedagogical purpose can be present even within a single class session (Alsop and Nesi, Reference Alsop, Nesi, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis2014; Csomay, Reference Csomay2005; Csomay and Wu, Reference Csomay and Wu2020), as, for example, the instructor first explains a new concept, provides examples, and then asks the students to do an activity applying the newly learnt concept.

Finally, the topic of a lecture could vary significantly by discipline, lecture by lecture within one discipline, or even within a single class session in a semester. This situational variable is difficult to describe in general terms.

Additional situational characteristics of the EMI context for this particular speech situation include the participants’ English as a Lingua Franca (ELF) status (e.g., language proficiency level), level of instruction (lower- or upper-division undergraduate, graduate), or discipline and associated subdisciplines.

Table 14.2 summarizes the most important aspects of the situational analysis for this context (adapted and modified from Csomay & Wu, Reference Csomay and Wu2020) as described earlier.

Prepare Files for Linguistic Analyses

The first step in preparing the recordings for the linguistic analyses is to transcribe the audio files. For SEMIC, we followed the same recording protocol and the same transcription conventions as for the corpus based on which the MD was carried out for previous research (Biber et al., Reference Biber, Conrad, Reppen, Byrd, Helt, Clark, Cortes, Csomay and Urzua2004). Accordingly, the recording starts at the beginning time of the class session and each file contains additional information about the setting (e.g., class size, discipline); each speaker change is marked by a new line; reduced forms (e.g., contractions) and nonlinguistic vocalizations (e.g., um) are included in the transcriptions (see Appendix B for details).

The second step is to add grammatical tags to the texts. Since a corpus typically contains at least several hundred thousand words (or millions), the texts are marked up with a computerized automatic grammatical tagger (e.g., CLAWS). For SEMIC, we tagged the texts using the Biber tagger, which was developed based on the descriptions outlined in Biber (Reference Biber1988, pp. 214–221). This tagger was used for various reasons: (1) it is the most widely used part-of-speech (POS) tagger in MD analysis (Gray, Reference Gray, Sardinha and Pinto2019), (2) this tagger was reported to reach a high degree of accuracy of over 90 percent for most linguistic features (Biber & Gray, Reference Biber and Gray2013), and (3) the researchers worked with the tagger, had used it for over twenty years in their research, and had been intimately familiar with how it works. The following example illustrates what a text looks like after running it through the computer program.

Teacher: I want you to have two examples for the class. Um. You can do this.

Teacher: ^spkr+clp + 1++

I ^pp1a + pp1+++

want ^vb++++

you ^pp2 + pp2+++

to ^to++++

have ^vb + hv + vrb++

two ^cd++++

examples ^nns++++

for ^in++++

the ^ati++++

class ^nn++++

um ^uh++++

you ^pp2 + pp2+++

can ^md + pos+++

do ^vb + do+vrb++

this ^dt + pdem+++

Since some orthographic words have more than one function (e.g., participles) they are difficult to disambiguate automatically and subsequent manual editing is essential. One example to illustrate the automatic tagger’s difficulty is the word that since it can function as a demonstrative determiner as in that boy, a pronoun as in look at that, a verb complement as in I think that she was not happy, or a relative pronoun as in the car that he bought. Given the fact that the grammatical functions of that cannot always be identified automatically with high accuracy, these and some other problem tags (e.g., apostrophe s) need to be examined individually and corrected manually. For SEMIC, we used an interactive computer program called “Fixtag.” During the fixtagging process, the preselected problematic grammatical tags (like the ones mentioned here) are brought up on the screen, and then relying on a contextual interpretation of the text, the accurate tag is determined by a human and changed if necessary. These are critical issues when preparing the corpus for analysis as inaccurate tagging might lead to unreliable results.

Calculate Dimension Scores for an Additive MD

The steps to do an additive MD followed by a cluster analysis require the use of Excel and statistical software such as SPSS, Statistical Analysis System (SAS), or R. To calculate the dimension scores, the following steps are taken:

1. Calculate normed counts (to 1,000 words) and Z-scores for each linguistic feature on each of the existing dimensions of variation previously identified (in this case, Csomay, Reference Csomay2021).
2. Calculate dimension scores for each text (in this case, class session).

Normed Counts and Z-Scores

Norming is important for two reasons. First, we are able to compare results across texts. If we have texts of varying lengths in a dataset, a reliable comparison of raw frequency counts is difficult, simply because longer texts afford more occurrences of any feature, resulting in more raw frequency counts. Through norming, we scale the raw frequency scores to an equal text length (in this case to 1,000 words), which allows for intertextual comparisons of the counts. Second, through norming, we are translating nominal frequency scores into interval scores, allowing for them to be continuous data, to which parametric statistical tests can be applied.

To calculate normed counts, take the raw count of a particular linguistic feature (e.g., frequency of nouns) in a text (e.g., 35), divide it by the number of words in the given text file (e.g., 200 words), and multiply it by the number you would like to scale it to (e.g., 1,000).

Normed count = \frac{Raw feature count}{Text length (word count)} * Scaled text length (word count)

In our example, (35/200)*1,000. The normed score for this example is 175, which simply means that if the text length were 1,000 words, with the given rate, we would see 175 nouns in that text file. If we use 1,000 as the number of words to norm to (we can choose to norm to any length, but since earlier studies normed to that text length, for comparative reasons, it may be wise to keep that text length), and we norm each feature count this way in each text, we are able to compare the rates across texts.

Similar to normed counts, Z-scores are also scaled scores but this time to the entire corpus and not to a particular text length. With Z-scores, we standardize individual feature scores to tell us how many standard deviations away a score is from the group or overall mean. The reason this is important is because some features (e.g., relative clauses) are by nature not occurring very frequently. As Biber and Conrad (Reference Biber and Conrad2009) say, “standardized scores measure whether a feature is common or rare in a text relative to the overall average occurrence of that feature” (p. 229).

To calculate the Z-score, we take the normed count of a feature (e.g., the normed count of 175 for nouns) and deduct it from the mean score of that feature for the entire corpus (e.g., 200), and then divide it by the standard deviation of that feature for the entire corpus (e.g., 5.6), resulting in a Z-score of −4.46 in our example. The formula therefore is:

Z = \frac{x - \bar{x}}{sd}

Positive Z-scores indicate scores above the mean while negative ones below the mean. That is, a Z-score of 4.46 is 4.46 standard deviations above the mean, and, as in our example, a Z-score of −4.46 is 4.46 standard deviations below the mean.

Dimension Scores

Based on the Z-scores for each feature, dimension scores could be calculated for each text by deducting the sum of Z-scores of features on the negative side of the dimension from the sum of Z-scores of features on the positive side of the dimension. Taking three texts from SEMIC as an example (Text 1, Text 2, and Text 3), the normed frequency scores and the Z-scores for Dimension 2 for these three texts are listed in Table 14.3.

Table 14.3 Normed counts and Z-scores for Dimension 2 positive and negative features of a sample of texts from SEMIC

		Text 1		Text 2		Text 3
		Normed	Z-score	Normed	Z-score	Normed	Z-score
Positive side	Private verbs	21.30	0.88	18.50	0.23	16.20	−0.31
	Mental verbs	38.30	1.85	26.50	−0.09	24.80	−0.37
	That-deletion	6.30	1.43	3.80	−0.02	4.10	0.16
	“You know”	2.30	0.01	2.60	0.15	2.20	−0.03
	That clause with factive verbs	3.10	−2.06	3.50	−2.14	−2.22	−2.22
	“I mean”	1.50	0.96	0.20	−0.58	0.60	−0.11
	That clause with likelihood verbs	3.40	1.31	2.20	0.30	1.40	−0.37
	Third person pronoun	12.70	−0.09	14.90	0.17	18.40	0.58
	Past tense	17.70	0.98	9.10	−0.40	27.80	2.59
	Aspectual verb	13.00	0.05	10.50	−0.57	12.80	0.00
	Emphatics	6.90	0.07	8.30	0.68	6.30	−0.20
	Progressive aspect	8.50	−0.25	7.10	−0.61	8.70	−0.19
	Hedges	2.30	0.02	0.30	−1.13	3.10	0.47
	Adverbial subordinator: causative	3.00	−0.36	4.30	0.79	2.30	−0.98
Sum			4.80		−3.22		−0.98
Negative side	Abstract noun	23.60	−0.36	20.70	−0.71	17.80	−1.06
Negative side	Quantity noun	8.00	−0.18	8.80	−0.09	10.90	0.81
Sum			−0.54		−0.80		−0.25
Dimension 2 score		5.34		−2.42		−0.73

As Table 14.3 shows, for Text 1, the sum of Z-scores for features on the positive side of Dimension 2 is 4.80 and on the negative side is −0.54. Subtracting one from the other (4.80 – (−)0.54 = 4.80 + 0.54) results in 5.34. For Text 2, the sum of Z-scores for features on the positive side of Dimension 2 is −3.22, and the sum on the negative side of Dimension 2 is −0.80, hence the dimension score for that text is −2.42 (−3.22 – (−)0.80 = −3.22 + 0.80). Finally, for Text 3, the Z-score sum of features on the positive side is −0.98 and on the negative side it is −0.25, hence the dimension score for that text is −0.73 (−0.98 – (−)0.25 = −0.98 + 0.25).

Table 14.4 illustrates scores for each text and for each dimension. We only calculated the dimension scores for three texts on one dimension by hand from the results in Table 14.3 to illustrate how the calculation is done but the same procedure is to be followed for all texts in a corpus.

Table 14.4 Dimension scores of a sample of texts in SEMIC

	Dimension 1	Dimension 2	Dimension 3	Dimension 4
Text 1	24.04	5.34	8.61	−0.36
Text 2	−4.47	−2.42	−3.30	1.42
Text 3	−13.90	−0.73	4.49	−0.43

Identify Clusters

Once the dimension scores for each text are calculated, to determine the empirically based text types, a cluster analysis can be applied with the following steps:

1. Enter the dimension scores for each text into K-means clustering using SPSS, SAS, or R.
2. Determine the number of clusters depending on what allows interpretability.
3. Name clusters based on their dominance on particular dimensions.

As mentioned, since cluster analysis is an exploratory statistic, it is possible to run multiple solutions to arrive at the best possible result that can be interpreted. For the Singapore data we ran a three-, a four-, and a five-cluster solution using SPSS and decided that a five-cluster solution yields to the best representation of classroom conduct. Table 14.5 shows the three- and four-cluster solutions, and Table 14.6 shows the final, five-cluster, solution. The prominent dimensions for each cluster are highlighted.

Table 14.5 A three- and four-cluster solution for the Singapore data based on the four dimension scores

	Three-cluster solution			Four-cluster solution
	1	2	3	1	2	3	4
dim_1	20.20	−1.73	−12.80	9.63	−9.67	−9.47	21.67
dim_2	−0.54	−5.96	3.07	−6.26	2.25	−12.20	5.64
dim_3	1.02	−3.82	5.61	−3.48	3.59	−8.21	6.50
dim_4	−0.43	−0.42	1.06	−0.81	0.97	−0.27	−0.96

Table 14.6 A five-cluster solution for the Singapore data based on the four dimension scores

	Five-cluster solution
	1	2	3	4	5
dim_1	−2.36	17.79	−15.63	−.69	22.01
dim_2	−7.46	8.15	1.77	3.06	−7.06
dim_3	−5.88	7.08	4.62	6.28	−3.52
dim_4	−0.93	−0.82	1.05	1.39	−0.14

The clusters are named after the most dominant dimensions and the dominant linguistic features in them. In this case,

Cluster 1: Informational type of instruction without stance.
Cluster 2: Context-oriented expository type of instruction with scaffolded narrative and with personalized framing.
Cluster 3: Topic-focused instruction.
Cluster 4: Scaffolded exposition with stance.
Cluster 5: Context-oriented informational (abstract and concrete) type of instruction.

After deciding which cluster solution provides the best results, each text gets a cluster number illustrating which cluster they belong to and, by that, also telling us what kind of language is displayed in a particular lecture. Table 14.7 shows the listing for the first five texts in SEMIC.

Table 14.7 Clusters for a sample of texts in SEMIC

Text	Cluster #	Descriptor	Discipline
Text 1	2	Context-oriented expository type of instruction with scaffolded narrative and with personalized framing	Humanities
Text 2	1	Informational type of instruction without stance	Natural sciences
Text 3	3	Topic-focused instruction	Humanities
Text 4	5	Context-oriented informational (abstract and concrete) type of instruction	Engineering
Text 5	1	Informational type of instruction without stance	Engineering

Report Patterns

Further analysis could show some patterns of variation across disciplines, for example. Table 14.8 shows how many different cluster types each disciplinary area has.

Table 14.8 Number and types of clusters in different disciplines

Discipline/Cluster	1	2	3	4	5	Total
Humanities	0	3	3	2	0	8
Natural sciences	5	0	0	0	0	5
Engineering	5	0	0	1	4	10
Interdisciplinary	1	0	0	1	0	2
Social sciences	0	0	2	1	0	3
Business	1	0	1	0	0	2
Law	0	0	1	0	0	1
Total	12	3	7	5	4	31

As Table 14.8 shows, some patterns of variation in language use could be easily detected across disciplines. While natural sciences classes exhibit exclusively Cluster 1, informational type of instruction without stance, engineering classes are split between Cluster 1, and Cluster 5, context-oriented informational (abstract and concrete) type of instruction. Finally, humanities classes seem to exhibit a varying pattern across Cluster 2, context-oriented expository type of instruction with scaffolded narrative and with personalized framing, Cluster 3, topic-focused instruction, and Cluster 4, scaffolded exposition with stance. Two text samples illustrating two cluster types (Cluster 2 and Cluster 1, respectively) follow. In Extract 4, a segment of a Cluster 2 class, the instructor is soliciting responses from the students as the particular topic of the class is being discussed. In this scenario, the instructors are applying a Socratic type of method to scaffold the concepts in the material, and the students, fully engaged, contribute to the dialogue.

Extract 4 – Cluster 2 (Humanities – English):

S: Uhh ours was the Parsons.

T: Parsons? uhuh, ya go ahead.

S: So he said that culture is, hmm, I’m sorry he said that culture is acquired through social interactions and it comprises of your behavior and the outcome of the behavior and is nurtured rather than a product of nature, so…,

T: Yes, so what does that mean when we say nurtured, you know a lot of you are science majors right? what does it mean when you say it’s nurtured rather than a product of nature. what does that actually mean?

S: Umm like nature is genetic.

T: Right. whereas nurture is like, is like your environment. it influences. right. so in other words the cane is not part of nature it’s something that is nurtured right? what do all these definitions have in common? you look at all the definitions, they have something very clear in common. and you you you you heard at least one definition other than the one that you read. what do they have in common? student 7?

S: They just have different variations of the same thing which is actually,

T: Yes. what’s the same thing? what’s in common?

S: Uhh, hmm, is the product of the mind.

T: Product of?

S: It’s the product of the mind.

T: Of the mind. just the mind? I mean is it, how does it get out of the mind? the mind is in here. how does it take shape outside the mind?

S: It governs our behavior.

T: In terms of behavior right? yeah …

Extract 5 illustrates a Cluster 1 type of class. The instructor provides a historical view, focusing on the information to be disseminated to the students.

Extract 5 – Cluster 1 (Engineering – Electric and Computer Science):

Teacher: But that’s handled by computers now and I read a very interesting article a few months ago where a company was advertising that their fiber optic connection between London and New York City ran two milliseconds faster than their competitors. so that, you know, evidently, two milliseconds makes a big difference in the financial world, because if somebody has the information two milliseconds faster than somebody else, you know then, then the arbitrage can already be done and somebody can earn a profit out of that. Now, in the 1860s on the other hand, uhh you know they generated a lot of waste paper from all of these ticker machines that they had running all over the place. uhh they also had lots and lots of telegraph wires strung all over the place. I saw a photograph once of what New York City looked like in the 1880s. and it was a big mess of telegraph wires, just strung you know up on poles all over the city. it looks a lot different now. but you know what they did with all that waste paper was that they would throw it out the windows during ticker tape parades. cause it looks kind of pretty so here’s a ticker tape parade honoring the Apollo astronauts, Apollo 11 astronauts, after they got back from the moon. alright, so, uhh it’s eleven o’clock right now. I would like to take a five-minute break.

The text could be marked up for all the linguistic features that were associated with the relevant dimensions and that mark the two different clusters.

Conclusion

In this chapter, the procedures for an additive MD analysis were described as well as the methodological issues to consider and the decisions to be made for a corresponding cluster analysis. The steps were illustrated on SEMIC, which represents an EMI context. The clusters that were associated with instructional styles and the ways in which disciplines may vary in exhibiting language use, and styles, were briefly examined.

Further patterns in the EMI context could be described depending on other aspects of the context. For example, varying levels of instruction may prompt variation in language use, or classes taught in different geographical areas may exhibit language associated with different teaching styles within one discipline (Csomay & Wu, Reference Csomay and Wu2020).

Further corpus-based MD research in EMI contexts could highlight how communicative purposes and associated language functions in different EMI settings may vary depending on, for example, English proficiency, discipline, or academic culture. Earlier research in EMI concentrated primarily on institutional policy issues and pedagogical and language-related challenges for both students and instructors, but very few studies investigated the actual language use in EMI classrooms applying a systematic linguistic analysis of the discourse in those classrooms. This type of analysis, however, is crucial to inform the decisions of EMI stakeholders, policymakers, and institutional support systems about the specific linguistic demands of EMI in particular academic contexts.

Book contents

Part II - Empirical Chapters (Case Studies)

Information

Introduction

CB-SEM and PLS-SEM Comparability and Debate: When and How to Employ PLS-SEM

Case Study Illustration: Predicting Student Satisfaction in EMI Courses

Measures

Participants

Normal Distribution of Data

Table 5.1 Tests of normality

Choosing PLS-SEM over CB-SEM for This Case Study

Data Analysis with SmartPLS

Model Specification and Path Model Estimation

Evaluation of Measurement Models (Outer Model)

Table 5.2 Measurement model

Table 5.3 Discriminant validity (latent variable correlation and square root of AVE)

Table 5.4 HTMT

Table 5.5 Saturated model results

Evaluation of the Structural Model (Inner Model)

Table 5.6 Path coefficients

MGA

Table 5.7 Gender relationship with student satisfaction

Table 5.8 Path coefficient difference for gender

Discussion

Potential Applications for EMI-Related Contexts

Table 5.9 An overview of factor simplicated in EMI contexts

Introduction

Research Design

Quantitative Methods

Corpus Building and Cleaning

Linguistic Model

Data Analysis

Results Report

Stage 1: Examining the Dataset

Table 6.1 KMO and Bartlett’s test

Stage 2: Conducting the EFA

The Initial Solution

The Final Solution

Table 6.2 Pattern matrix

Stage 3: Offering the Factors Interpretation

Table 6.3 The measure types in Lu (Reference Lu2010)

Answers to the Two Research Questions

Implications of Using EFA for EMI

Conclusion

Fundamental Facts and Principles about Rasch Models

Table 7.1 Rasch Model categories

The Application of Rasch Modeling in Processing Questionnaire Data – Three Case Reports

The Replication of Japanese English Medium of Instruction Attitude Scale in the Higher Education Context of Mainland China – A Case Study

Questionnaire Distribution

Table 7.2 Demographic information of participants

Table 7.3 Factors underlying JEMIAS

Questionnaire Results Analysis and Discussion

Step 1: Model Selection

Step 2: Script Preparation

Table 7.4 Annotated script for MFRM

Step 3: Data Input Excerpt

Table 7.5 MFRM data input file excerpt

Step 4: Subscales and Dimensions

Step 5: More Statistical Support

Table 7.6 Logit scale and infit/outfit information for items of Factor 1

Table 7.7 Logit scales across different academic majors

Table 7.8 Logit scales across different items

Concluding Remarks

Introduction

The Research Project

Quantitative Research Design

Method in Detail

The Participants

Questionnaire Adaptation

The Pilot Test

Data Collection

Data Analysis

Data Reporting

Descriptive Statistics

Table 8.1 Descriptive statistics of the scale

Scale Validation

Table 8.2 Factor loadings from principal axis factor analysis with varimax rotation for a three-factor solution (N = 541)

Correlation of Different Constructs

Table 8.3 Correlation analysis

Demographic Information