Chapter 7 covers models with categorical endogenous variables. It examines the consequences of treating such variables as continuous and how to modify SEMs to take account of categorical variables. It begins with single equation regression-like models for binary, ordinal, and count variables and builds to multiequation models. It includes a polychoric correlation approach, models with exogenous observed variables, the treatment of missing values, and alternative modeling approaches for categorical variables.
This chapter provides a comprehensive introduction to supervised learning techniques for classification problems. It begins with logistic regression for binary classification, explaining the sigmoid function and gradient ascent optimization. The chapter then covers softmax regression for multi-class problems, followed by k-nearest neighbors (kNN) as an intuitive distance-based classifier.
Decision trees are explored in detail, including entropy, information gain, and the ID3 algorithm, along with derived decision rules and association rules. Random forests are presented as an ensemble method that addresses overfitting by combining multiple decision trees.
The chapter covers Naive Bayes classification, which rests on Bayes’ theorem together with a "naive" independence assumption. Finally, Support Vector Machines (SVMs) are introduced for both linear and non-linear classification using maximum margin hyperplanes.
Each technique includes hands-on R programming examples with real datasets, practical applications, and exercises to reinforce learning concepts.
This chapter explores supervised learning techniques where algorithms learn from labeled training data to make predictions. It begins with logistic regression for binary classification problems, using the sigmoid function to output probabilities between 0 and 1. Softmax regression extends this to multi-class problems. The chapter covers k-nearest neighbors (kNN), which classifies data points based on their similarity to training examples. Decision trees use entropy and information gain to create interpretable classification rules, while random forests combine multiple decision trees to reduce overfitting through ensemble methods. Naive Bayes applies Bayes’ theorem with independence assumptions for probabilistic classification, particularly effective for text classification. Finally, support vector machines (SVM) find optimal decision boundaries by maximizing margins between classes. Each technique is demonstrated through hands-on Python examples using real datasets, showing practical applications in various domains from healthcare to finance.
Alternation among restrictive relativizers in written Standard English is undergoing a massive shift from which to that. In corpora of written-edited-published British and American English covering the period from 1961 to 1992, American English spearheads this change. We study 16,868 restrictive relative clauses with inanimate antecedents from the Brown quartet of corpora. Predictors include additional areas of variation regulated by prescriptivism. We show that: (i) relativizer deletion follows different constraints from the selection of either that or which; (ii) this change is a case of institutionally backed colloquialization-cum-Americanization; and (iii) uptake of the precept correlates with avoidance of the passive voice at the text level but not with other prescriptive rules.
Regression and classification are closely related, as shown in this chapter, which discusses methods used to map a linear regression function into a probability function using either the logistic function (for binary classification) or the softmax function (for multi-class classification). According to this probability function, an unlabeled sample can be assigned to one of the classes. The optimal model parameters in this method can be obtained from the training set so that either the likelihood or the posterior probability of these parameters is maximized.
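The mapping described in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the chapter's own code: the function names (`linear_score`, `predict_binary`, `predict_multiclass`), the 0.5 threshold, and the example weights are all assumptions made here for demonstration.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    # Two branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Maps a list of real-valued scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_score(weights, bias, x):
    """Linear regression function w . x + b (hypothetical helper)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_binary(weights, bias, x, threshold=0.5):
    """Assigns class 1 if the logistic probability reaches the threshold (assumed 0.5)."""
    p = sigmoid(linear_score(weights, bias, x))
    return (1 if p >= threshold else 0), p


def predict_multiclass(weight_rows, biases, x):
    """Assigns the class whose softmax probability is largest (one weight row per class)."""
    scores = [linear_score(w, b, x) for w, b in zip(weight_rows, biases)]
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Illustrative parameters only; in practice they would be estimated from a training set.
binary_label, binary_prob = predict_binary([1.0, -1.0], 0.0, [2.0, 1.0])
multi_label, multi_probs = predict_multiclass(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0], [0.5, 2.0]
)
```

In both cases the linear score is computed first and only then passed through the logistic or softmax function, so the classifier remains linear in the features.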
This chapter focuses on the core concepts of optimization theory and its application in data science and AI. It begins with a review of differentiable functions of several variables, including the gradient and Hessian matrices, and key results like the Chain Rule and the Mean Value Theorem. The chapter then introduces optimality conditions for unconstrained optimization, explaining first-order and second-order conditions, and the role of convexity in ensuring global optimality. A detailed discussion of the gradient descent algorithm is provided, including its convergence analysis under different assumptions. The chapter concludes with an application to logistic regression, demonstrating how gradient descent is used to optimize the cross-entropy loss function in a supervised learning context. Practical Python examples are integrated throughout to illustrate the theoretical concepts.
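As a rough illustration of the application mentioned at the end of this abstract, the following sketch runs batch gradient descent on the mean cross-entropy loss of a logistic regression model. It is not taken from the chapter: the function names, the learning rate, the step count, and the toy one-feature dataset are all assumptions chosen for demonstration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def cross_entropy(w, b, X, y, eps=1e-12):
    """Mean cross-entropy loss; eps guards against log(0)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi) + b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def fit_logistic(X, y, lr=0.5, steps=500):
    """Batch gradient descent with a fixed step size (lr and steps are assumed values)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi) + b) - yi  # derivative of the loss w.r.t. the score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data (hypothetical): one feature, label 1 for the larger values.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

initial_loss = cross_entropy([0.0], 0.0, X, y)
w, b = fit_logistic(X, y)
final_loss = cross_entropy(w, b, X, y)
```

Because the cross-entropy loss of logistic regression is convex, a sufficiently small fixed step size decreases the loss at every iteration. The step size used here was picked by hand for this toy dataset.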
This chapter covers regression and classification, where the goal is to estimate a quantity of interest (the response) from observed features. In regression, the response is a numerical variable. In classification, it belongs to a finite set of predetermined classes. We begin with a comprehensive description of linear regression and discuss how to leverage it to perform causal inference. Then, we explain under what conditions linear models tend to overfit or to generalize robustly to held-out data. Motivated by the threat of overfitting, we introduce regularization and ridge regression, and discuss sparse regression, where the goal is to fit a linear model that only depends on a small subset of the available features. Then, we introduce two popular linear models for binary and multiclass classification: Logistic and softmax regression. At this point, we turn our attention to nonlinear models. First, we present regression and classification trees and explain how to combine them via bagging, random forests, and boosting. Second, we explain how to train neural networks to perform regression and classification. Finally, we discuss how to evaluate classification models.
How do I conduct a mixed effects logistic regression of a linguistic variable? This chapter will illustrate the procedures for performing statistical modelling using mixed effects logistic regression with the lme4 package in R. It will review the steps for conducting analyses, for finding the best model for the feature under study, and what to do with it when you find it.
A rich and important area for the applications of linear algebra is machine learning. In machine learning, one aims to achieve optimized or learned understanding of various kinds of real-world phenomena from data collected or observed, without real comprehension of the functioning mechanisms of such phenomena. These functioning mechanisms are often impossible or impractical to grasp anyway. In this chapter, we present several introductory and fundamental problems in supervised machine learning, including linear regression, data classification, and logistic regression, and the associated mathematical and computational methods.
In this chapter, new computational models will focus on whether environmental health texts are suitable for parents rather than the general public. Logistic regression models will identify linguistic features that are important contributors to the prediction of the suitability of environmental health materials for parents and caregivers of young children, who are more likely to be affected by environmental health risks such as water pollution, excessive sun exposure, and radiation in natural and indoor environments.
This chapter describes how to characterize data and the distribution of data. We will also describe how the shape of the normal distribution enables hypothesis testing. In the section on regression, we look at how two variables or ways of measuring data are related to each other. We will use simple linear regression as an introduction to multiple regression, the technique used in the development of a number of traditional readability measures. A more sophisticated form of regression, called logistic regression, is also discussed; it will be applied in the case studies of Chapters 4 to 6.
Item calibration is an essential issue in modern item response theory-based psychological or educational testing. Due to the popularity of computerized adaptive testing, methods to efficiently calibrate new items have become more important than they were when paper-and-pencil test administration was the norm. Many calibration processes have been proposed and discussed from both theoretical and practical perspectives. Among them, online calibration may be one of the most cost-effective processes. In this paper, under a variable length computerized adaptive testing scenario, we integrate the methods of adaptive design, sequential estimation, and measurement error models to solve online item calibration problems. The proposed sequential estimate of item parameters is shown to be strongly consistent and asymptotically normally distributed with a prechosen accuracy. Numerical results show that the proposed method is very promising in terms of both estimation accuracy and efficiency. The results of using calibrated items to estimate the latent trait levels are also reported.
A logistic regression model is suggested for estimating the relation between a set of manifest predictors and a latent trait assumed to be measured by a set of k dichotomous items. Usually the estimated subject parameters of latent trait models are biased, especially for short tests. Therefore, the relation between a latent trait and a set of predictors should not be estimated with a regression model in which the estimated subject parameters are used as a dependent variable. Direct estimation of the relation between the latent trait and one or more independent variables is suggested instead. Estimation methods and test statistics for the Rasch model are discussed and the model is illustrated with simulated and empirical data.
In this paper, robustness properties of the maximum likelihood estimator (MLE) and several robust estimators for the logistic regression model when the responses are binary are analysed. It is found that the MLE and the classical Rao's score test can be misleading in the presence of model misspecification, which in the context of logistic regression means either misclassification errors in the responses or extreme data points in the design space. A general framework for robust estimation and testing is presented, together with a robust estimator and a robust testing procedure. It is shown that they are less influenced by model misspecifications than their classical counterparts. They are finally applied to the analysis of binary data from a study on breastfeeding.
Latent transition models increasingly include covariates that predict prevalence of latent classes at a given time or transition rates among classes over time. In many situations, the covariate of interest may be latent. This paper describes an approach for handling both manifest and latent covariates in a latent transition model. A Bayesian approach via Markov chain Monte Carlo (MCMC) is employed in order to achieve more robust estimates. A case example illustrating the model is provided using data on academic beliefs and achievement in a low-income sample of adolescents in the United States.
Our study aimed to develop and validate a nomogram to assess talaromycosis risk in hospitalized HIV-positive patients. Prediction models were built using data from a multicentre retrospective cohort study in China. On the basis of the inclusion and exclusion criteria, we collected data from 1564 hospitalized HIV-positive patients in four hospitals from 2010 to 2019. Inpatients were randomly assigned to the training or validation group at a 7:3 ratio. To identify the potential risk factors for talaromycosis in HIV-infected patients, univariate and multivariate logistic regression analyses were conducted. Through multivariate logistic regression, we determined ten variables that were independent risk factors for talaromycosis in HIV-infected individuals. A nomogram was developed following the findings of the multivariate logistic regression analysis. For user convenience, a web-based nomogram calculator was also created. The nomogram demonstrated excellent discrimination in both the training and validation groups [area under the ROC curve (AUC) = 0.883 vs. 0.889] and good calibration. The results of the clinical impact curve (CIC) analysis and decision curve analysis (DCA) confirmed the clinical utility of the model. Clinicians will benefit from this simple, practical, and quantitative strategy to predict talaromycosis risk in HIV-infected patients and can implement appropriate interventions accordingly.
This chapter examines the conceptualization and measurement of contact phenomena in the context of bilingualism across various languages. The goal of the chapter is to account for various phonetic contact phenomena in sociolinguistic analysis and to provide context for elaborating on quantitative methodologies in sociophonetic contact linguistics. More specifically, the chapter provides a detailed account of global phenomena in modern natural speech contexts, as well as an up-to-date examination of quantitative methods in the field of sociolinguistics. The first section provides a background of theoretical concepts important to the understanding of sociophonetic contact in the formation of sound systems. The following sections focus on several key social factors that play a major part in the sociolinguistic approach to bilingual phonetics and phonology, including language dominance and age of acquisition at the segmental and the suprasegmental levels, as well as topics of language attitudes and perception, and typical quantitative methods used in sociolinguistics.
Taking a simplified approach to statistics, this textbook teaches students the skills required to conduct and understand quantitative research. It provides basic mathematical instruction without compromising on analytical rigor, covering the essentials of research design; descriptive statistics; data visualization; and statistical tests including t-tests, chi-squares, ANOVAs, Wilcoxon tests, OLS regression, and logistic regression. Step-by-step instructions with screenshots are used to help students master the use of the freely accessible software R Commander. Ancillary resources include a solutions manual and figure files for instructors, and datasets and further guidance on using STATA and SPSS for students. Packed with examples and drawing on real-world data, this is an invaluable textbook for both undergraduate and graduate students in public administration and political science.
Access to waste management services is crucial for urban sustainability, impacting public health, environmental well-being, and overall quality of life. This study employs logistic regression analysis on survey data collected from 1,032 household heads residing in Nouakchott, the capital of Mauritania. The survey investigated key household factors that determine access to waste management services. The findings reveal a significant interplay among waste service provision, the presence of cisterns, housing type and size, and access to electricity. Socioeconomic disparity in service access is evident, with poorer housing formats like shacks receiving substandard services. In contrast, areas with robust electrification report better service access, although inconsistencies remain amid power outages. The research highlights the challenges faced by Riyadh municipality, particularly rapid growth and inadequate infrastructure, which hinder waste management efficiency. Overall, the results not only illuminate Nouakchott’s unique challenges in service provision but also propose actionable recommendations for a sustainable urban future. These recommendations aim to inform and guide targeted policies for improving living conditions and environmental sustainability in urban Mauritania.