This chapter covers regression and classification, where the goal is to estimate a quantity of interest (the response) from observed features. In regression, the response is a numerical variable. In classification, it belongs to a finite set of predetermined classes. We begin with a comprehensive description of linear regression and discuss how to leverage it to perform causal inference. Then, we explain under what conditions linear models tend to overfit or to generalize robustly to held-out data. Motivated by the threat of overfitting, we introduce regularization and ridge regression, and discuss sparse regression, where the goal is to fit a linear model that depends on only a small subset of the available features. Then, we introduce two popular linear models for binary and multiclass classification: logistic and softmax regression. At this point, we turn our attention to nonlinear models. First, we present regression and classification trees and explain how to combine them via bagging, random forests, and boosting. Second, we explain how to train neural networks to perform regression and classification. Finally, we discuss how to evaluate classification models.
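Since regularization is central to the discussion above, a minimal sketch of ridge regression may help fix ideas; the simulated data and the use of glmnet below are illustrative assumptions, not code from the chapter.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)      # response depends on only two of the features

cv_fit <- cv.glmnet(X, y, alpha = 0)     # alpha = 0: ridge (L2) penalty; alpha = 1 would give the lasso used in sparse regression
coef(cv_fit, s = "lambda.min")[1:6, ]    # intercept plus the first few shrunken coefficients
```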
How do I conduct a mixed effects logistic regression of a linguistic variable? This chapter will illustrate the procedures for performing statistical modelling using mixed effects logistic regression with the lme4 package in R. It will review the steps for conducting the analysis, for finding the best model for the feature under study, and what to do with that model once you have found it.
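As a rough sketch of the kind of model the chapter walks through, the following glmer call fits a mixed effects logistic regression with a random intercept for speaker; the data frame d and its columns (variant, style, age, speaker) are hypothetical placeholders, not the chapter's own data.

```r
library(lme4)

# Binary linguistic variant modelled with fixed effects for style and age
# and a random intercept for each speaker (hypothetical data frame `d`)
m <- glmer(variant ~ style + age + (1 | speaker),
           data = d, family = binomial)
summary(m)   # fixed effects, random-effect variance, Wald z tests
```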
A rich and important area of application for linear algebra is machine learning. In machine learning, one aims to achieve an optimized or learned understanding of various kinds of real-world phenomena from collected or observed data, without genuine comprehension of the mechanisms by which such phenomena function. These mechanisms are often impossible or impractical to grasp anyway. In this chapter, we present several introductory and fundamental problems in supervised machine learning, including linear regression, data classification, and logistic regression, together with the associated mathematical and computational methods.
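To make the linear-algebra view of linear regression concrete, here is a small sketch that solves the normal equations directly; the simulated design matrix and coefficients are assumptions for illustration only.

```r
set.seed(2)
X <- cbind(1, matrix(rnorm(200), 100, 2))          # 100 x 3 design matrix with an intercept column
beta_true <- c(1, 2, -3)
y <- X %*% beta_true + rnorm(100)

beta_hat <- solve(crossprod(X), crossprod(X, y))   # solve the normal equations X'X b = X'y
drop(beta_hat)                                     # estimates close to beta_true
```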
In this chapter, new computational models will focus on whether environmental health texts are suitable for parents rather than the general public. Logistic regression models will identify linguistic features that are important contributors to the prediction of the suitability of environmental health materials for parents and caregivers of young children, who are more likely to be affected by environmental health risks such as water pollution, excessive sun exposure, and radiation in natural and indoor environments.
This chapter describes how to characterize data and the distribution of data. We will also describe how the shape of the normal distribution enables hypothesis testing. In the section on regression, we look at how two variables, or two ways of measuring data, are related to each other. We will use simple linear regression as an introduction to multiple regression, the technique used in the development of a number of traditional readability measures. A more sophisticated form of regression, logistic regression, is also discussed; it will be applied in the case studies of Chapters 4 to 6.
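The progression from simple to multiple to logistic regression described above might look like the following in R; the simulated data frame texts and its columns are invented for illustration and are not the readability features used in the chapter.

```r
set.seed(5)
texts <- data.frame(
  avg_sentence_length = rnorm(80, mean = 18, sd = 4),
  pct_rare_words      = runif(80, 0, 0.3)
)
texts$readability <- 10 + 0.8 * texts$avg_sentence_length +
                     25 * texts$pct_rare_words + rnorm(80, sd = 5)
texts$is_easy     <- as.integer(texts$readability < median(texts$readability))

fit_simple   <- lm(readability ~ avg_sentence_length, data = texts)                   # simple linear regression
fit_multiple <- lm(readability ~ avg_sentence_length + pct_rare_words, data = texts)  # multiple regression
fit_logit    <- glm(is_easy ~ avg_sentence_length + pct_rare_words,
                    data = texts, family = binomial)                                  # logistic regression
summary(fit_logit)
```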
Item calibration is an essential issue in modern item response theory based psychological or educational testing. Owing to the popularity of computerized adaptive testing, methods to efficiently calibrate new items have become more important than they were when paper-and-pencil test administration was the norm. Many calibration processes have been proposed and discussed from both theoretical and practical perspectives. Among them, online calibration may be one of the most cost-effective. In this paper, under a variable-length computerized adaptive testing scenario, we integrate the methods of adaptive design, sequential estimation, and measurement error models to solve online item calibration problems. The proposed sequential estimate of item parameters is shown to be strongly consistent and asymptotically normally distributed with a prechosen accuracy. Numerical results show that the proposed method is very promising in terms of both estimation accuracy and efficiency. The results of using calibrated items to estimate latent trait levels are also reported.
A logistic regression model is suggested for estimating the relation between a set of manifest predictors and a latent trait assumed to be measured by a set of k dichotomous items. Usually the estimated subject parameters of latent trait models are biased, especially for short tests. Therefore, the relation between a latent trait and a set of predictors should not be estimated with a regression model in which the estimated subject parameters are used as a dependent variable. Direct estimation of the relation between the latent trait and one or more independent variables is suggested instead. Estimation methods and test statistics for the Rasch model are discussed and the model is illustrated with simulated and empirical data.
In this paper, robustness properties of the maximum likelihood estimator (MLE) and several robust estimators for the logistic regression model with binary responses are analysed. It is found that the MLE and the classical Rao score test can be misleading in the presence of model misspecification, which in the context of logistic regression means either misclassification errors in the responses or extreme data points in the design space. A general framework for robust estimation and testing is presented, along with a robust estimator and a robust testing procedure. They are shown to be less influenced by model misspecification than their classical counterparts, and are finally applied to the analysis of binary data from a study on breastfeeding.
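As a hedged illustration of the contrast between the MLE and a robust fit for binary responses, one option in R is robustbase::glmrob; this is simply one available robust estimator, not necessarily the estimator proposed in the paper, and the breastfeeding data frame bf and its columns are hypothetical.

```r
library(robustbase)

# Classical MLE fit versus a robust fit for the same binary-response model
# (hypothetical data frame `bf` with a binary outcome and two covariates)
fit_mle    <- glm(breastfed ~ smoking + age, data = bf, family = binomial)
fit_robust <- glmrob(breastfed ~ smoking + age, data = bf, family = binomial)

cbind(MLE = coef(fit_mle), Robust = coef(fit_robust))   # compare coefficient estimates
```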
Latent transition models increasingly include covariates that predict prevalence of latent classes at a given time or transition rates among classes over time. In many situations, the covariate of interest may be latent. This paper describes an approach for handling both manifest and latent covariates in a latent transition model. A Bayesian approach via Markov chain Monte Carlo (MCMC) is employed in order to achieve more robust estimates. A case example illustrating the model is provided using data on academic beliefs and achievement in a low-income sample of adolescents in the United States.
Our study aimed to develop and validate a nomogram to assess talaromycosis risk in hospitalized HIV-positive patients. Prediction models were built using data from a multicentre retrospective cohort study in China. On the basis of the inclusion and exclusion criteria, we collected data from 1564 hospitalized HIV-positive patients in four hospitals from 2010 to 2019. Inpatients were randomly assigned to the training or validation group at a 7:3 ratio. To identify the potential risk factors for talaromycosis in HIV-infected patients, univariate and multivariate logistic regression analyses were conducted. Through multivariate logistic regression, we determined ten variables that were independent risk factors for talaromycosis in HIV-infected individuals. A nomogram was developed following the findings of the multivariate logistic regression analysis. For user convenience, a web-based nomogram calculator was also created. The nomogram demonstrated excellent discrimination in both the training and validation groups [area under the ROC curve (AUC) = 0.883 vs. 0.889] and good calibration. The results of the clinical impact curve (CIC) analysis and decision curve analysis (DCA) confirmed the clinical utility of the model. Clinicians will benefit from this simple, practical, and quantitative strategy to predict talaromycosis risk in HIV-infected patients and can implement appropriate interventions accordingly.
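A compressed sketch of this pipeline, multivariate logistic regression followed by a check of discrimination via the AUC, might look as follows; the data frame hiv and the predictor names are placeholders, not the study's actual variables.

```r
library(pROC)

# Multivariate logistic regression on hypothetical predictors (data frame `hiv`)
fit  <- glm(talaromycosis ~ cd4_count + fever + anaemia,
            data = hiv, family = binomial)
pred <- predict(fit, type = "response")     # predicted probabilities

roc_obj <- roc(hiv$talaromycosis, pred)     # ROC curve
auc(roc_obj)                                # area under the ROC curve
```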
This chapter examines the conceptualization and measurement of contact phenomena in the context of bilingualism across various languages. The goal of the chapter is to account for various phonetic contact phenomena in sociolinguistic analysis, as well as providing context for elaborating on quantitative methodologies in sociophonetic contact linguistics. More specifically, the chapter provides a detailed account of global phenomena in modern natural speech contexts, as well as an up-to-date examination of quantitative methods in the field of sociolinguistics. The first section provides a background of theoretical concepts important to the understanding of sociophonetic contact in the formation of sound systems. The following sections focus on several key social factors that play a major part in the sociolinguistic approach to bilingual phonetics and phonology, including language dominance and age of acquisition at the segmental and the suprasegmental levels, as well as topics of language attitudes and perception, and typical quantitative methods used in sociolinguistics.
Taking a simplified approach to statistics, this textbook teaches students the skills required to conduct and understand quantitative research. It provides basic mathematical instruction without compromising on analytical rigor, covering the essentials of research design; descriptive statistics; data visualization; and statistical tests including t-tests, chi-squares, ANOVAs, Wilcoxon tests, OLS regression, and logistic regression. Step-by-step instructions with screenshots are used to help students master the use of the freely accessible software R Commander. Ancillary resources include a solutions manual and figure files for instructors, and datasets and further guidance on using STATA and SPSS for students. Packed with examples and drawing on real-world data, this is an invaluable textbook for both undergraduate and graduate students in public administration and political science.
Access to waste management services is crucial for urban sustainability, impacting public health, environmental well-being, and overall quality of life. This study employs logistic regression analysis on survey data collected from 1,032 household heads residing in Nouakchott, the capital of Mauritania. The survey investigated key household factors that determine access to waste management services. The findings reveal a significant interplay among waste service provision, the presence of cisterns, housing type and size, and access to electricity. There is a socioeconomic disparity in service access, with poorer housing formats such as shacks receiving substandard services. In contrast, areas with robust electrification report better service access, although inconsistencies remain amid power outages. The research highlights the challenges faced by the Nouakchott municipality, particularly rapid growth and inadequate infrastructure, which hinder waste management efficiency. Overall, the results not only illuminate Nouakchott’s unique challenges in service provision but also propose actionable recommendations for a sustainable urban future. These recommendations aim to inform and guide targeted policies for improving living conditions and environmental sustainability in urban Mauritania.
Many of the preceding chapters involved optimization formulations: linear least squares, Procrustes, low-rank approximation, multidimensional scaling. All of these have analytical solutions, like the pseudoinverse for minimum-norm least squares problems and the truncated singular value decomposition for low-rank approximation. But often we need iterative optimization algorithms, for example when no closed-form minimizer exists, or when the analytical solution requires too much computation and/or memory (e.g., the singular value decomposition for large problems). To solve an optimization problem via an iterative method, we start with some initial guess and then the algorithm produces a sequence of iterates that hopefully converges to a minimizer. This chapter describes the basics of gradient-based iterative optimization algorithms, including preconditioned gradient descent (PGD) for the linear LS problem. PGD uses a fixed step size, whereas preconditioned steepest descent uses a line search to determine the step size. The chapter then considers gradient descent and accelerated versions for general smooth convex functions. It applies gradient descent to the machine learning application of binary classification via logistic regression. Finally, it summarizes stochastic gradient descent.
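As a small illustration of fixed-step gradient descent for the linear LS problem, the following sketch uses step size 1/L, with L the largest eigenvalue of A'A; the simulated A and b are assumptions, and this is a plain (unpreconditioned) variant rather than the chapter's PGD.

```r
# Minimize (1/2) ||A x - b||^2 by gradient descent with a fixed step size 1/L
set.seed(3)
A <- matrix(rnorm(200), 50, 4); b <- rnorm(50)
L <- max(eigen(crossprod(A), only.values = TRUE)$values)   # Lipschitz constant of the gradient

x <- rep(0, ncol(A))
for (k in 1:500) {
  g <- crossprod(A, A %*% x - b)   # gradient A'(Ax - b)
  x <- x - g / L                   # fixed step size 1/L
}

# compare with the closed-form (normal equations) solution
max(abs(x - solve(crossprod(A), crossprod(A, b))))
```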
It was January 28, 1986. While the world was watching, just 73 seconds after take-off, the Challenger Space Shuttle exploded, killing all seven astronauts on board. The crew included the teacher Christa McAuliffe, who would have lectured schoolchildren from space. An important factor that contributed to the disaster was the extremely low temperature at launch. “Extreme” here means “well below temperatures experienced at previous launches”. In this chapter, we give a short overview of the errors that contributed to the explosion. These errors range from purely managerial errors to technical as well as statistical errors. Our discussion includes a statistical analysis of the malfunctioning of so-called rubber O-rings as a function of temperature at launch. As a prime example of efficient risk communication, we also recall the press conference at which the physics Nobel Prize winner, Richard Feynman, made his famous “piece-of-rubber-in-ice-water” presentation. This exposed the cause of the accident in all clarity.
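Analyses of the O-ring data of the kind alluded to above are often framed as a binomial logistic regression of distress counts on launch temperature; the sketch below assumes a hypothetical data frame orings with per-launch temperature temp (in °F) and the number of distressed rings damaged out of 6, and is not the chapter's own analysis.

```r
# Binomial logistic regression: O-ring distress count (out of 6) as a function of launch temperature
# (hypothetical data frame `orings` with columns `temp` and `damaged`)
fit <- glm(cbind(damaged, 6 - damaged) ~ temp, data = orings, family = binomial)
summary(fit)

# predicted probability of O-ring distress at a temperature far below previous launches, e.g. 31 °F
predict(fit, newdata = data.frame(temp = 31), type = "response")
```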
The germination percentage (GP) is commonly employed to estimate the viability of a seed population. Statistical methods such as analysis of variance (ANOVA) and logistic regression are frequently used to analyse GP data. While ANOVA has a long history of usage, logistic regression is considered more suitable for GP data due to its binomial nature. However, both methods have inherent issues that require attention. In this study, we address previously unexplored challenges associated with these methods and propose the utilization of a likelihood ratio test as a solution. We demonstrate the advantages of employing the likelihood ratio test for GP data analysis through simulations and real data analysis.
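A minimal sketch of a likelihood ratio test for germination counts, fitted as nested binomial GLMs, might look as follows; the data frame germ and its columns (germinated, n_seeds, treatment) are hypothetical, and this is not necessarily the exact test construction proposed in the study.

```r
# Null model (no treatment effect) versus a model with a treatment factor,
# both fitted as binomial GLMs on germinated counts out of n_seeds per dish
fit0 <- glm(cbind(germinated, n_seeds - germinated) ~ 1,
            data = germ, family = binomial)
fit1 <- glm(cbind(germinated, n_seeds - germinated) ~ treatment,
            data = germ, family = binomial)

anova(fit0, fit1, test = "Chisq")   # likelihood ratio (deviance) test of the treatment effect
```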
Alternating Dat-Nom/Nom-Dat verbs in Icelandic are notorious for instantiating two diametrically opposed argument structures: the Dat-Nom and the Nom-Dat construction. We conduct a systematic study of the relevant verbs to uncover the factors steering the alternation. This involves a comparison of 15 verbs: five alternating verbs and, as controls, five Nom-Dat verbs and five non-alternating Dat-Nom verbs. Our findings show that, when both arguments are full NPs, alternating verbs instantiate the Nom-Dat construction 54% of the time and the Dat-Nom construction 46% of the time on average for four of the five verbs. However, in configurations with a nominative pronoun, the Nom-Dat construction takes precedence over the Dat-Nom construction. Also, for the double-NP configuration, a logistic regression analysis identifies indefiniteness and length as two key predictors, apart from nominative case marking. We demonstrate that the latter systematically correlates with discourse prominence, which we show, upon closer inspection, correlates with topicality.
Chapter 3 demonstrates how the mathematics of turning Ordinary Least Squares (OLS) regression inside out can be generalized to Generalized Linear Models (GLM) including logistic, Poisson, negative binomial, random intercept, and fixed effects models.
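For orientation, minimal R calls corresponding to the GLM variants named above might look as follows; the data frame df and its columns are hypothetical, and these generic one-liners are not the chapter's derivations.

```r
# Hypothetical data frame `df` with a binary outcome, a count outcome, covariates, and a grouping factor
fit_logit   <- glm(y_binary ~ x1 + x2, data = df, family = binomial)   # logistic regression
fit_poisson <- glm(y_count  ~ x1 + x2, data = df, family = poisson)    # Poisson regression
fit_negbin  <- MASS::glm.nb(y_count ~ x1 + x2, data = df)              # negative binomial regression
fit_ranint  <- lme4::glmer(y_binary ~ x1 + (1 | group),
                           data = df, family = binomial)               # random intercept model
```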
As mentioned in the previous chapter, the perceptron does not perform smooth updates during training, which may slow down learning, or cause it to miss good solutions entirely in real-world situations. In this chapter, we discuss logistic regression, a machine learning algorithm that elegantly addresses this problem. We also extend vanilla logistic regression, which was designed for binary classification, to handle multiclass classification. Through logistic regression, we introduce the concept of a cost function (i.e., the function we aim to minimize during training) and gradient descent, the algorithm that implements this minimization procedure.
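To make the cost-function-plus-gradient-descent idea concrete, here is a small sketch that trains binary logistic regression by gradient descent on the average cross-entropy; the simulated data, learning rate, and iteration count are arbitrary choices for illustration, not the chapter's code.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(4)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))               # design matrix with an intercept column
y <- rbinom(n, 1, sigmoid(X %*% c(-1, 2, -2)))  # simulated binary labels

w  <- rep(0, ncol(X))
lr <- 0.5
for (i in 1:2000) {
  p    <- sigmoid(X %*% w)                      # predicted probabilities
  grad <- crossprod(X, p - y) / n               # gradient of the average cross-entropy cost
  w    <- w - lr * grad                         # gradient descent update
}

# compare with the maximum likelihood fit from glm()
cbind(gradient_descent = drop(w),
      glm              = coef(glm(y ~ X - 1, family = binomial)))
```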