To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
In this chapter, we explore an unsupervised learning problem: estimating a distribution function from two-dimensional data. Although there is no response variable, the workflow mirrors that of supervised learning. We select the best-fitting function within a family by maximising the sum of the log of the distribution's values at the observed data points. As in supervised learning, excessive flexibility leads to overfitting, while insufficient flexibility leads to underfitting. We use cross-validation to identify a function family that achieves a happy medium.
In this chapter, we examine our first supervised learning problem, focusing on how to construct prediction functions and assess their performance. Given data consisting of predictor–response pairs, we can learn the parameters of a prediction function by minimising a loss, such as the residual sum of squares, which measures the discrepancy between actual and predicted responses. Using more flexible families of prediction functions typically reduces loss on the training data, but excessive flexibility can lead to overfitting: fitting to noise rather than the systematic component of the relationship. Overfitting results in poor prediction performance on new, unseen data. To estimate how a prediction method will perform on unseen data, we use cross-validation. However, when we compare many prediction methods using cross-validation, the best-performing method often appears better than it truly is; its apparent performance is an unreliable guide to its future accuracy. Prior knowledge is crucial for selecting plausible prediction methods to compare. Finally, we can use bootstrapping to quantify uncertainty in prediction functions and their predictions.
In this chapter, we explain how to estimate the prediction error of a regression model. The training error (the average of the squared residuals) under-estimates the prediction error. Instead, we use cross-validation that involves separating the data into one part for fitting the model and one part for estimating the prediction error. We can use the estimated prediction error to choose among a set of possible regression models.
This article proposes a mixed-effects machine-learning framework for modeling complex, nonlinear relations between predictors and continuous outcomes in multilevel cross-classified data. The proposed method, termed LMM–XGBoost, embeds extreme gradient boosting (XGBoost) within a linear mixed model (LMM) to combine flexible modeling of nonlinear and interaction effects with random effects that model dependence. In addition, an iterative estimation procedure for LMM–XGBoost is developed, a group-aware permutation importance measure that respects multilevel dependence is proposed, and a combined-group cross-validation (CV) strategy for hyperparameter tuning, out-of-fold (OOF) prediction, and importance estimation is developed for cross-classified designs. The simulation study shows that the proposed estimation method for LMM–XGBoost yields good parameter recovery under non-zero random-effect variances. In addition, relative to standard LMM and XGBoost, LMM–XGBoost achieves lower OOF prediction error and more accurate recovery of variable importance. The study further shows that combined-group CV and group-aware permutation importance yield less biased error estimates and substantially higher agreement with the true importance rankings than conventional permutation measures. An empirical application using the Add Health study illustrates how the proposed methods can identify important factors across multiple domains associated with adolescent depressive symptoms.
Surrogate models have gained widespread popularity for their effectiveness in replacing computationally expensive numerical analyses, particularly in scenarios such as design optimization procedures, requiring hundreds or thousands of simulations. While one-shot sampling methods—where all samples are generated in a single stage without prior knowledge of the required sample size—are commonly adopted in the creation of surrogate models, these methods face significant limitations. Given that the characteristics of the underlying system are generally unknown prior to training, adopting one-shot sampling can lead to suboptimal model performance or unnecessary computational costs, especially in complex or high-dimensional problems. This paper addresses these challenges by proposing a novel, model-independent adaptive sampling approach with batch selection, termed Cross-Validation Batch Adaptive Sampling for High-Efficiency Surrogates (CV-BASHES). CV-BASHES is first validated using two analytical functions to explore its flexibility and accuracy under different configurations, confirming its robustness. Comparative studies on the same functions with two state-of-the-art methods, maximum projection (MaxPro) and scalable adaptive sampling (SAS), demonstrate the superior accuracy and robustness of CV-BASHES. Its applicability is further demonstrated through a geotechnical application, where CV-BASHES is used to develop a surrogate model to predict the horizontal deformation of a diaphragm wall supporting a deep excavation. Results show that CV-BASHES efficiently selects training samples, reducing the dataset size while maintaining high surrogate accuracy. By offering more efficient sampling strategies, CV-BASHES streamlines and enhances the process of creating machine learning models as surrogates for tackling complex problems in general engineering disciplines.
This chapter focuses on data collection methods, analysis approaches, and evaluation techniques in data science. It covers various data collection methods including surveys (with different question types like multiple-choice, Likert scales, and open-ended questions), interviews, focus groups, diary studies, and user studies in lab and field settings.
The chapter distinguishes between quantitative methods (using numerical measurements and statistical analysis) and qualitative methods (observing behaviors, attitudes, and opinions through techniques like grounded theory and constant comparison). It also discusses mixed-method approaches that combine both methodologies.
For evaluation, the chapter explains model comparison metrics including precision, recall, F-measure, ROC curves, AIC, and BIC. It covers validation techniques like training-testing splits, A/B testing, and cross-validation methods. The chapter emphasizes that data science involves pre-data collection planning and post-analysis evaluation, not just data processing.
This chapter focuses on data collection methods, analysis approaches, and evaluation techniques in data science. It covers various data collection methods including surveys (with different question types like multiple-choice, Likert scales, and open-ended questions), interviews, focus groups, diary studies, and user studies in lab and field settings.
The chapter distinguishes between quantitative methods (using numerical measurements and statistical analysis) and qualitative methods (observing behaviors, attitudes, and opinions through techniques like grounded theory and constant comparison). It also discusses mixed-method approaches that combine both methodologies.
For evaluation, the chapter explains model comparison metrics including precision, recall, F-measure, ROC curves, AIC, and BIC. It covers validation techniques like training-testing splits, A/B testing, and cross-validation methods. The chapter emphasizes that data science involves pre-data collection planning and post-analysis evaluation, not just data processing.
In latent space item response models (LSIRMs), subjects and items are embedded in a low-dimensional Euclidean latent space. As such, interactions among persons and/or items can be revealed that are unmodeled in conventional item response theory models. Current estimation approach for LSIRMs is a fully Bayesian procedure with Markov Chain Monte Carlo, which is, while practical, computationally challenging, hampering applied researchers to use the models in a wide range of settings. Therefore, we propose an LSIRM based on two variants of regularized joint maximum likelihood (JML) estimation: penalized JML and constrained JML. Owing to the absence of integrals in the likelihood, the JML methods allow for various models to be fit in limited amount of time. This computational speed facilitates a practical extension of LSIRMs to ordinal data, and the possibility to select the dimensionality of the latent space using cross-validation. In this study, we derive the two JML approaches and address different issues that arise when using maximum likelihood to estimate the LSIRM. We present a simulation study demonstrating acceptable parameter recovery and adequate performance of the cross-validation procedure. In addition, we estimate different binary and ordinal LSIRMs on real datasets pertaining to deductive reasoning and personality. All methods are implemented in R package ‘LSMjml’ which is available from CRAN.
This chapter considers some basic concepts of essentail importance in supervised learning, of which the fundamental task is to model the given dataset (training set) so that the model prediction matches the given data optimally in certain sense. As typically the form of the model is predetermined, the task of supervised learning is essentially to find the optimal parameters of the model in either of two ways: (a) the least squares estimation (LSE) method that minimizes the squared error between the model prediction and observed data, or (b) the maximum A posteriori (MAP) method that maximizes the posterior probability of the model parameters given the data is maximized. The chapter further considers some important issues including overfitting, underfitting, and bias-variance tradeoff, faced by all supervised learning methods based on noisy data, and then some specific methods to address such issues, including cross-validation, regularization, and ensemble learning.
This chapter explores advanced applications of network machine learning for multiple networks. We introduce anomaly detection in time series of networks, identifying significant structural changes over time. The chapter then focuses on signal subnetwork estimation for network classification tasks. We present both incoherent and coherent approaches, with incoherent methods identifying edges that best differentiate between network classes, and coherent methods leveraging additional network structure to improve classification accuracy. Practical applications, such as classifying brain networks, are emphasized throughout. These techniques apply to collections of networks, providing a toolkit for analyzing and classifying complex, multinetwork datasets. By integrating previous concepts with new methodologies, we offer a framework for extracting insights and making predictions from diverse network structures with associated attributes.
What is the optimal level of questionnaire detail required to measure bilingual language experience? This empirical evaluation compares alternative measures of language exposure of varying cost (i.e., questionnaire detail) in terms of their performance as predictors of oral language outcomes. The alternative measures were derived from Q-BEx questionnaire data collected from a diverse sample of 121 heritage bilinguals (5–9 years of age) growing up in France, the Netherlands and the UK. Outcome data consisted of morphosyntax and vocabulary measures (in the societal language) and parental estimates of oral proficiency (in the heritage language). Statistical modelling exploited information-theoretic and cross-validation approaches to identify the optimal language exposure measure. Optimal cost–benefit was achieved with cumulative exposure (for the societal language) and current exposure in the home (for the heritage language). The greatest level of questionnaire detail did not yield more reliable predictors of language outcomes.
Typical Bayesian methods for models with latent variables (or random effects) involve directly sampling the latent variables along with the model parameters. In high-level software code for model definitions (using, e.g., BUGS, JAGS, Stan), the likelihood is therefore specified as conditional on the latent variables. This can lead researchers to perform model comparisons via conditional likelihoods, where the latent variables are considered model parameters. In other settings, however, typical model comparisons involve marginal likelihoods where the latent variables are integrated out. This distinction is often overlooked despite the fact that it can have a large impact on the comparisons of interest. In this paper, we clarify and illustrate these issues, focusing on the comparison of conditional and marginal Deviance Information Criteria (DICs) and Watanabe–Akaike Information Criteria (WAICs) in psychometric modeling. The conditional/marginal distinction corresponds to whether the model should be predictive for the clusters that are in the data or for new clusters (where “clusters” typically correspond to higher-level units like people or schools). Correspondingly, we show that marginal WAIC corresponds to leave-one-cluster out cross-validation, whereas conditional WAIC corresponds to leave-one-unit out. These results lead to recommendations on the general application of the criteria to models with latent variables.
In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
The use of multilevel VAR(1) models to unravel within-individual process dynamics is gaining momentum in psychological research. These models accommodate the structure of intensive longitudinal datasets in which repeated measurements are nested within individuals. They estimate within-individual auto- and cross-regressive relationships while incorporating and using information about the distributions of these effects across individuals. An important quality feature of the obtained estimates pertains to how well they generalize to unseen data. Bulteel and colleagues (Psychol Methods 23(4):740–756, 2018a) showed that this feature can be assessed through a cross-validation approach, yielding a predictive accuracy measure. In this article, we follow up on their results, by performing three simulation studies that allow to systematically study five factors that likely affect the predictive accuracy of multilevel VAR(1) models: (i) the number of measurement occasions per person, (ii) the number of persons, (iii) the number of variables, (iv) the contemporaneous collinearity between the variables, and (v) the distributional shape of the individual differences in the VAR(1) parameters (i.e., normal versus multimodal distributions). Simulation results show that pooling information across individuals and using multilevel techniques prevent overfitting. Also, we show that when variables are expected to show strong contemporaneous correlations, performing multilevel VAR(1) in a reduced variable space can be useful. Furthermore, results reveal that multilevel VAR(1) models with random effects have a better predictive performance than person-specific VAR(1) models when the sample includes groups of individuals that share similar dynamics.
Using Carroll's external analysis, several studies have found that unfolding models account for more, although seldom significantly more, variance in preferences than Tucker's vector model. In studies of sociometric ratings and political preferences, the unfolding model again rarely outpredicted the vector model by a significant amount. Yet on cross-validation, the unfolding model consistently accounted for more variance. Results suggest that sometimes significance tests are less sensitive than cross-validation procedures to the small but consistent superiority of the unfolding model. Future researchers may wish to use significance tests and cross-validation techniques in comparing models.
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
It is widely believed that a joint factor analysis of item responses and response time (RT) may yield more precise ability scores that are conventionally predicted from responses only. For this purpose, a simple-structure factor model is often preferred as it only requires specifying an additional measurement model for item-level RT while leaving the original item response theory (IRT) model for responses intact. The added speed factor indicated by item-level RT correlates with the ability factor in the IRT model, allowing RT data to carry additional information about respondents’ ability. However, parametric simple-structure factor models are often restrictive and fit poorly to empirical data, which prompts under-confidence in the suitablity of a simple factor structure. In the present paper, we analyze the 2015 Programme for International Student Assessment mathematics data using a semiparametric simple-structure model. We conclude that a simple factor structure attains a decent fit after further parametric assumptions in the measurement model are sufficiently relaxed. Furthermore, our semiparametric model implies that the association between latent ability and speed/slowness is strong in the population, but the form of association is nonlinear. It follows that scoring based on the fitted model can substantially improve the precision of ability scores.
We first discuss a phenomenon called data mining. This can involve multiple tests on which variables or correlations are relevant. If used improperly, data mining may associate with scientific misconduct. Next, we discuss one way to arrive at a single final model, involving stepwise methods. We see that various stepwise methods lead to different final models. Next, we see that various configurations in test situations, here illustrated for testing for cointegration, lead to different outcomes. It may be possible to see which configurations make most sense and can be used for empirical analysis. However, we suggest that it is better to keep various models and somehow combine inferences. This is illustrated by an analysis of the losses in airline revenues in the United States owing to 9/11. We see that out of four different models, three estimate a similar loss, while the fourth model suggests only 10 percent of that figure. We argue that it is better to maintain various models, that is, models that stand various diagnostic tests, for inference and for forecasting, and to combine what can be learned from them.
from
Part III
-
Methodological Challenges of Experimentation in Sociology
Davide Barrera, Università degli Studi di Torino, Italy,Klarita Gërxhani, Vrije Universiteit, Amsterdam,Bernhard Kittel, Universität Wien, Austria,Luis Miller, Institute of Public Goods and Policies, Spanish National Research Council,Tobias Wolbring, School of Business, Economics and Society at the Friedrich-Alexander-University Erlangen-Nürnberg
This chapter addresses the often-misunderstood concept of validity. Much of the methodological discussion around sociological experiments is framed in terms of internal and external validity. The standard view is that the more we ensure that the experimental treatment is isolated from potential confounds (internal validity), the more unlikely it is that the experimental results can be representative of phenomena of the outside world (external validity). However, other accounts describe internal validity as a prerequisite of external validity: Unless we ensure internal validity of an experiment, little can be said of the outside world. We contend in this chapter that problems of either external or internal validity do not necessarily depend on the artificiality of experimental settings or on the laboratory–field distinction between experimental designs. We discuss the internal–external distinction and propose instead a list of potential threats to the validity of experiments that includes "usual suspects" like selection, history, attrition, and experimenter demand effects and elaborate on how these threats can be productively handled in experimental work. Moreover, in light of the different types of experiments, we also discuss the strengths and weaknesses of each regarding threats to internal and external validity.
We can easily find ourselves with lots of predictors. This situation has been common in ecology and environmental science but has spread to other biological disciplines as genomics, proteomics, metabolomics, etc., become widespread. Models can become very complex, and with many predictors, collinearity is more likely. Fitting the models is tricky, particularly if we’re looking for the “best” model, and the way we approach the task depends on how we’ll use the model results. This chapter describes different model selection approaches for multiple regression models and discusses ways of measuring the importance of specific predictors. It covers stepwise procedures, all subsets, information criteria, model averaging and validation, and introduces regression trees, including boosted trees.