This chapter discusses how to build probabilistic models that include both discrete and continuous variables. Mathematically, this is achieved by defining them as random variables within the same probability space. In practice, the variables are manipulated using their marginal and conditional distributions. We define the conditional pmf of a discrete random variable given a continuous variable, and the conditional probability density of a continuous random variable given a discrete variable. We use these objects to build mixture models and apply them to model height in a population. Next, we describe Gaussian discriminant analysis, a classification method based on mixture models with Gaussian conditional distributions, and apply it to diagnose Alzheimer's disease. Then, we explain how to perform clustering using Gaussian mixture models and leverage the approach to cluster NBA players. Finally, we introduce the framework of Bayesian statistics which enables us to explicitly encode our uncertainty about model parameters, and use it to analyze poll data from the 2020 United States presidential election.
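The conditional pmf of a discrete variable given a continuous one can be illustrated with a minimal sketch. It assumes a one-dimensional mixture with two Gaussian components; the weights, means and standard deviations are hypothetical values chosen for illustration, not figures from the chapter. Bayes' rule turns the component densities into the probability of each component given an observed value.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def component_posterior(x, weights, means, stds):
    """Conditional pmf of the discrete component given the continuous value x."""
    # Joint term for each component: P(component) * density(x | component).
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)  # marginal density of x under the mixture
    return [j / total for j in joint]


# Hypothetical parameters: two groups with heights in centimetres.
weights = [0.5, 0.5]
means = [162.0, 176.0]
stds = [7.0, 7.0]

posterior = component_posterior(169.0, weights, means, stds)
```

Assigning an observation to the component with the largest posterior probability is the basic classification rule that a Gaussian mixture with known labels supports; the same quantity serves as a soft cluster assignment when the labels are unobserved.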
Factor score indeterminacy is a characteristic property of factor analysis (FA) models. This research introduces a novel procedure, regression-based factor score exploration (RFE), which uniquely determines factor scores and simultaneously estimates other parameters of the FA model. RFE uniquely determines factor scores by minimizing a loss function that balances FA and multivariate regression, regulated by a tuning parameter. Theoretical aspects of RFE, including the uniqueness of factor scores, the relationship between observed and latent variables, and rotational indeterminacy, are examined. Additionally, clustering-based factor exploration (CFE) is presented as a variant of RFE, derived by generalizing the penalty term to enable the clustering of factor scores. It is demonstrated that CFE creates cluster structures more accurately than the existing method. A simulation study shows that the proposed procedures accurately recover true parameter matrices even in the presence of error-contaminated data, with lower computational demand compared to existing methods. Real data examples illustrate that the proposed procedures provide interpretable results, demonstrating high relevance to the factor scores obtained by existing methods.
Avocado is a delicious fruit crop of great economic importance. Understanding the extent of variability present in the existing germplasm is important for identifying genotypes with specific traits and using them in crop improvement. Information on genetic variability in the morphological and biochemical traits of Indian avocados is limited, which has hindered genetic improvement of the crop. In the current study, 83 avocado accessions from different regions of India were assessed for 17 important morphological and 8 biochemical traits. The results showed wide variability for traits such as fruit weight (75.88–934.12 g), pulp weight (48.08–736.19 g), seed weight (6.37–32.62 g), FRAP activity (27.65–119.81 mg AEAC/100 g), total carotenoids (0.96–7.17 mg/100 g), oil content (4.91–25.49%) and crude fibre (6.85–20.75%) in the studied accessions. The first three components of principal component analysis explained 54.79 per cent of total variance. Traits such as fruit weight, pulp weight, seed weight, moisture and oil content contributed more significantly towards total variance than other traits. The dendrogram constructed using Euclidean distance and Ward's minimum variance method divided the 83 accessions into two major groups and nine subclusters, suggesting wide variability among the accessions with respect to the studied traits. In this study, superior accessions for important traits such as fruit size (PA-102, PA-012), high pulp recovery (PA-036, PA-082), thick peel (PA-084, PA-043, PA-011, PA-008), high carotenoids (PA-026, PA-096) and high oil content (PA-044, PA-043, PA-046, PA-045) were identified, which have potential utility in further crop improvement programmes.
Distinguishing between different phases of matter and detecting phase transitions are some of the most central tasks in many-body physics. Traditionally, these tasks are accomplished by searching for a small set of low-dimensional quantities capturing the macroscopic properties of each phase of the system, so-called order parameters. Because of the large state space underlying many-body systems, success generally requires a great deal of human intuition and understanding. In particular, it can be challenging to define an appropriate order parameter if the symmetry breaking pattern is unknown or the phase is of topological nature and thus exhibits nonlocal order. In this chapter, we explore the use of machine learning to automate the task of classifying phases of matter and detecting phase transitions. We discuss the application of various machine learning techniques, ranging from clustering to supervised learning and anomaly detection, to different physical systems, including the prototypical Ising model that features a symmetry-breaking phase transition and the Ising gauge theory which hosts a topological phase of matter.
This paper uses a two-step approach to modelling the probability of a policyholder making an auto insurance claim. We perform clustering via Gaussian mixture models and cluster-specific binary regression models. We use telematics information along with traditional auto insurance information and find that the best model incorporates telematics, without the need for dimension reduction via principal components. We also utilise the probabilistic estimates from the mixture model to account for the uncertainty in the cluster assignments. The clustering process allows for the creation of driving profiles and offers a fairer method for policyholder segmentation than when clustering is not used. By fitting separate regression models to the observations from the respective clusters, we are able to offer differential pricing, which recognises that policyholders have different exposures to risk despite having similar covariate information, such as total miles driven. The approach outlined in this paper offers an explainable and interpretable model that can compete with black box models. Our comparisons are based on a synthesised telematics data set that was emulated from a real insurance data set.
Although the fundamental idea of focusing cells so that they are ‘seen’ one by one by a detection system remains unchanged, flow cytometry technologies continue to evolve. This chapter provides an overview of recent progress in this evolution. From a technical point of view, cameras can provide images of each of these cells together with their fluorescent properties, or the whole spectrum of emitted light can be collected. Markers coupled to heavy metals allow each cell's immunophenotype to be detected by mass spectrometry. On the analysis side, artificial intelligence and machine learning are being developed for unsupervised analysis, saving time ahead of a much better supervised analysis of small populations.
This paper presents a cross-language study of lexical semantics within the framework of distributional semantics. We used a wide range of predefined semantic categories in Mandarin and English and compared the clusterings of these categories using FastText word embeddings. Three dimensionality reduction techniques were applied to map 300-dimensional FastText vectors onto two-dimensional planes: multidimensional scaling (MDS), principal components analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). The results show that t-SNE provides the clearest clustering of semantic categories, improving markedly on PCA and MDS. In both languages, we observed similar differentiation between verbs, adjectives, and nouns as well as between concrete and abstract words. In addition, the methods applied in this study, especially Procrustes analysis, make it possible to trace subtle differences in the structure of the semantic lexicons of Mandarin and English.
This paper presents a new hierarchical classes model, called Tucker2-HICLAS, for binary three-way three-mode data. Like any three-way hierarchical classes model, the Tucker2-HICLAS model includes a representation of the association relation among the three modes and a hierarchical classification of the elements of each mode. A distinctive feature of the Tucker2-HICLAS model, which is closely related to the Tucker3-HICLAS model (Ceulemans, Van Mechelen & Leenen, 2003), is that one of the three modes is minimally reduced and, hence, that the differences among the association patterns of the elements of this mode are maximally retained in the model. Moreover, compared to Tucker3-HICLAS, Tucker2-HICLAS implies three rather than four different types of parameters and as such is simpler to interpret. Two types of Tucker2-HICLAS models are distinguished: a disjunctive and a conjunctive type. An algorithm for fitting the Tucker2-HICLAS model is described and evaluated in a simulation study. The model is illustrated with longitudinal data on interpersonal emotions.
This paper describes the conjunctive counterpart of De Boeck and Rosenberg's hierarchical classes model. Both the original model and its conjunctive counterpart represent the set-theoretical structure of a two-way two-mode binary matrix. However, unlike the original model, the new model represents the row-column association as a conjunctive function of a set of hypothetical binary variables. The conjunctive nature of the new model further implies that it may represent some conjunctive higher order dependencies among rows and columns. The substantive significance of the conjunctive model is illustrated with empirical applications. Finally, it is shown how conjunctive and disjunctive hierarchical classes models relate to Galois lattices, and how hierarchical classes analysis can be useful to construct lattice models of empirical data.
Two-mode binary data matrices arise in a variety of social network contexts, such as the attendance or non-attendance of individuals at events, the participation or lack of participation of groups in projects, and the votes of judges on cases. A popular method for analyzing such data is two-mode blockmodeling based on structural equivalence, where the goal is to identify partitions for the row and column objects such that the clusters of the row and column objects form blocks that are either complete (all 1s) or null (all 0s) to the greatest extent possible. Multiple restarts of an object relocation heuristic that seeks to minimize the number of inconsistencies (i.e., 1s in null blocks and 0s in complete blocks) with ideal block structure is the predominant approach for tackling this problem. As an alternative, we propose a fast and effective implementation of tabu search. Computational comparisons across a set of 48 large network matrices revealed that the new tabu-search heuristic always provided objective function values that were better than those of the relocation heuristic when the two methods were constrained to the same amount of computation time.
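The objective described above can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: it assumes each block is labelled complete or null according to whichever label produces fewer deviations, so a block contributes the smaller of its count of 1s and its count of 0s. The function name and the small attendance matrix are hypothetical.

```python
def count_inconsistencies(matrix, row_clusters, col_clusters):
    """Count deviations from an ideal complete/null block structure.

    Assumption: every block takes the label (complete or null) that
    minimises its own deviations, so it contributes min(#1s, #0s).
    """
    ones = {}
    sizes = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            block = (row_clusters[i], col_clusters[j])
            ones[block] = ones.get(block, 0) + value
            sizes[block] = sizes.get(block, 0) + 1
    return sum(min(ones[b], sizes[b] - ones[b]) for b in sizes)


# Hypothetical data: 3 individuals (rows) by 4 events (columns).
attendance = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
row_clusters = [0, 0, 1]      # cluster label of each row object
col_clusters = [0, 0, 1, 1]   # cluster label of each column object

inconsistencies = count_inconsistencies(attendance, row_clusters, col_clusters)
```

A relocation or tabu-search heuristic would evaluate this kind of count repeatedly while moving single objects between clusters; the search strategy itself is not shown here.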
In this paper, hierarchical and non-hierarchical tree structures are proposed as models of similarity data. Trees are viewed as intermediate between multidimensional scaling and simple clustering. Procedures are discussed for fitting both types of trees to data. The concept of multiple tree structures shows great promise for analyzing more complex data. Hybrid models in which multiple trees and other discrete structures are combined with continuous dimensions are discussed. Examples of the use of multiple tree structures and hybrid models are given. Extensions to the analysis of individual differences are suggested.
A monotone invariant method of hierarchical clustering based on the Mann-Whitney U-statistic is presented. The effectiveness of the complete-link, single-link, and U-statistic methods in recovering tree structures from error-perturbed data is evaluated. The U-statistic method is found to be consistently more effective in recovering the original tree structures than either the single-link or complete-link methods.
Extended redundancy analysis (ERA), a generalized version of redundancy analysis (RA), has been proposed as a useful method for examining interrelationships among multiple sets of variables in multivariate linear regression models. As a limitation of the extant RA or ERA analyses, however, parameters are estimated by aggregating data across all observations even in a case where the study population could consist of several heterogeneous subpopulations. In this paper, we propose a Bayesian mixture extension of ERA to obtain both probabilistic classification of observations into a number of subpopulations and estimation of ERA models within each subpopulation. It specifically estimates the posterior probabilities of observations belonging to different subpopulations, subpopulation-specific residual covariance structures, component weights and regression coefficients in a unified manner. We conduct a simulation study to demonstrate the performance of the proposed method in terms of recovering parameters correctly. We also apply the approach to real data to demonstrate its empirical usefulness.
A Generalized INDCLUS model, termed GINDCLUS, is presented for clustering three-way two-mode proximity data. In order to account for the heterogeneity of the data, both a partition of the subjects into homogeneous classes and a covering of the objects into groups are simultaneously determined. Furthermore, the availability of information which is external to the three-way data is exploited to better account for such heterogeneity: the weights of both classifications are linearly linked to external variables allowing for the identification of meaningful classes of subjects and groups of objects. The model is fitted in a least-squares framework, and an efficient Alternating Least-Squares algorithm is provided. An extensive simulation study and an application on benchmark data are also presented.
In many psychological research domains stimulus-response profiles are explained by conjecturing a sequential process in which some variables mediate between stimuli and responses. Charting sequential processes is often a complex task because (1) many possible mediating variables may exist, and (2) interindividual differences may occur in the relationship between these mediating variables and the response. Recently, Ceulemans and Van Mechelen (Psychometrika 73(1):107–124, 2008) addressed these challenges by developing the CLASSI model. A major drawback of CLASSI is that it requires information about the same set of stimuli for all participants (i.e., crossed data), whereas recently a number of data gathering techniques have been proposed in which the set of stimuli differs across participants, yielding nested data. Therefore we present the CLASSI-N model, which extends the CLASSI model to nested data. A simulated annealing algorithm is proposed. The results of a simulation study are discussed as well as an application to data concerning depression.
In this paper, we consider a class of models for two-way matrices with binary entries of 0 and 1. First, we consider Boolean matrix decomposition, conceptualize it as a latent response model (LRM) and, by making use of this conceptualization, generalize it to a larger class of matrix decomposition models. Second, probability matrix decomposition (PMD) models are introduced as a probabilistic version of this larger class of deterministic matrix decomposition models. Third, an algorithm for the computation of the maximum likelihood (ML) and the maximum a posteriori (MAP) estimates of the parameters of PMD models is presented. This algorithm is an EM-algorithm, and is a special case of a more general algorithm that can be used for the whole class of LRMs. And fourth, as an example, a PMD model is applied to data on decision making in psychiatric diagnosis.
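As a minimal sketch of the deterministic starting point, the following reconstructs a binary matrix as the Boolean product of two binary factor matrices. It assumes the usual disjunctive rule (an entry is 1 if the row and column share at least one latent variable); the abstract does not spell out the rule, and the function name and small matrices are hypothetical.

```python
def boolean_product(row_bundles, col_bundles):
    """Boolean matrix product: entry (i, j) is 1 if row i and column j
    share at least one latent binary variable, and 0 otherwise."""
    return [
        [int(any(a and b for a, b in zip(row, col))) for col in col_bundles]
        for row in row_bundles
    ]


# Hypothetical factor matrices with two latent binary variables.
row_bundles = [[1, 0], [0, 1], [1, 1]]   # 3 rows x 2 latent variables
col_bundles = [[1, 0], [0, 1]]           # 2 columns x 2 latent variables

reconstructed = boolean_product(row_bundles, col_bundles)
```

A probabilistic version would replace the fixed 0/1 factor entries with Bernoulli latent responses and estimate their probabilities, which is where an EM-type algorithm comes in; that step is not sketched here.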
A three-way three-mode extension of De Boeck and Rosenberg's (1988) two-way two-mode hierarchical classes model is presented for the analysis of individual differences in binary object × attribute arrays. In line with the two-way hierarchical classes model, the three-way extension represents both the association relation among the three modes and the set-theoretical relations among the elements of each mode. An algorithm for fitting the model is presented and evaluated in a simulation study. The model is illustrated with data on psychiatric diagnosis. Finally, the relation between the model and extant models for three-way data is discussed.
Quite a few studies in the behavioral sciences result in hierarchical time profile data, with a number of time profiles being measured for each person under study. Associated research questions often focus on individual differences in profile repertoire, that is, differences between persons in the number and the nature of profile shapes that show up for each person. In this paper, we introduce a new method, called KSC-N, that parsimoniously captures such differences while neatly disentangling variability in shape and amplitude. KSC-N induces a few person clusters from the data and derives for each person cluster the types of profile shape that occur most for the persons in that cluster. An algorithm for fitting KSC-N is proposed and evaluated in a simulation study. Finally, the new method is applied to emotional intensity profile data.
In this paper, we propose a cluster-MDS model for two-way one-mode continuous rating dissimilarity data. The model aims at partitioning the objects into classes and simultaneously representing the cluster centers in a low-dimensional space. Under the normal distribution assumption, a latent class model is developed in terms of the set of dissimilarities in a maximum likelihood framework. In each iteration, a simulated annealing based algorithm estimates the probability that a dissimilarity belongs to each of the blocks induced by a partition of the original dissimilarity matrix, together with the remaining parameters. A model selection strategy is used to test the number of latent classes and the dimensionality of the problem. Both simulated and classical dissimilarity data are analyzed to illustrate the model.
A least squares algorithm for fitting additive trees to proximity data is described. The algorithm uses a penalty function to enforce the four point condition on the estimated path length distances. The algorithm is evaluated in a small Monte Carlo study. Finally, an illustrative application is presented.
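The four-point condition can be made concrete with a short sketch. For any four objects, the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l) and d(i,l)+d(j,k) must have their two largest values equal if the distances are path lengths in an additive tree. The penalty below, the squared gap between the two largest sums added over all quadruples, is an assumed form chosen for illustration and is not necessarily the penalty function used in the paper.

```python
from itertools import combinations


def four_point_penalty(d):
    """Sum of squared four-point violations over all quadruples.

    Assumed penalty form: for each quadruple, the squared difference
    between the two largest of the three pairwise distance sums.
    """
    penalty = 0.0
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        penalty += (sums[2] - sums[1]) ** 2
    return penalty


# Path-length distances from a hypothetical tree: leaves 0 and 1 hang off
# one internal node (branch lengths 1 and 2), leaves 2 and 3 off another
# (branch lengths 3 and 4), with an internal edge of length 5.
tree_distances = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# Perturb one distance so that the condition no longer holds.
perturbed = [row[:] for row in tree_distances]
perturbed[0][3] = perturbed[3][0] = 12

exact_penalty = four_point_penalty(tree_distances)
perturbed_penalty = four_point_penalty(perturbed)
```

In a penalized least-squares fit, a term of this kind would be added to the squared discrepancy between the observed proximities and the estimated distances, with a weight that grows until the estimates satisfy the condition.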