Chapter 8 presents random forests for regression, which – at least in some situations – may outperform the least-squares-based regression methods. The chapter discusses bagging in the context of regression applications of random forests, the algorithm for splitting nodes in regression trees, and the variable importance metrics applicable to regression.
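As a rough illustration of node splitting in a regression tree, the sketch below searches one numeric variable for the threshold that minimizes the total within-node sum of squared errors. This squared-error criterion is a common (CART-style) choice and is assumed here; the chapter's algorithm may differ in detail. The names `sse` and `best_split` and the toy data are illustrative only. In a random forest, a search of this kind would be repeated over a random subset of variables at each node of each tree grown on a bootstrap sample.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split on one variable.

    Candidate thresholds are midpoints between consecutive distinct sorted
    x values. If no split improves on the unsplit node, threshold is None.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical x values cannot be separated
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((lo + hi) / 2, total)
    return best


# Toy data (an assumption for illustration): two well-separated groups.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, remaining_sse = best_split(x, y)
print(threshold, remaining_sse)
```

The chosen threshold falls between the two groups of observations, where splitting removes most of the response variance.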
Chapter 7 is dedicated to regularized regression methods, which, by penalizing overly complex models, can provide a reasonable tradeoff between bias and variance. Ridge regression implements L2 regularization, which yields more generalizable models but performs no feature selection. The L1 penalty used by the lasso, in contrast, allows for simultaneous regularization and feature selection. The elastic net algorithm applies both L1 and L2 penalties, which allows for solutions combining the advantages of ridge regression and the lasso. The chapter concludes by discussing a general class of Lq-regularized least squares optimization problems.
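The difference between the penalties can be sketched in the simplest possible setting: a single coefficient whose unpenalized least-squares estimate is `b`, with the penalized estimate minimizing (1/2)(b − β)² plus the penalty. This one-coefficient setup, the penalty scaling, and the function names are assumptions made for illustration, not the chapter's formulation. Under these assumptions, an L2 penalty (λ2/2)β² shrinks the estimate proportionally, an L1 penalty λ1|β| applies soft-thresholding (which can set it exactly to zero), and the elastic net does both.

```python
def soft_threshold(b, lam):
    """Move b toward zero by lam, returning exactly 0.0 when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam2):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam2*beta**2."""
    return b / (1.0 + lam2)


def lasso_shrink(b, lam1):
    """Minimizer of 0.5*(b - beta)**2 + lam1*abs(beta)."""
    return soft_threshold(b, lam1)


def elastic_net_shrink(b, lam1, lam2):
    """Minimizer of 0.5*(b - beta)**2 + lam1*abs(beta) + 0.5*lam2*beta**2."""
    return soft_threshold(b, lam1) / (1.0 + lam2)


# A weak and a strong unpenalized coefficient (illustrative values).
for b in (0.3, 2.0):
    print(b, ridge_shrink(b, 1.0), lasso_shrink(b, 0.5),
          elastic_net_shrink(b, 0.5, 1.0))
```

With these values, ridge keeps the weak coefficient nonzero, while the lasso and the elastic net remove it, which is the feature selection effect described above.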
Chapter 15 starts with the identification of essential patterns by analyzing distributions of groups of variables (with similar patterns) among a large number of optimal-size biomarkers generated by parallel feature selection experiments. A similar approach leads to the identification of the essential variables of those essential patterns. As a result, the final multivariate biomarker identified via this method is most likely to represent a real population pattern associated with biological processes underlying changes in the investigated response variable. Furthermore, having the variables of the final biomarker associated with their respective essential patterns facilitates biological interpretation of the biomarker.
Chapter 2 continues with concepts important to multivariate analysis of high-dimensional data, adding considerations related to the curse of dimensionality and descriptions of common misconceptions. It discusses the fallacy of applying to high-dimensional data the same methods that were used successfully in low-dimensional settings, and explains the misconceptions behind driving biomarker discovery studies by univariate and unsupervised approaches.
Chapters 14 and 15 describe a method for the identification of parsimonious and robust multivariate biomarkers that may also have the best chance for plausible biological interpretation. The method is based on multistage signal enhancement and identification of essential patterns. Chapter 14 covers the first logical part of this method – the multistage signal enhancement approach leading to the identification of a pool of potentially important variables.
Chapter 6 starts with a description of multiple regression. Although multiple regression is unlikely to be used as the primary method for multivariate biomarker discovery based on high-dimensional data, presenting this classical method provides the necessary background for regression analysis and highlights its weaknesses, which are addressed by the subsequently presented methods. The chapter also presents partial least squares regression (PLSR), which, by performing supervised dimensionality reduction, addresses some weaknesses of multiple regression; however, because it performs no feature selection, PLSR does not reduce the noise that is typically abundant in high-dimensional data.
Chapter 13 discusses neural networks and deep learning, including a presentation of deep convolutional networks, which appear to have great potential in the classification of medical images.
Chapter 5 is dedicated to the most important part of predictive modeling for biomarker discovery based on high-dimensional data: multivariate feature selection. When dealing with sparse biomedical data whose dimensionality is much higher than the number of training observations, the crucial issue is to overcome the curse of dimensionality by using methods capable of separating signal (predictive information) from the overwhelming noise. One way of doing this is to perform many (hundreds or thousands of) parallel feature selection experiments based on different random subsamples of the original training data and then aggregate their results (for example, by analyzing the distribution of variables among the results of those parallel experiments). Two designs of such parallel feature selection experiments are discussed in detail: one based on recursive feature elimination, and the other on stepwise hybrid selection with T². The chapter also includes descriptions of three evolutionary feature selection algorithms: simulated annealing, genetic algorithms, and particle swarm optimization.
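The idea of running many feature selection experiments on random subsamples and aggregating their results can be sketched as follows. Everything named here is hypothetical: `aggregate_selection` is an illustrative driver, and `toy_selector` is a deliberately simple placeholder that ranks variables by the difference in class means. It is not a multivariate method and only stands in for a real selector such as the recursive feature elimination design mentioned above. The data are synthetic.

```python
import random
from collections import Counter


def aggregate_selection(rows, labels, select_features,
                        n_experiments=50, subsample_fraction=0.8, seed=0):
    """Count how often each variable is selected across random subsamples."""
    rng = random.Random(seed)
    n = len(rows)
    k = max(2, int(n * subsample_fraction))
    counts = Counter()
    for _ in range(n_experiments):
        idx = rng.sample(range(n), k)  # subsample without replacement
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        counts.update(select_features(sub_rows, sub_labels))
    return counts


def toy_selector(rows, labels, top_m=1):
    """Placeholder selector: top_m variables by |difference of class means|.

    Assumes both classes (0 and 1) are present in the subsample.
    """
    scores = []
    for j in range(len(rows[0])):
        g0 = [r[j] for r, c in zip(rows, labels) if c == 0]
        g1 = [r[j] for r, c in zip(rows, labels) if c == 1]
        diff = abs(sum(g1) / len(g1) - sum(g0) / len(g0))
        scores.append((diff, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_m]]


# Synthetic data: variable 0 tracks the class label, variables 1 and 2 do not.
labels = [i % 2 for i in range(20)]
rows = [[3.0 * (i % 2) + 0.1 * (i % 5), 0.5 * (i % 3), 0.3 * (i % 4)]
        for i in range(20)]

counts = aggregate_selection(rows, labels, toy_selector)
print(counts.most_common())
```

The resulting counts are the kind of distribution of variables across experiments that can then be analyzed; variables selected in most subsamples are less likely to reflect noise in any single sample.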
Chapter 11 presents classification with support vector machines, including details of the algorithms for linear and nonlinear SVMs. Also discussed are kernel functions, hyperparameters, variable importance measures, and cost-sensitive SVMs.
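For orientation, three widely used kernel functions can be written in a few lines. The selection of kernels, the parameter names (`degree`, `coef0`, `gamma`), and their default values are assumptions for illustration; the chapter may cover a different set or use different parameterizations.

```python
import math


def linear_kernel(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """(<u, v> + coef0) ** degree."""
    return (linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """exp(-gamma * ||u - v||^2); equals 1.0 when u == v."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v))
```

A nonlinear SVM uses such a function in place of the inner product, so `gamma` and `degree` become hyperparameters to tune alongside the cost parameter.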
Chapter 3 provides an overview of all elements of the predictive modeling process, from the selection of training and test data sets, through parallel multivariate feature selection experiments and deciding on an optimal multivariate biomarker, to building, tuning, validating, and testing predictive models implementing the optimal biomarker. Also discussed are topics such as the bias-variance tradeoff, segmentation models, and committees of predictive models.
Chapter 1 focuses on the terminology and basic concepts of the area, and places multivariate biomarker discovery in the context of biomarker studies and personalized medicine. For ease of reference, it also includes short descriptions of some of the terms and concepts introduced and discussed in various parts of the book.
Chapter 10 covers the random forests algorithm for classification. Also presented are the impurity metrics applicable to splitting nodes in classification trees (Gini, entropy, and misclassification impurity), as well as permutation-based and impurity-based variable importance measures.
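The three impurity metrics named above have standard definitions in terms of the class proportions p within a node: Gini is 1 − Σp², entropy is −Σp·log2(p), and misclassification impurity is 1 − max(p). The sketch below computes them for a list of class labels; the function names and the use of base-2 logarithms are choices made here for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the labels in a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def misclassification(labels):
    """Misclassification impurity: 1 - proportion of the majority class."""
    return 1.0 - max(class_proportions(labels))


mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
for node in (mixed, pure):
    print(gini(node), entropy(node), misclassification(node))
```

All three metrics are zero for a pure node and largest for an evenly mixed one; a split is chosen to reduce the impurity of the child nodes relative to the parent.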