Chapters 14 and 15 describe a method for the identification of parsimonious and robust multivariate biomarkers that may also have the best chance for plausible biological interpretation. The method is based on multistage signal enhancement and identification of essential patterns. Chapter 14 covers the first logical part of this method – the multistage signal enhancement approach leading to the identification of a pool of potentially important variables.
Chapter 6 starts with the description of multiple regression. Even if it is unlikely for multiple regression to be used as the primary method for multivariate biomarker discovery based on high-dimensional data, presenting this classical method provides the necessary background for regression analysis and highlights the weaknesses of multiple regression, which will be addressed by the subsequently presented methods. This chapter also presents partial least squares regression (PLSR), which by performing supervised dimensionality reduction addresses some weaknesses of multiple regression; however, by not performing any feature selection, PLSR does not reduce noise that is typically abundant in high-dimensional data.
Chapter 13 discusses neural networks and deep learning, including a presentation of deep convolutional networks, which seem to have great potential in the classification of medical images.
Chapter 5 is dedicated to the most important part of predictive modeling for biomarker discovery based on high-dimensional data – multivariate feature selection. When dealing with sparse biomedical data whose dimensionality is much higher than the number of training observations, the crucial issue is to overcome the curse of dimensionality by using methods capable of elevating signal (predictive information) from the overwhelming noise. One way of doing this is to perform many (hundreds or thousands) parallel feature selection experiments based on different random subsamples of the original training data and then to aggregate their results (for example, by analyzing the distribution of variables among the results of those parallel experiments). Two designs of such parallel feature selection experiments are discussed in detail: one based on recursive feature elimination, and the other on stepwise hybrid selection with T². The chapter also includes descriptions of three evolutionary feature selection algorithms: simulated annealing, genetic algorithms, and particle swarm optimization.
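The subsample-and-aggregate idea can be illustrated with a minimal Python sketch. It is not the chapter's method: the synthetic data, the function names, and the deliberately simple selector (ranking variables by the absolute difference of class means) are all assumptions made for illustration, standing in for recursive feature elimination or stepwise hybrid selection. The experiments are independent of one another, so here they are simply run in a loop.

```python
import random
from collections import Counter


def make_data(rng, n_per_class=20, n_features=20, informative=(0, 1), shift=3.0):
    """Synthetic two-class data (hypothetical): only `informative` columns carry signal."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in informative:
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def select_top_k(X, y, k):
    """Stand-in selector: rank variables by absolute difference of class means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def repeated_selection(X, y, n_experiments=200, fraction=0.75, k=2, seed=0):
    """Run many selection experiments on random subsamples and count selections."""
    rng = random.Random(seed)
    counts = Counter()
    n = len(X)
    for _ in range(n_experiments):
        idx = rng.sample(range(n), int(fraction * n))  # subsample without replacement
        counts.update(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return counts


X, y = make_data(random.Random(42))
counts = repeated_selection(X, y)
# Variables selected in most experiments are candidates for the biomarker.
print(counts.most_common(5))
```

Variables that appear in a large share of the experiments are the ones least likely to have been selected by chance in a single subsample, which is the rationale for aggregating.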
Chapter 11 presents classification with support vector machines – details of the algorithms for linear and nonlinear SVMs. Discussed are also kernel functions, hyperparameters, variable importance measures, and cost-sensitive SVMs.
Chapter 3 provides an overview of all elements of the predictive modeling process, from the selection of training and test data sets, parallel multivariate feature selection experiments and deciding on an optimal multivariate biomarker, to building, tuning, validating, and testing predictive models implementing the optimal biomarker. Discussed are also such topics as bias-variance tradeoff, segmentation models, and committees of predictive models.
Chapter 1 focuses on terminology and basic concepts of the area, and places multivariate biomarker discovery in the context of biomarker studies and personalized medicine. For ease of reference, included are also short descriptions of some of the terms and concepts introduced and discussed in various parts of the book.
Chapter 10 covers the random forests algorithm for classification. Presented are also the impurity metrics applicable to splitting nodes in classification trees (Gini, entropy, and misclassification impurity), as well as permutation-based and impurity-based variable importance measures.
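As a brief illustration of the three impurity metrics named above, the following Python sketch computes each from the class proportions in a node. The function names are illustrative, and the formulas are the standard textbook definitions rather than anything taken from the chapter itself.

```python
import math


def gini_impurity(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy_impurity(proportions):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


def misclassification_impurity(proportions):
    """Misclassification impurity: 1 minus the largest class proportion."""
    return 1.0 - max(proportions)


balanced = [0.5, 0.5]   # maximally impure two-class node
pure = [1.0, 0.0]       # node containing a single class

for name, fn in [("gini", gini_impurity),
                 ("entropy", entropy_impurity),
                 ("misclassification", misclassification_impurity)]:
    print(name, fn(balanced), fn(pure))
```

All three metrics are zero for a pure node and reach their maximum when the classes are evenly mixed; they differ in how sharply they penalize intermediate mixtures.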
Chapter 12 presents discriminant analysis – a classical (and powerful) supervised learning approach for classification. Discussed are Fisher’s discriminant analysis, as well as Gaussian linear, quadratic, and regularized discriminant analysis. The chapter concludes with a discussion of partial least squares discriminant analysis, which is still popular in some application areas, even if its application to high-dimensional data is likely to result in solutions that are suboptimal in terms of predictive abilities and interpretability (alternative approaches are recommended).
Chapter 17 describes the second real-life study, whose goal is the identification of multivariate biomarkers for liver cancer. This study implements parallel recursive feature elimination experiments coupled with random forests and support vector machines. Included are also considerations for rebalancing class proportions. Three multivariate biomarkers for liver cancer have been identified. The study has been performed in an R environment, and R scripts for all of its steps are provided.
Chapter 16 presents the first of the two real-life multivariate biomarker discovery studies included in the book. The goal of this study – which implements the method presented in Chapters 14 and 15 – is to identify the essential gene expression patterns and a multivariate biomarker common to multiple types of cancer. This study is based on the TCGA RNA-Seq data of 3,528 patients and 20,530 gene expression variables; the data represent five tumor types from five different tissues. A parsimonious multivariate biomarker (consisting of ten genes) with high sensitivity and specificity has been identified.
Chapter 9 presents support vector regression (SVR), a relatively new supervised learning algorithm for predictive regression modeling, which – like random forests for regression – may also outperform the least-squares-based methods. Discussed are the ε-insensitive loss used by SVR, the ε-tube concept, and algorithms for linear and nonlinear SVRs.
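For readers unfamiliar with the ε-insensitive loss, here is a minimal Python sketch of its standard definition, max(0, |y − f(x)| − ε). The function name and the numeric values are illustrative assumptions, not taken from the chapter.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero loss inside the epsilon-tube; linear loss outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


epsilon = 0.1
# Prediction within the tube (error 0.05 <= epsilon): no penalty.
inside = epsilon_insensitive_loss(1.0, 1.05, epsilon)
# Prediction outside the tube (error 0.5): penalized only for the excess over epsilon.
outside = epsilon_insensitive_loss(1.0, 1.5, epsilon)
print(inside, outside)
```

Because errors smaller than ε cost nothing, only observations on or outside the tube influence the fitted function, which is what makes the resulting solution sparse.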
Chapter 4 provides detailed coverage of methods for the evaluation of predictive models: the methods applicable to regression models implementing estimation biomarkers, as well as methods evaluating binary and multiclass classification models. The discussion of resampling techniques stresses the danger of information leakage and the paramount importance of avoiding internal validation. The discussion of metrics for the evaluation of classification biomarkers includes the issue of proper and improper interpretation of sensitivity and specificity, illustrated by an example of a screening biomarker targeting a population with low prevalence of the tested disease. For such biomarkers, positive predictive value may be unacceptably low even when the biomarker has very high specificity and sensitivity. Also discussed in this chapter are misclassification costs and their incorporation into cost-sensitive classification.
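The effect of low prevalence on positive predictive value can be checked with a short calculation based on Bayes' theorem. The Python sketch below uses hypothetical numbers (99% sensitivity, 99% specificity, prevalence of 1 in 1,000) chosen only for illustration; they are not the figures from the chapter's example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), computed via Bayes' theorem."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical screening biomarker in a low-prevalence population.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"PPV = {ppv:.3f}")  # prints "PPV = 0.090"
```

Under these assumed values, only about 9% of positive results correspond to true disease, because false positives from the large healthy majority outnumber true positives from the small diseased group.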