Chapter 4 provides detailed coverage of methods for the evaluation of predictive models: methods applicable to regression models implementing estimation biomarkers, as well as methods for evaluating binary and multiclass classification models. The discussion of resampling techniques highlights the danger of information leakage and emphasizes the paramount importance of avoiding internal validation. The discussion of metrics for the evaluation of classification biomarkers includes the proper and improper interpretation of sensitivity and specificity, illustrated by the example of a screening biomarker targeting a population with low prevalence of the tested disease. For such biomarkers, the positive predictive value may be unacceptably low even when the biomarker has very high specificity and sensitivity. The chapter also discusses misclassification costs and their incorporation into cost-sensitive classification.
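The low-prevalence issue can be made concrete with a small calculation. The sketch below is illustrative only (the prevalence, sensitivity, and specificity values are hypothetical, not taken from the chapter) and applies Bayes' theorem to show how the positive predictive value of a screening biomarker collapses when prevalence is low.

```python
# Illustrative sketch (not from the book): why a screening biomarker with
# high sensitivity and specificity can still have a low positive predictive
# value when disease prevalence is low. The numbers below are hypothetical.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV = P(disease | positive test) via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # Hypothetical screening setting: 1% prevalence, 95% sensitivity, 95% specificity.
    ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.95, specificity=0.95)
    print(f"PPV = {ppv:.3f}")  # about 0.161: most positive calls are false positives
```

Even with 95% sensitivity and specificity, roughly five of every six positive calls are false positives at 1% prevalence.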
Given a sequence of independent random vectors taking values in ${\mathbb R}^d$ and having common continuous distribution function $F$, say that the $n$th observation sets a (Pareto) record if it is not dominated (in every coordinate) by any preceding observation. Let $p_n(F) \equiv p_{n, d}(F)$ denote the probability that the $n$th observation sets a record. There are many interesting questions to address concerning $p_n$ and multivariate records more generally, but this short paper focuses on how $p_n$ varies with $F$, particularly if, under $F$, the coordinates exhibit negative dependence or positive dependence (rather than independence, a more-studied case). We introduce new notions of negative and positive dependence ideally suited for such a study, called negative record-setting probability dependence (NRPD) and positive record-setting probability dependence (PRPD), relate these notions to existing notions of dependence, and for fixed $d \geq 2$ and $n \geq 1$ prove that the image of the mapping $p_n$ on the domain of NRPD (respectively, PRPD) distributions is $[p^*_n, 1]$ (respectively, $[n^{-1}, p^*_n]$), where $p^*_n$ is the record-setting probability for any continuous $F$ governing independent coordinates.
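As a quick numerical illustration of the record-setting probability $p_{n,d}$ (our sketch, not code from the paper), a Monte Carlo estimate for independent coordinates can be obtained directly from the definition: the $n$th observation sets a record exactly when no earlier observation dominates it in every coordinate.

```python
# Monte Carlo sketch (not from the paper): estimate the probability p_{n,d}
# that the n-th observation sets a Pareto record, here for independent
# uniform coordinates (any continuous F with independent coordinates gives
# the same value, since the event depends only on coordinatewise ranks).
import numpy as np

def estimate_record_probability(n, d, trials=50_000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x = rng.random((n, d))          # n observations, uniform on [0, 1]^d
        # The n-th observation sets a record if no earlier observation
        # dominates it in every coordinate.
        dominated = np.all(x[:-1] >= x[-1], axis=1).any()
        hits += not dominated
    return hits / trials

print(estimate_record_probability(n=10, d=1))  # ~0.1 = 1/n in one dimension
print(estimate_record_probability(n=10, d=2))  # noticeably larger for d = 2
```

For $d = 1$ the estimate is close to $1/n$, as expected, and it grows with the dimension $d$.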
Aging ships and offshore structures face harsh environmental and operational conditions in remote areas, leading to age-related damage such as corrosion wastage, fatigue cracking, and mechanical denting. These deteriorations, if left unattended, can escalate into catastrophic failures, causing casualties, property damage, and marine pollution. Hence, ensuring the safety and integrity of aging ships and offshore structures is paramount and achievable through innovative healthcare schemes. One such paradigm, digital healthcare engineering (DHE), initially introduced by the final coauthor, aims at providing lifetime healthcare for engineered structures, infrastructure, and individuals (e.g., seafarers) by harnessing advancements in digitalization and communication technologies. The DHE framework comprises five interconnected modules: on-site health parameter monitoring; data transmission to analytics centers; data analytics, simulation, and visualization via digital twins; artificial intelligence-driven diagnosis and remedial planning using machine and deep learning; and predictive health condition analysis for future maintenance. This article surveys recent technological advancements pertinent to each DHE module, with a focus on its application to aging ships and offshore structures. The primary objectives include identifying cost-effective and accurate techniques to establish a DHE system for lifetime healthcare of aging ships and offshore structures, a project currently in progress by the authors.
It is proved that for families of stochastic operators on a countable tensor product, depending smoothly on parameters, any spectral projection persists smoothly, where smoothness is defined using norms based on ideas of Dobrushin. A rigorous perturbation theory for families of stochastic operators with spectral gap is thereby created. It is illustrated by deriving an effective slow two-state dynamics for a three-state probabilistic cellular automaton.
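The notions of spectral gap and effective slow dynamics can be illustrated, in a drastically simplified finite-dimensional setting that is not the paper's operator-theoretic framework, by a single three-state stochastic matrix whose two slowest eigenvalues are separated from the rest of the spectrum; the matrix below is hypothetical.

```python
# Toy illustration (not the paper's setting): a single 3-state Markov chain
# with a spectral gap, where the two slowest eigenvalues span a subspace
# supporting an effective two-state slow dynamics. The transition matrix
# below is hypothetical.
import numpy as np

# Row-stochastic transition matrix: states 0 and 1 mix quickly with each
# other, state 2 exchanges with them only slowly.
P = np.array([
    [0.50, 0.49, 0.01],
    [0.49, 0.50, 0.01],
    [0.02, 0.02, 0.96],
])

eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
print("eigenvalue moduli:", np.round(eigvals, 3))
# The leading eigenvalue is 1; the next is close to 1 (slow mode) and the
# third is much smaller (fast mode), so there is a spectral gap below the
# two-dimensional slow subspace. Projecting onto that subspace gives an
# effective two-state description: roughly "in state 2" vs. "not".
```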
This retrospective study compared central line-associated bloodstream infection (CLABSI) rates per 1 000 central line days, and overall mortality, before and during the COVID-19 pandemic in adult, paediatric, and neonatal ICU patients at King Abdul-Aziz Medical City-Riyadh who had a central line and were diagnosed with CLABSI according to the National Healthcare Safety Network standard definition. The study covered January 2018 to December 2019 (pre-pandemic) and January 2020 to December 2021 (pandemic). SARS-CoV-2 infection was confirmed by positive RT-PCR testing. The study included 156 CLABSI events and 46 406 central line days: 52 events and 22 447 days pre-pandemic, and 104 events and 23 959 days during the pandemic. CLABSI rates increased by 2.02 per 1 000 central line days during the pandemic period (from 2.32 to 4.34, p < 0.001). Likewise, overall mortality rates increased by 0.86 per 1 000 patient days (from 0.93 to 1.79, p = 0.003). Both CLABSI rates (6.18 vs. 3.7, p = 0.006) and overall mortality (2.72 vs. 1.47, p = 0.014) were higher among COVID-19 patients than among non-COVID-19 patients. The pandemic was associated with a substantial increase in CLABSI-associated morbidity and mortality.
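For readers unfamiliar with the rate convention, the following sketch (illustrative arithmetic only, not the study's analysis code) reproduces the reported per-1 000-central-line-day rates from the event and line-day counts quoted above.

```python
# Illustrative arithmetic check (not from the study's analysis code):
# CLABSI rate = events / central line days * 1000.
def rate_per_1000(events, line_days):
    return events / line_days * 1000

pre_rate = rate_per_1000(52, 22_447)    # pre-pandemic
pan_rate = rate_per_1000(104, 23_959)   # pandemic

print(f"pre-pandemic: {pre_rate:.2f}")          # ~2.32
print(f"pandemic:     {pan_rate:.2f}")          # ~4.34
print(f"increase:     {pan_rate - pre_rate:.2f}")  # ~2.02
```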
We develop and demonstrate a computationally cheap framework to identify optimal experiments for Bayesian inference of physics-based models. We develop the metrics (i) to identify optimal experiments to infer the unknown parameters of a physics-based model, (ii) to identify optimal sensor placements for parameter inference, and (iii) to identify optimal experiments to perform Bayesian model selection. We demonstrate the framework on thermoacoustic instability, which is an industrially relevant problem in aerospace propulsion, where experiments can be prohibitively expensive. By using an existing densely sampled dataset, we identify the most informative experiments and use them to train the physics-based model. The remaining data are used for validation. We show that, although approximate, the proposed framework can significantly reduce the number of experiments required to perform the three inference tasks we have studied. For example, we show that for task (i), we can achieve an acceptable model fit using just 2.5% of the data that were originally collected.
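To give a flavour of what "identifying the most informative experiments" can mean in practice, the sketch below shows one standard criterion, greedy D-optimal selection for a linear-Gaussian model, rather than the metrics developed in the paper; the candidate matrix and budget are hypothetical.

```python
# Hedged sketch (not the paper's metrics): greedy D-optimal selection of
# experiments for a linear-Gaussian model y = X @ theta + noise, where the
# "most informative" experiments maximise log det of the Fisher information.
# The candidate set X and the budget are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))        # 200 candidate experiments, 3 parameters

def greedy_d_optimal(X, budget, ridge=1e-6):
    chosen = []
    info = ridge * np.eye(X.shape[1])       # regularised information matrix
    for _ in range(budget):
        gains = []
        for i in range(len(X)):
            if i in chosen:
                gains.append(-np.inf)
                continue
            xi = X[i:i + 1]
            _, logdet = np.linalg.slogdet(info + xi.T @ xi)
            gains.append(logdet)
        best = int(np.argmax(gains))
        chosen.append(best)
        info = info + X[best:best + 1].T @ X[best:best + 1]
    return chosen

print(greedy_d_optimal(X, budget=5))  # indices of the 5 most informative candidates
```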
Econometricians have usefully separated study of estimation into identification and statistical components. Identification analysis, which assumes knowledge of the probability distribution generating observable data, places an upper bound on what may be learned about population parameters of interest with finite-sample data. Yet Wald’s statistical decision theory studies decision-making with sample data without reference to identification, indeed without reference to estimation. This paper asks if identification analysis is useful to statistical decision theory. The answer is positive, as it can yield an informative and tractable upper bound on the achievable finite-sample performance of decision criteria. The reasoning is simple when the decision-relevant parameter (true state of nature) is point-identified. It is more delicate when the true state is partially identified and a decision must be made under ambiguity. Then the performance of some criteria, such as minimax regret, is enhanced by randomizing choice of an action in a controlled manner. I find it useful to recast choice of a statistical decision function as selection of choice probabilities for the elements of the choice set.
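The benefit of randomization under ambiguity can be seen in a stylized two-state, two-action example (our illustration, not one taken from the paper): each pure action has maximum regret 1, while a 50/50 randomization over the actions has maximum expected regret 0.5.

```python
# Stylized illustration (not from the paper): with two states and two actions
# whose regrets are R(a1) = (0, 1) and R(a2) = (1, 0) across the states,
# randomizing the action choice halves the maximum (worst-case) expected regret.
import numpy as np

regret = np.array([[0.0, 1.0],   # regret of action a1 in states s1, s2
                   [1.0, 0.0]])  # regret of action a2 in states s1, s2

def max_expected_regret(p_a1):
    mix = np.array([p_a1, 1.0 - p_a1])      # choice probabilities over actions
    return (mix @ regret).max()             # worst case over states

print(max_expected_regret(1.0))  # pure a1 -> 1.0
print(max_expected_regret(0.0))  # pure a2 -> 1.0
print(max_expected_regret(0.5))  # 50/50 mix -> 0.5 (minimax regret)
```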
This paper extends the spurious factor analysis of Onatski and Wang (2021, Spurious factor analysis. Econometrica, 89(2), 591–614) to high-dimensional data with heterogeneous local-to-unit roots. We find a spurious factor phenomenon similar to that observed in data with unit roots. Namely, the “factors” estimated by principal components analysis converge to principal eigenfunctions of a weighted average of the covariance kernels of demeaned Ornstein–Uhlenbeck processes with different decay rates. Thus, such “factors” reflect the structure of the strong temporal correlation of the data and do not correspond to any cross-sectional commonalities with which genuine factors are usually associated. Furthermore, the principal eigenvalues of the sample covariance matrix are very large relative to the other eigenvalues, creating an illusion of the “factors” capturing much of the data’s common variation. We conjecture that the spurious factor phenomenon holds, more generally, for data obtained from high-frequency sampling of heterogeneous continuous-time (or spatial) processes, and provide an illustration.
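A hedged illustration of the phenomenon (our sketch, not the paper's construction): independent autoregressive series with roots local to unity have no cross-sectional factor structure, yet the leading principal component of their sample covariance matrix absorbs a disproportionately large share of the variance.

```python
# Hedged illustration (not the paper's code): independent AR(1) series with
# roots local to unity have no cross-sectional factor structure, yet the top
# eigenvalue of the sample covariance matrix looks like a dominant "factor".
# All settings below are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
T, N = 200, 100
rhos = 1.0 - rng.uniform(0.5, 5.0, size=N) / T    # heterogeneous local-to-unit roots

X = np.zeros((T, N))
for t in range(1, T):
    X[t] = rhos * X[t - 1] + rng.normal(size=N)   # independent across series

Xc = X - X.mean(axis=0)                           # demean each series
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / T)[::-1]
share = eigvals[0] / eigvals.sum()
print("share of variance in the top eigenvalue:", round(float(share), 2))
# Much larger than 1/N, despite the complete absence of genuine common factors.
```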
Risk measurement and econometrics are the two pillars of actuarial science. Unlike econometrics, risk measurement allows decision-makers' risk aversion to be taken into account when analyzing risks. We propose a hybrid model that captures a regression-based approach to studying risks, focusing on explanatory variables while paying attention to risk severity. Our model accommodates different loss functions, provided by the risk manager or the actuary, that quantify the severity of losses. We present an explicit formula for the regression estimators in the proposed risk-based regression problem and study the resulting estimators. Finally, we provide a numerical study of the results using data from the insurance industry.
In this paper, we generalize the concept of functional dependence (FD) from time series (see Wu [2005, Proceedings of the National Academy of Sciences 102, 14150–14154]) and stationary random fields (see El Machkouri, Volný, and Wu [2013, Stochastic Processes and Their Applications 123, 1–14]) to nonstationary spatial processes. Within conventional settings in spatial econometrics, we define the concept of spatial FD measure and establish a moment inequality, an exponential inequality, a Nagaev-type inequality, a law of large numbers, and a central limit theorem. We show that the dependent variables generated by some common spatial econometric models, including spatial autoregressive (SAR) models, threshold SAR models, and spatial panel data models, are functionally dependent under regularity conditions. Furthermore, we investigate the properties of FD measures under various transformations, which are useful in applications. Moreover, we compare spatial FD with the spatial mixing and spatial near-epoch dependence proposed in Jenish and Prucha ([2009, Journal of Econometrics 150, 86–98], [2012, Journal of Econometrics 170, 178–190]), and we illustrate its advantages.
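For concreteness, the sketch below (ours, not from the paper) generates the dependent variable of a standard SAR model, $y = \rho W y + X\beta + \varepsilon$, on a hypothetical row-normalised weight matrix; it is this kind of spatially generated outcome that the FD framework is shown to cover.

```python
# Hedged sketch (not from the paper): generating data from a standard spatial
# autoregressive (SAR) model y = rho * W @ y + X @ beta + eps, i.e.
# y = (I - rho W)^{-1} (X beta + eps). W, rho, and beta are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
n = 50

# Row-normalised "nearest neighbour" weight matrix on a circle of n units.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

rho, beta = 0.4, np.array([1.0, -2.0])
X = rng.normal(size=(n, 2))
eps = rng.normal(size=n)

y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
print(y[:5])  # each y_i depends on its neighbours' outcomes through W
```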
Multivariate biomarker discovery is increasingly important in the realm of biomedical research, and is poised to become a crucial facet of personalized medicine. This will prompt demand for a myriad of novel biomarkers representing distinct 'omic' biosignatures, allowing treatments to be selected and tailored to the individual characteristics of a particular patient. This concise and self-contained book covers all aspects of predictive modeling for biomarker discovery based on high-dimensional data, as well as modern data science methods for the identification of parsimonious and robust multivariate biomarkers for medical diagnosis, prognosis, and personalized medicine. It provides a detailed description of state-of-the-art methods for parallel multivariate feature selection and supervised learning algorithms for regression and classification, as well as methods for proper validation of multivariate biomarkers and the predictive models implementing them. This is an invaluable resource for scientists and students interested in bioinformatics, data science, and related areas.
Edited by
R. A. Bailey, University of St Andrews, Scotland; Peter J. Cameron, University of St Andrews, Scotland; Yaokun Wu, Shanghai Jiao Tong University, China
We begin by illustrating the interplay between questions of scientific interest and the use of data in seeking answers. Graphs provide a window through which meaning can often be extracted from data. Numeric summary statistics and probability distributions provide a form of quantitative scaffolding for models of random as well as nonrandom variation. Simple regression models foreshadow the issues that arise in the more complex models considered later in the book. Frequentist and Bayesian approaches to statistical inference are contrasted, the latter primarily using the Bayes Factor to complement the limited perspective that p-values offer. Akaike Information Criterion (AIC) and related "information" statistics provide a further perspective. Resampling methods, where the one available dataset is used to provide an empirical substitute for a theoretical distribution, are introduced. Remaining topics are of a more general nature. RStudio is one of several tools that can help in organizing and managing work. The checks provided by independent replication at another time and place are an indispensable complement to statistical analysis. Questions of data quality, of relevance to the questions asked, of the processes that generated the data, and of generalization, remain just as important for machine learning and other new analysis approaches as for more classical methods.
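As a minimal illustration of the resampling idea (ours, not an example from the book, and in Python rather than the R environment the book works in), the bootstrap below uses the one available sample as an empirical substitute for the theoretical sampling distribution of the mean.

```python
# Minimal bootstrap sketch (illustrative, not from the book, and in Python
# rather than the book's R): resample the one available dataset with
# replacement to approximate the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=40)        # hypothetical sample

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5_000)
])

# Percentile bootstrap interval, an empirical substitute for a theoretical
# sampling distribution of the mean.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {data.mean():.2f}, 95% bootstrap interval ({low:.2f}, {high:.2f})")
```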
Multiple linear regression generalizes straight-line regression to allow multiple explanatory (or predictor) variables, treated in this chapter under the assumption of normal errors. The focus may be on accurate prediction. Or it may, alternatively or additionally, be on the regression coefficients themselves. Simplistic interpretations of coefficients can be grossly misleading. Later chapters elaborate on the ideas and methods developed in this chapter, applying them in new contexts. The attaching of causal interpretations to model coefficients must be justified both by reference to subject-area knowledge and by careful checks to ensure that they are not artefacts of the correlation structure. Attention is given to regression diagnostics and to the assessment and comparison of models. Variable selection strategies can readily over-fit, hence the importance of training/test approaches and cross-validation. The potential is demonstrated for errors in x to seriously bias regression coefficients. Strong multicollinearity leads to large variance inflation factors.
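The bias from errors in x can be demonstrated in a few lines; the sketch below is illustrative only (hypothetical data, and Python rather than the book's R) and shows the classical attenuation of a fitted slope when the predictor is measured with error.

```python
# Illustrative sketch (not from the book): measurement error in x attenuates
# the fitted slope towards zero. The true slope is 2; the noisy-x fit recovers
# roughly 2 * var(x) / (var(x) + var(error)). All settings are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
x_true = rng.normal(size=n)                        # var = 1
y = 2.0 * x_true + rng.normal(scale=0.5, size=n)

x_noisy = x_true + rng.normal(scale=1.0, size=n)   # measurement error, var = 1

slope_true = np.polyfit(x_true, y, 1)[0]
slope_noisy = np.polyfit(x_noisy, y, 1)[0]
print(f"slope with true x:  {slope_true:.2f}")   # ~2.0
print(f"slope with noisy x: {slope_noisy:.2f}")  # ~1.0, attenuated by factor 1/(1+1)
```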
This chapter moves from regression to methods that focus on the pattern presented by multiple variables, albeit with applications in regression analysis. A strong focus is on finding patterns that invite further investigation, and/or on replacing many variables with a much smaller number that capture the important structure in the data. Methodologies discussed include principal components analysis and, more generally, multidimensional scaling; cluster analysis (the exploratory process that groups “alike” observations) and dendrogram construction; and discriminant analysis. Two sections discuss issues for the analysis of data, such as from high-throughput genomics, where the aim is to determine which of perhaps thousands or tens of thousands of variables are shifted in value between groups in the data. A treatment of the role of balance and matching in making inferences from observational data then follows. The chapter ends with a brief introduction to methods for multiple imputation, which use multivariate relationships to fill in missing values in observations that are incomplete, allowing them to have at least some role in a regression or other further analysis.
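The sketch below (illustrative only, in Python rather than the book's R, with hypothetical two-group data) runs two of the workhorse methods named above: principal components analysis via the singular value decomposition, and hierarchical clustering of the kind summarised by a dendrogram.

```python
# Illustrative sketch (not from the book): principal components analysis and
# hierarchical (dendrogram-style) clustering on hypothetical data with two groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=0.0, size=(20, 5)),
               rng.normal(loc=3.0, size=(20, 5))])

# PCA via the SVD of the centred data: U * s gives the component scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print("share of variance on PC1:", round(float(s[0]**2 / (s**2).sum()), 2))

# Ward linkage builds the dendrogram; cutting it at two clusters recovers the groups.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first 20 observations in one cluster, last 20 in the other
```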