Decision-making in congenital cardiac care, although sometimes appearing simple, may prove challenging due to lack of data, uncertainty about outcomes, underlying heuristics, and potential biases in how we reach decisions. We report on the decision-making complexities and uncertainty in the management of five commonly encountered congenital cardiac problems: indications for and timing of treatment of subaortic stenosis, closure or observation of small ventricular septal defects, management of new-onset aortic regurgitation in ventricular septal defect, management of anomalous aortic origin of a coronary artery in an asymptomatic patient, and indications for operating on a single anomalously draining pulmonary vein. The strategy underpinning each lesion and the indications for and against intervention are outlined. Areas of uncertainty are clearly delineated. Even in the presence of “simple” congenital cardiac lesions, uncertainty exists in decision-making. Awareness and acceptance of uncertainty are first required to facilitate efforts at mitigation. Strategies to circumvent uncertainty in these scenarios include greater availability of evidence-based medicine, larger datasets, standardised clinical assessment and management protocols, and potentially the incorporation of artificial intelligence into the decision-making process.
This chapter gives a brief overview of Bayesian hypothesis testing. We first describe a standard Bayesian analysis of a single binomial response, going through the choice of prior distribution and explaining how the posterior is calculated. We then discuss Bayesian hypothesis testing using the Bayes factor, a measure of how much the posterior odds of believing in one hypothesis change from the prior odds. We show, using a binomial example, how the Bayes factor may be highly dependent on the prior distribution, even with extremely large sample sizes. We next discuss Bayesian hypothesis testing using decision theory, reviewing the intrinsic discrepancy of Bernardo, as well as the loss functions proposed by Freedman. Freedman’s loss functions allow the posterior belief in the null hypothesis to equal the p-value. We next discuss well-calibrated null preference priors, which, applied to parameters from the natural exponential family (binomial, negative binomial, Poisson, normal), also give the posterior belief in the null hypothesis equal to valid one-sided p-values, and give credible intervals equal to valid confidence intervals.
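As a rough illustration of the binomial analysis sketched above, the following Python snippet (the counts and the Beta(1, 1) prior are illustrative choices, not taken from the chapter) computes the Beta posterior for a binomial proportion and a Bayes factor for the point null H0: p = 0.5 against a uniform-prior alternative.

```python
import numpy as np
from scipy.stats import beta
from scipy.special import betaln

# Illustrative data (not from the chapter): x successes out of n trials.
x, n = 14, 20
a, b = 1.0, 1.0                      # Beta(1, 1) prior under the alternative

# Posterior for the binomial proportion p is Beta(a + x, b + n - x).
posterior = beta(a + x, b + n - x)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.ppf([0.025, 0.975]))

# Bayes factor for the point null H0: p = 0.5 against H1: p ~ Beta(a, b).
# The binomial coefficient cancels between the two marginal likelihoods.
log_m0 = n * np.log(0.5)                              # likelihood under H0
log_m1 = betaln(a + x, b + n - x) - betaln(a, b)      # beta-binomial marginal under H1
bf01 = np.exp(log_m0 - log_m1)
print("Bayes factor BF01 (evidence for H0 over H1):", bf01)

# Posterior probability of H0 when the prior odds are 1.
print("P(H0 | data):", bf01 / (1.0 + bf01))
```

Because the marginal likelihood under H1 integrates over the prior, changing the Beta(a, b) prior changes the Bayes factor, which is the prior sensitivity the chapter warns about.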
The chapter addresses testing when using models. We review linear models, generalized linear models, and proportional odds models, including issues such as checking model assumptions and separation (e.g., when one covariate completely predicts a binary response). We discuss the Neyman–Scott problem, that is, when bias for a fixed parameter estimate can result when the number of nuisance parameters grows with the sample size. With clustered data, we compare mixed effects models and marginal models, pointing out that for logistic regression and other models the fixed effect estimands are different in the two types of models. We present simulations showing that testing many effects in a model may be interpreted as a multiple testing situation, and adjustments should often be made in that case. We discuss model selection using methods such as Akaike’s information criterion, the lasso, and cross-validation. We compare different model selection processes and their effect on the Type I error rate for a parameter from the final chosen model.
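The following sketch, assuming a statsmodels-based workflow and simulated data (both illustrative, not the chapter's own example), fits a logistic regression, reads off the Wald tests for each coefficient, and compares two models by AIC; if a covariate perfectly predicted the binary response, the same fit would fail to converge, which is the separation problem mentioned above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.binomial(1, 0.5, size=n)
# Illustrative logistic model: only x1 truly affects the response.
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x1 + 0.0 * x2)))
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit(disp=False)
print(fit.summary())          # Wald tests for each coefficient
print("AIC:", fit.aic)        # model selection criterion discussed in the chapter

# A smaller model without x2, for comparison by AIC.
fit0 = sm.Logit(y, sm.add_constant(x1)).fit(disp=False)
print("AIC without x2:", fit0.aic)
```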
This chapter covers paired data, such as comparing responses before and after a treatment in one group of individuals. The sign test (also called the exact McNemar’s test when responses are binary) is compared to a median test on the differences of responses within pairs, and we show that the sign test is often more appropriate. We give confidence intervals compatible with the sign test. We discuss parameters associated with the Wilcoxon signed-rank test (often more powerful than the sign test) and the assumptions needed to give associated confidence intervals. When we can assume a symmetric distribution for the differences within pairs, the t-test is another option, and we discuss asymptotic relative efficiency for choosing between the t-test and the Wilcoxon signed-rank test. We compare parameterizing the treatment effect as differences or ratios. We discuss tests using Pearson’s and Spearman’s correlation and Kendall’s tau, and present confidence intervals assuming normality. When the paired data represent different assays or raters, agreement coefficients are needed (e.g., Cohen’s kappa or Lin’s concordance correlation coefficient).
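A minimal Python sketch of the paired analyses discussed above, with simulated before/after data (illustrative only): the sign test as an exact binomial test on the signs of the within-pair differences, the Wilcoxon signed-rank test, and the paired t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(10.0, 2.0, size=30)
after = before + rng.normal(0.5, 1.0, size=30)    # illustrative treatment shift
d = after - before

# Sign test: exact binomial test on the number of positive differences
# (ties, if any, are dropped).
pos, nonzero = int(np.sum(d > 0)), int(np.sum(d != 0))
print("sign test p-value:", stats.binomtest(pos, nonzero, p=0.5).pvalue)

# Wilcoxon signed-rank test (its usual interpretation assumes a symmetric
# distribution of the differences).
print("Wilcoxon signed-rank:", stats.wilcoxon(d))

# Paired t-test on the same differences.
print("paired t-test:", stats.ttest_rel(after, before))
```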
This chapter defines statistical hypothesis tests mathematically. Those tests assume two sets of probability models, called the null and alternative hypotheses. A decision rule is a function that depends on the data and a specified α-level and determines whether or not to reject the null hypothesis. We define concepts related to properties of hypothesis tests such as Type I and II error rates, validity, size, power, invariance, and robustness. The definitions are general but are explained with examples such as testing a binomial parameter or Wilcoxon–Mann–Whitney tests. P-values are defined as the smallest α-level for the observed data for which we would reject the null at that level and all larger levels. Confidence sets and confidence intervals are defined in relation to a series of hypothesis tests with changing null hypotheses. Compatibility between p-value functions and confidence intervals is defined, and an example with Fisher’s exact test shows that compatibility is not always present for some common tests.
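To make the test/confidence-interval duality concrete, here is a small scipy-based sketch for a binomial parameter (the counts and null value are illustrative): the central two-sided exact p-value is twice the smaller one-sided p-value, and the Clopper–Pearson interval inverts that central test.

```python
from scipy.stats import binomtest

x, n, p0 = 7, 10, 0.5      # illustrative data and null value

# Central two-sided p-value: twice the smaller one-sided p-value (capped at 1).
p_less = binomtest(x, n, p=p0, alternative="less").pvalue
p_greater = binomtest(x, n, p=p0, alternative="greater").pvalue
p_central = min(1.0, 2.0 * min(p_less, p_greater))
print("central two-sided p-value:", p_central)

# Clopper-Pearson ("exact") interval; it inverts the central test, so p0 falls
# outside the 95% interval when p_central <= 0.05, illustrating the duality
# between a series of tests and a confidence interval.
ci = binomtest(x, n, p=p0).proportion_ci(confidence_level=0.95, method="exact")
print("95% CI:", (ci.low, ci.high))
```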
This chapter deals with either clustering, where every individual within each cluster has the same treatment, or stratification, where there are individuals with different treatments within each stratum. For studies with clustering, we compare two individual-level analysis methods (generalized estimating equations and random effects models) and a cluster-level analysis (performing a t-test on the means from each cluster). We simulate cluster analyses when the effect is or is not related to cluster size. In the stratification context, we explore Simpson’s paradox, where the direction of the within-stratum effects is different from the direction of the overall effect. We show that the appropriate analysis of data consistent with Simpson’s paradox should adjust for the strata or not, depending on the study design. We discuss the stratification-adjusted tests of Mantel and Haenszel, van Elteren, and quasi-likelihood binomial or Poisson models. We compare meta-analysis using fixed effects or random effects (e.g., the DerSimonian–Laird method). Finally, we describe confidence intervals for directly standardized rates.
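The following numpy-only sketch constructs an artificial example of Simpson’s paradox (the tables are invented for illustration) and then computes the Mantel–Haenszel common odds ratio, which adjusts for the strata.

```python
import numpy as np

# Illustrative 2x2 tables for two strata, chosen so the treatment looks better
# within each stratum but worse when the strata are pooled.
# Each table is [[trt success, trt failure],
#                [ctl success, ctl failure]].
strata = [np.array([[9, 1], [85, 15]]),     # stratum 1 (mild cases)
          np.array([[30, 70], [2, 8]])]     # stratum 2 (severe cases)

for k, t in enumerate(strata, 1):
    p_trt = t[0, 0] / t[0].sum()
    p_ctl = t[1, 0] / t[1].sum()
    print(f"stratum {k}: trt {p_trt:.2f} vs ctl {p_ctl:.2f}")

pooled = sum(strata)
print("pooled: trt %.2f vs ctl %.2f" %
      (pooled[0, 0] / pooled[0].sum(), pooled[1, 0] / pooled[1].sum()))

# Mantel-Haenszel common odds ratio across strata.
num = sum(t[0, 0] * t[1, 1] / t.sum() for t in strata)
den = sum(t[0, 1] * t[1, 0] / t.sum() for t in strata)
print("Mantel-Haenszel odds ratio:", num / den)
```

Within each stratum the treatment looks better, the pooled table reverses the direction, and the Mantel–Haenszel estimate recovers the stratified comparison.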
This chapter first describes group sequential methods, where interim tests of a study are done and the study may be stopped either for efficacy (if a large enough early treatment effect is seen) or for futility (if it is unlikely that a treatment effect will be significant if the study goes to completion). We compare two methods for group sequential analysis with equally spaced looks, the Pocock and the O’Brien–Fleming methods, both based on the Brownian motion model. Flexible versions of these methods are developed using the alpha spending function approach, where the decision to perform an interim analysis may be based on information independent of the study up to that point. We discuss adjustments when the Brownian motion model assumption does not hold, and estimation and confidence intervals after stopping early. Next, we discuss the Bauer–Köhne and Proschan–Hunsberger two-stage adaptive methods, which bound the Type I error rate. These methods partition the study into two stages. The Stage 1 data allow three decisions: (1) stop and declare significance, (2) stop for futility, or (3) continue the study, with the sample size for the second stage based on the first-stage data.
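As a sketch of the alpha spending approach under the Brownian motion model (a Lan–DeMets-style O’Brien–Fleming-type spending function with four equally spaced looks; the one-sided level, look times, and scipy-based numerical scheme are illustrative assumptions, not the chapter’s own code), the following computes the cumulative alpha spent and the corresponding efficacy boundaries on the Z scale.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

alpha = 0.025                          # one-sided level (illustrative)
t = np.array([0.25, 0.50, 0.75, 1.0])  # equally spaced information fractions

def obf_spend(t, alpha):
    # Lan-DeMets O'Brien-Fleming-type spending function.
    return 2.0 * norm.sf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t))

spent = obf_spend(t, alpha)
inc = np.diff(np.concatenate(([0.0], spent)))   # alpha spent at each look

# Under the Brownian motion model, corr(Z_i, Z_j) = sqrt(t_i / t_j) for t_i <= t_j.
def corr(tt):
    k = len(tt)
    return np.array([[np.sqrt(min(tt[i], tt[j]) / max(tt[i], tt[j]))
                      for j in range(k)] for i in range(k)])

bounds = [norm.ppf(1.0 - inc[0])]
for k in range(2, len(t) + 1):
    R = corr(t[:k])
    def f(c):
        # P(no earlier crossing) - P(no crossing through look k) - alpha spent at look k
        p_all = multivariate_normal.cdf(np.array(bounds + [c]),
                                        mean=np.zeros(k), cov=R)
        return (1.0 - spent[k - 2]) - p_all - inc[k - 1]
    bounds.append(brentq(f, 0.0, 10.0))

print("cumulative alpha spent:", np.round(spent, 5))
print("Z-scale efficacy bounds:", np.round(bounds, 3))
```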
This chapter deals with studies with k groups. When the k groups are unordered, we use k-sample tests, and when the k groups are ordered, we use trend tests. For k-sample tests with categorical responses, we describe the chi-squared test and Fisher’s exact test (also called the Freeman–Halton test). For k-sample tests with numeric responses, we cover one-way ANOVA, the studentized range test, and the Kruskal–Wallis test. For the one-way ANOVA and studentized range tests, we give some associated effect parameters and their confidence intervals. For trend tests, we describe the Cochran–Armitage trend test for binary responses and the Jonckheere–Terpstra test for numeric responses. We discuss the familywise error rate when performing follow-up tests of pairwise comparisons in k-sample studies. When k = 3, after rejecting the k-sample test, subsequent pairwise tests may be done without correction, but otherwise (k > 3) corrections are necessary for strongly controlling the Type I error rate (e.g., for all pairwise comparisons: the Tukey–Kramer, Tukey–Welch, and studentized range procedures; or for many-to-one comparisons: Dunnett’s procedure).
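A brief scipy-based sketch of some of the k-sample tests above, using simulated data for k = 3 groups (illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Illustrative numeric responses in k = 3 unordered groups.
g1 = rng.normal(0.0, 1.0, 25)
g2 = rng.normal(0.3, 1.0, 25)
g3 = rng.normal(0.8, 1.0, 25)

print("one-way ANOVA:", stats.f_oneway(g1, g2, g3))
print("Kruskal-Wallis:", stats.kruskal(g1, g2, g3))

# Categorical responses in k groups: chi-squared test on the k x 2 table.
table = np.array([[18, 7],
                  [14, 11],
                  [9, 16]])
chi2, p, dof, _ = stats.chi2_contingency(table)
print("chi-squared test:", chi2, p, dof)
```

Follow-up pairwise comparisons of the three groups would then need the familywise error considerations described in the paragraph above.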
This chapter discusses censored time-to-event data. We review and define right-censored and interval-censored data and common assumptions associated with them, focusing on standard cases when the independent censoring assumption holds. We define the Kaplan–Meier estimator, the nonparametric maximum likelihood estimator (NPMLE) of the survival distribution for right-censored data. We describe the beta product confidence procedure, which gives pointwise confidence intervals for it with better coverage than the standard Greenwood intervals. We describe the NPMLE of the survival distribution for interval-censored data using the E-M algorithm. We compare the proportional hazards and proportional odds models. For both right- and interval-censored data, we describe the score tests from the proportional hazards or odds models and show that they are different forms of weighted logrank tests. We cover testing the difference in survival distributions at a specific point in time. We discuss issues with interpreting the proportional hazards model causally, showing that a model with proportional hazards at the individual level is not the same as the usual population-level proportional hazards model.
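To make the Kaplan–Meier estimator concrete, here is a small hand-rolled numpy version for right-censored data (the data and the helper name kaplan_meier are purely illustrative), computing the product of (1 − d_i/n_i) over the distinct event times, where d_i is the number of events and n_i the number at risk.

```python
import numpy as np

# Illustrative right-censored data: time and an event indicator
# (1 = event observed, 0 = censored).
time = np.array([5, 8, 8, 12, 15, 17, 21, 21, 25, 30], dtype=float)
event = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

def kaplan_meier(time, event):
    """Kaplan-Meier estimate S(t) at each distinct event time."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):
        n_at_risk = np.sum(time >= t)            # still at risk just before t
        d = np.sum((time == t) & (event == 1))   # events at t
        s *= 1.0 - d / n_at_risk
        surv.append((t, s))
    return surv

for t, s in kaplan_meier(time, event):
    print(f"t = {t:4.0f}   S(t) = {s:.3f}")
```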
The chapter focuses on two-sample studies with binary responses, mostly on the case where each sample has an independent binomial response. We discuss three parameters of interest that are functions of the two binomial parameters: the difference, ratio, or odds ratio of the two parameters. The difference and odds ratio have symmetry equivariance, but the ratio does not. The odds ratio is useful for case-control studies. We compare two versions of the two-sided Fisher’s exact test and recommend the central one. We describe confidence intervals compatible with Fisher’s exact test using any of the three parameters of interest. Unconditional exact tests generally have more power than conditional ones, such as Fisher’s exact test, but are computationally more complicated. We recommend a modified Boschloo unconditional exact test with associated confidence intervals, which has good power. We discuss the Berger–Boos adjustment and mid-p methods. We compare several methods with respect to confidence interval coverage. We end with a different study design used with COVID-19 vaccines, where the total number of events is fixed in advance.
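A short sketch of the two-sample binary comparisons above (the table is illustrative; scipy’s boschloo_exact is used as a stand-in for an unconditional exact test and is not necessarily the modified Boschloo procedure recommended in the chapter):

```python
import numpy as np
from scipy.stats import fisher_exact, boschloo_exact

# Illustrative 2x2 table: rows are groups, columns are (success, failure).
table = np.array([[7, 13],
                  [15, 5]])

# Central two-sided Fisher's exact p-value: twice the smaller one-sided p-value.
p_less = fisher_exact(table, alternative="less")[1]
p_greater = fisher_exact(table, alternative="greater")[1]
p_central = min(1.0, 2.0 * min(p_less, p_greater))
print("central Fisher's exact p-value:", p_central)
print("default (minimum-likelihood) two-sided p-value:",
      fisher_exact(table, alternative="two-sided")[1])

# An unconditional exact test (Boschloo's test), typically more powerful
# than the conditional Fisher's exact test.
res = boschloo_exact(table, alternative="two-sided")
print("Boschloo p-value:", res.pvalue)
```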
In this chapter we begin with definitions of standard missing data assumptions: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MNAR, the probability that a response is missing may depend on the missing data values. For example, if the response is death, individuals drop out of the study when they are very sick, and we do not or cannot measure the variables that indicate which individuals are very sick, then the data are MNAR. In the MNAR case, we consider several sensitivity analysis methods: worst case imputation, opposite arm imputation, and tipping point analysis. The tipping point analysis changes the imputed missing data systematically until the inferential results change (e.g., from significant to not significant). In the MAR case, we consider, in a very simple setting, methods such as regression imputation and inverse probability weighted estimators. We simulate two scenarios: (1) when the MAR model is correctly specified, and (2) when the MAR model is misspecified. Finally, we briefly describe multiple imputation for missing data in a simple MAR scenario.
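The following numpy simulation sketches the MAR setting described above (the data-generating model and observation probabilities are invented for illustration): the complete-case mean is biased because missingness depends on a covariate related to the response, while an inverse probability weighted estimator approximately recovers the true mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Illustrative MAR setup: a binary covariate x is always observed, the response
# y depends on x, and the probability that y is missing depends only on x.
x = rng.binomial(1, 0.5, n)
y = rng.normal(1.0 + 2.0 * x, 1.0)            # true mean of y is 2.0
p_obs = np.where(x == 1, 0.9, 0.3)            # observation probabilities (MAR)
r = rng.binomial(1, p_obs)                    # r = 1 if y is observed

print("true mean:", y.mean())
print("complete-case mean (biased under MAR):", y[r == 1].mean())

# Inverse probability weighting: estimate P(observed | x) from the data, then
# weight each observed y by 1 / P(observed | x). In a real analysis the
# unobserved y would be missing; here the weight is zero for those rows.
p_hat = np.array([r[x == 0].mean(), r[x == 1].mean()])[x]
w = r / p_hat
print("IPW estimate:", np.sum(w * y) / np.sum(w))
```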
This chapter addresses multiplicity in testing, the problem that if many hypotheses are tested then, unless some multiplicity adjustment is made, the probability of falsely rejecting at least one hypothesis can be unduly high. We define the familywise error rate (FWER), the probability that at least one true null hypothesis in the family of hypotheses is rejected. We discuss which sets of hypotheses should be grouped into families. We define the false discovery rate (FDR). We describe simple adjustments based only on the p-values of the hypotheses in the family, such as the Bonferroni, Holm, and Hochberg procedures for FWER control and the Benjamini–Hochberg adjustment for FDR control. We discuss max-t type inferences for controlling the FWER in linear models, or other models with asymptotically normal estimators. We describe resampling-based multiplicity adjustments. We demonstrate graphical methods, showing, for example, gatekeeping and fallback methods, and allowing for more complicated procedures. We briefly present logical constraints for hypotheses and the theoretically important closure method.
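For the p-value-based adjustments mentioned above, a short sketch using statsmodels (the p-values are invented for illustration; 'simes-hochberg' is statsmodels' name for the Hochberg step-up procedure):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values for a family of five hypotheses.
pvals = np.array([0.001, 0.012, 0.034, 0.045, 0.210])

for method in ["bonferroni", "holm", "simes-hochberg", "fdr_bh"]:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:14s} adjusted p-values: {np.round(p_adj, 4)}  reject: {reject}")
```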
This chapter briefly covers many general methods for calculating p-values and confidence intervals. We discuss likelihood ratios for inferences with a one-dimensional parameter. Pivot functions are defined (e.g., the probability integral transformation). Basic results for normal and asymptotically normal inferences are given, such as some central limit theorems and the delta method. Three important likelihood-based asymptotic methods (the Wald, score, and likelihood ratio tests) are defined and compared. We describe the sandwich method for estimating variance, which requires fewer assumptions than the likelihood-based methods. General permutation tests are presented, along with implementation details including equivalent forms, the permutational central limit theorem, and Monte Carlo methods. The nonparametric bootstrap is described, as well as some bootstrap confidence interval methods such as the BCa method. We describe the melding method of combining two confidence intervals, which gives an automatically efficient way to calculate confidence intervals for differences or ratios of two parameters. Finally, we discuss within-cluster resampling.
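Two of the resampling methods above, a Monte Carlo permutation test and a nonparametric percentile bootstrap interval, can be sketched directly in numpy (simulated data and illustrative settings, not the chapter's own examples):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 20)
y = rng.normal(0.7, 1.0, 20)       # illustrative two-sample data
obs = y.mean() - x.mean()

# Monte Carlo permutation test for the difference in means.
pooled = np.concatenate([x, y])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    stat = perm[20:].mean() - perm[:20].mean()
    if abs(stat) >= abs(obs):
        count += 1
print("permutation p-value:", (count + 1) / (n_perm + 1))

# Nonparametric bootstrap percentile interval for the difference in means.
boot = np.array([rng.choice(y, 20).mean() - rng.choice(x, 20).mean()
                 for _ in range(10_000)])
print("95% percentile bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```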
The chapter covers inferences with ordinal or numeric responses, with the focus on medians or means. We discuss choosing between the mean and median for describing the central tendency. We give an exact test and an associated exact central confidence interval for the median that are applicable without making assumptions about the distribution. For the mean, we show the need for some restrictive assumptions on the class of distributions, since otherwise valid tests of the mean are not possible. We discuss the one-sample t-test, and how, under the normality assumption, it is the uniformly most powerful unbiased test. We show through some asymptotic results and simulations that with less restrictive assumptions the t-test can still be approximately valid. By simulation, we compare the t-test to some bootstrap inferential methods for the mean, suggesting that the bootstrap-t interval is slightly better for skewed data. We discuss making inferences on rate or count data after making either Poisson or overdispersed Poisson assumptions on the counts. Finally, we discuss testing the variance, standard deviation, or coefficient of variation under certain normality assumptions.
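A sketch of a distribution-free confidence interval for the median based on order statistics, together with a one-sample t-test for the mean (the skewed simulated data and null value are illustrative; the choice of order statistics follows the usual binomial argument):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.sort(rng.exponential(2.0, 25))      # illustrative skewed sample
n = len(x)

# Distribution-free confidence interval for the median from order statistics:
# choose the largest j with P(Bin(n, 0.5) <= j - 1) <= alpha / 2, so that
# (x_(j), x_(n + 1 - j)) has coverage of at least 1 - alpha.
alpha = 0.05
j = int(stats.binom.ppf(alpha / 2.0, n, 0.5))
while stats.binom.cdf(j - 1, n, 0.5) > alpha / 2.0:
    j -= 1
j = max(j, 1)
coverage = 1.0 - 2.0 * stats.binom.cdf(j - 1, n, 0.5)
print("median:", np.median(x))
print(f"CI ({x[j - 1]:.2f}, {x[n - j]:.2f}) with coverage >= {coverage:.3f}")

# One-sample t-test for the mean (relies on approximate normality of the mean).
print(stats.ttest_1samp(x, popmean=2.0))
```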
This chapter first focuses on goodness-of-fit tests. A simple case is testing for normality (e.g., the Shapiro–Wilk test). We generally recommend against this because large sample sizes can find statistically significant differences even if those differences are not important, and vice versa. We show Q-Q plots as a graphical check of how large the departures from normality are. We discuss the Kolmogorov–Smirnov test for any difference between two distributions. We review goodness-of-fit tests for contingency tables (Pearson’s chi-squared test and Fisher’s exact test) and for logistic regression (the Hosmer–Lemeshow test). The rest of the chapter is devoted to equivalence and noninferiority tests. The margin of equivalence or noninferiority must be prespecified, and for noninferiority tests of a new drug against a standard, the margin should be smaller than the difference between the standard and the placebo. We discuss the constancy assumption and biocreep. We note that while poor design (poor compliance, poor study population choice, poor measurement) generally decreases power in superiority designs, it can lead to high Type I error rates in noninferiority designs.
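As a sketch of a noninferiority comparison of two proportions (the margin, counts, and the simple Wald-type z statistic are illustrative assumptions, not the chapter’s recommended procedure):

```python
import numpy as np
from scipy.stats import norm

# Illustrative noninferiority comparison of success proportions, with a
# prespecified margin of 0.10: H0: p_new - p_std <= -0.10.
margin = 0.10
x_new, n_new = 78, 100
x_std, n_std = 82, 100
p_new, p_std = x_new / n_new, x_std / n_std

se = np.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
z = (p_new - p_std + margin) / se
p_value = norm.sf(z)                      # one-sided p-value
print("z =", round(z, 3), " one-sided p =", round(p_value, 4))

# A compatible one-sided lower confidence limit for p_new - p_std:
# noninferiority is concluded if the lower limit exceeds -margin.
lower = (p_new - p_std) - norm.ppf(0.975) * se
print("lower 97.5% limit for the difference:", round(lower, 3))
```

Here noninferiority would be claimed only if the one-sided p-value were below 0.025, or equivalently if the lower confidence limit for the difference exceeded −0.10.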