This chapter covers paired data, such as comparing responses before and after a treatment in one group of individuals. The sign test (also called the exact McNemar’s test when responses are binary) is compared to a median test on the differences of responses within pairs, and we show that the sign test is often more appropriate. We give confidence intervals compatible with the sign test. We discuss parameters associated with the Wilcoxon signed-rank test (often more powerful than the sign test) and the assumptions needed to give associated confidence intervals. When we can assume a symmetric distribution for the differences within pairs, the t-test is another option, and we discuss asymptotic relative efficiency for choosing between the t-test and the Wilcoxon signed-rank test. We compare parameterizing the treatment effect as differences or ratios. We discuss tests using Pearson’s and Spearman’s correlation and Kendall’s tau, and present confidence intervals assuming normality. When the paired data represent different assays or raters, agreement coefficients are needed (e.g., Cohen’s kappa or Lin’s concordance correlation coefficient).
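As a rough, hypothetical illustration of the paired-data choices above (not code from the chapter), the Python sketch below applies a sign test, via an exact binomial test on the signs of the within-pair differences, and a Wilcoxon signed-rank test to the same made-up before/after measurements using scipy:

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same 10 individuals.
before = np.array([12.1, 9.8, 11.5, 10.2, 13.0, 9.5, 10.8, 12.4, 11.1, 10.0])
after  = np.array([11.2, 9.9, 10.1, 9.6, 12.2, 9.0, 10.1, 11.5, 10.3, 9.4])
diff = after - before

# Sign test: exact binomial test on the number of positive differences
# (zeros are dropped), testing H0: P(after > before) = 0.5.
nonzero = diff[diff != 0]
sign_res = stats.binomtest(int(np.sum(nonzero > 0)), n=len(nonzero), p=0.5)
print("sign test p-value:", sign_res.pvalue)

# Wilcoxon signed-rank test on the same pairs (assumes it makes sense to
# rank the absolute differences; often more powerful than the sign test).
w_stat, w_p = stats.wilcoxon(after, before)
print("Wilcoxon signed-rank p-value:", w_p)
```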
This chapter defines statistical hypothesis tests mathematically. Those tests assume two sets of probability models, called the null and alternative hypotheses. A decision rule is a function that depends on the data and a specified α-level and determines whether or not to reject the null hypothesis. We define concepts related to properties of hypothesis tests such as Type I and II error rates, validity, size, power, invariance, and robustness. The definitions are general but are explained with examples such as testing a binomial parameter, or Wilcoxon–Mann–Whitney tests. P-values are defined as the smallest α-level for observed data for which we would reject the null at that level and all larger levels. Confidence sets and confidence intervals are defined in relation to a series of hypothesis tests with changing null hypotheses. Compatibility between p-value functions and confidence intervals is defined, and an example with Fisher’s exact test shows that compatibility is not always present for some common tests.
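A minimal numerical sketch of these ideas for the binomial example mentioned above (the data, 7 successes in 10 trials, are made up; Python with scipy is an arbitrary choice):

```python
from scipy import stats

# Hypothetical data: x = 7 successes in n = 10 Bernoulli trials.
x, n = 7, 10

# Test H0: p = 0.5 against the two-sided alternative (exact binomial test).
res = stats.binomtest(x, n, p=0.5)
print("two-sided p-value:", res.pvalue)

# A 95% Clopper-Pearson interval, obtained by inverting two one-sided
# exact tests: the set of null values p0 that are not rejected.
ci = res.proportion_ci(confidence_level=0.95)
print("95% CI for p:", ci.low, ci.high)
```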
This chapter deals with either clustering, where every individual within each cluster has the same treatment, or stratification, where there are individuals with different treatments within each stratum. For studies with clustering, we compare two individual-level analysis methods (generalized estimating equations and random effects models) and a cluster-level analysis (performing a t-test on the means from each cluster). We simulate cluster analyses when the effect is or is not related to cluster size. In the stratification context, we explore Simpson’s paradox, where the direction of the within-stratum effects differs from the direction of the overall effect. We show that the appropriate analysis of data consistent with Simpson’s paradox should adjust for the strata or not, depending on the study design. We discuss the stratification-adjusted tests of Mantel and Haenszel, van Elteren, and quasi-likelihood binomial or Poisson models. We compare meta-analysis using fixed effects or random effects (e.g., the DerSimonian–Laird method). Finally, we describe confidence intervals for directly standardized rates.
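The cluster-level analysis mentioned above can be sketched in a few lines; everything here (cluster counts, cluster sizes, and effect sizes) is hypothetical and only illustrates performing a t-test on cluster means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate a hypothetical cluster-randomized study: 6 clusters per arm,
# varying cluster sizes, and a cluster-level random effect.
def simulate_cluster_means(n_clusters, effect):
    means = []
    for _ in range(n_clusters):
        size = rng.integers(5, 30)               # cluster size
        cluster_effect = rng.normal(0, 0.5)      # random cluster effect
        y = rng.normal(effect + cluster_effect, 1.0, size)
        means.append(y.mean())
    return np.array(means)

control_means = simulate_cluster_means(6, effect=0.0)
treated_means = simulate_cluster_means(6, effect=0.7)

# Cluster-level analysis: a two-sample t-test on the cluster means.
t_stat, p_value = stats.ttest_ind(treated_means, control_means)
print("cluster-level t-test p-value:", p_value)
```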
This chapter first describes group sequential methods, where interim tests of a study are done and the study may be stopped either for efficacy (if a large enough early treatment effect is seen) or for futility (if it is unlikely that a treatment effect will be significant if the study goes to completion). We compare two methods for group sequential analysis with equally spaced looks, the Pocock and the O’Brien–Fleming methods, both based on the Brownian motion model. Flexible versions of these methods are developed using the alpha spending function approach, where the decision to perform an interim analysis may be based on information independent of the study up to that point. We discuss adjustments when the Brownian motion model assumption does not hold, and estimation and confidence intervals after stopping early. Next, we discuss the Bauer–Köhne and Proschan–Hunsberger two-stage adaptive methods, which bound the Type I error rate. These methods partition the study into two stages. The Stage 1 data allow three decisions: (1) stop and declare significance, (2) stop for futility, or (3) continue the study with a second-stage sample size based on the first-stage data.
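A small simulation under the null hypothesis illustrates why adjusted group sequential boundaries are needed: with K equally spaced looks and an unadjusted 1.96 boundary at every look, the chance of crossing at some look is well above 0.05. The settings below are hypothetical, and the sketch does not compute the Pocock or O’Brien–Fleming boundaries themselves:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5                  # number of equally spaced looks
n_per_stage = 20       # observations added between looks
n_sim = 20000

rejections = 0
for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, K * n_per_stage)   # data under H0: mean = 0
    crossed = False
    for k in range(1, K + 1):
        n_k = k * n_per_stage
        z_k = x[:n_k].mean() * np.sqrt(n_k)     # z-statistic at look k
        if abs(z_k) > 1.96:                     # unadjusted boundary
            crossed = True
            break
    rejections += crossed

print("estimated Type I error with unadjusted repeated looks:",
      rejections / n_sim)
```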
This chapter deals with studies with k groups. When the k groups are unordered, we use k-sample tests, and when the k groups are ordered, we use trend tests. For k-sample tests with categorical responses, we describe the chi-squared test and Fisher’s exact test (also called the Freeman–Halton test). For k-sample tests with numeric responses, we cover one-way ANOVA, the studentized range test, and the Kruskal–Wallis test. For the one-way ANOVA and studentized range tests, we give some associated effect parameters and their confidence intervals. For trend tests, we describe the Cochran–Armitage trend test for binary responses and the Jonckheere–Terpstra test for numeric responses. We discuss the familywise error rate when performing follow-up tests of pairwise comparisons in k-sample studies. When k = 3, after rejecting the k-sample test, subsequent pairwise tests may be done without correction, but otherwise (k > 3) corrections are necessary for strongly controlling the Type I error rate (e.g., for all pairwise comparisons: the Tukey–Kramer, Tukey–Welch, and studentized range procedures; or for many-to-one comparisons: Dunnett’s procedure).
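As a hypothetical sketch of the k-sample tests above (k = 3, made-up data), scipy provides the one-way ANOVA, Kruskal–Wallis, and chi-squared tests directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical numeric responses from k = 3 unordered groups.
g1 = rng.normal(10.0, 2.0, 15)
g2 = rng.normal(11.0, 2.0, 15)
g3 = rng.normal(12.5, 2.0, 15)

f_stat, p_anova = stats.f_oneway(g1, g2, g3)          # one-way ANOVA
h_stat, p_kw = stats.kruskal(g1, g2, g3)              # Kruskal-Wallis
print("ANOVA p:", p_anova, " Kruskal-Wallis p:", p_kw)

# Hypothetical categorical responses (rows = groups, columns = outcome).
table = np.array([[20, 10],
                  [15, 15],
                  [8, 22]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)  # chi-squared test
print("chi-squared p:", p_chi2)
```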
This chapter discusses censored time-to-event data. We review and define right-censored and interval-censored data and the common assumptions associated with them, focusing on standard cases where the independent censoring assumption holds. We define the Kaplan–Meier estimator, the nonparametric maximum likelihood estimator (NPMLE) of the survival distribution for right-censored data. We describe the beta product confidence procedure, which gives pointwise confidence intervals for it with better coverage than the standard Greenwood intervals. We describe the NPMLE of the survival distribution for interval-censored data using the EM algorithm. We compare the proportional hazards and proportional odds models. For both right- and interval-censored data, we describe the score tests from the proportional hazards or odds models and show that they are different forms of weighted logrank tests. We cover testing the difference in survival distributions at a specific point in time. We discuss issues with interpreting the proportional hazards model causally, showing that a model with proportional hazards at the individual level is not the same as the usual population-level proportional hazards model.
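A minimal sketch of the Kaplan–Meier estimator for right-censored data is given below (the times and censoring indicators are made up; it does not implement the beta product confidence procedure or the interval-censored NPMLE):

```python
import numpy as np

# Hypothetical right-censored data: time and event indicator
# (1 = event observed, 0 = censored).
time  = np.array([2.0, 3.5, 3.5, 5.0, 6.2, 7.0, 8.1, 9.0])
event = np.array([1,   1,   0,   1,   0,   1,   1,   0])

# Kaplan-Meier: at each distinct event time t, S(t) is multiplied by
# (1 - d_t / n_t), where d_t = events at t and n_t = number at risk.
order = np.argsort(time)
time, event = time[order], event[order]

surv = 1.0
print("time   S(t)")
for t in np.unique(time[event == 1]):
    n_at_risk = np.sum(time >= t)
    d = np.sum((time == t) & (event == 1))
    surv *= 1.0 - d / n_at_risk
    print(f"{t:5.1f}  {surv:.3f}")
```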
The chapter focuses on two-sample studies with binary responses, mostly on the case where each sample has an independent binomial response. We discuss three parameters of interest based on functions of the two binomial parameters: the difference, ratio, or odds ratio of the two parameters. The difference and odds ratio have symmetry equivariance, but the ratio does not. The odds ratio is useful for case-control studies. We compare two versions of the two-sided Fisher’s exact test and recommend the central one. We describe confidence intervals compatible with Fisher’s exact test using any of the three parameters of interest. Unconditional exact tests generally have more power than conditional ones, such as Fisher’s exact test, but are computationally more complicated. We recommend a modified Boschloo unconditional exact test, with associated confidence intervals, for good power. We discuss the Berger–Boos adjustment and mid-p methods. We compare several methods with respect to confidence interval coverage. We end with a different study design used with COVID-19 vaccines, where the total number of events is fixed in advance.
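As a hypothetical illustration, scipy’s fisher_exact gives the sample odds ratio and a two-sided conditional exact p-value (the ‘minimum-likelihood’ version); the central two-sided version recommended above can be formed from the two one-sided p-values. The 2x2 counts are made up, and the sketch does not cover the unconditional (e.g., Boschloo-type) tests:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = treatment arms, columns = response yes/no.
table = np.array([[12, 8],
                  [5, 15]])

# Sample odds ratio and scipy's default two-sided Fisher's exact p-value
# (the minimum-likelihood two-sided version).
odds_ratio, p_two_sided = stats.fisher_exact(table)
print("sample odds ratio:", odds_ratio)
print("two-sided (minlike) p:", p_two_sided)

# Central two-sided p-value: twice the smaller one-sided p, capped at 1.
_, p_less = stats.fisher_exact(table, alternative='less')
_, p_greater = stats.fisher_exact(table, alternative='greater')
p_central = min(1.0, 2.0 * min(p_less, p_greater))
print("central two-sided p:", p_central)
```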
In this chapter we begin with definitions of standard missing data assumptions: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MNAR, the probability that a response is missing may depend on the missing data values. For example, if the response is death, individuals drop out of the study when they are very sick, and we do not or cannot measure the variables that indicate which individuals are very sick, then the missingness is MNAR. In the MNAR case, we consider several sensitivity analysis methods: worst case imputation, opposite arm imputation, and tipping point analysis. The tipping point analysis changes the imputed missing data systematically until the inferential results change (e.g., from significant to not significant). In the MAR case, we consider, in a very simple setting, methods such as regression imputation and inverse probability weighted estimators. We simulate two scenarios: (1) when the MAR model is correctly specified, and (2) when the MAR model is misspecified. Finally, we briefly describe multiple imputation for missing data in a simple MAR scenario.
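The tipping point idea can be sketched with made-up data: impute all missing treated responses at a common value and shift that value downward until the two-sample t-test loses significance. This is only an illustrative Python sketch, not the chapter’s procedure in detail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical two-arm study with a continuous response; some treated
# responses are missing (np.nan).
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.8, 1.0, 40)
treated[rng.choice(40, size=8, replace=False)] = np.nan

observed = treated[~np.isnan(treated)]
n_missing = int(np.isnan(treated).sum())

# Tipping point analysis: impute all missing treated responses at a common
# value, shifting it downward until the t-test is no longer significant.
for shift in np.arange(0.0, -4.1, -0.5):
    imputed_value = observed.mean() + shift
    completed = np.concatenate([observed, np.full(n_missing, imputed_value)])
    _, p = stats.ttest_ind(completed, control)
    print(f"shift {shift:5.1f}  imputed value {imputed_value:6.2f}  p = {p:.4f}")
```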
This chapter addresses multiplicity in testing, the problem that if many hypotheses are tested then, unless some multiplicity adjustment is made, the probability of falsely rejecting at least one hypothesis can be unduly high. We define the familywise error rate (FWER), the probability of falsely rejecting at least one true null hypothesis in the family of hypotheses. We discuss which sets of hypotheses should be grouped into families. We define the false discovery rate (FDR). We describe simple adjustments based only on the p-values of the hypotheses in the family, such as the Bonferroni, Holm, and Hochberg procedures for FWER control and the Benjamini–Hochberg adjustment for FDR control. We discuss max-t type inferences for controlling the FWER in linear models, or other models with asymptotically normal estimators. We describe resampling-based multiplicity adjustments. We demonstrate graphical methods, showing, for example, gatekeeping and fallback methods, and allowing for more complicated methods. We briefly present logical constraints for hypotheses and the theoretically important closure method.
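As an illustration, the Holm (FWER) and Benjamini–Hochberg (FDR) adjusted p-values can be computed in a few lines of numpy; the p-values below are made up. (Packages such as statsmodels provide these adjustments as well.)

```python
import numpy as np

def holm_adjust(p):
    """Holm step-down adjusted p-values (controls the FWER)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(1.0, running_max)
    return adj

def bh_adjust(p):
    """Benjamini-Hochberg step-up adjusted p-values (controls the FDR)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_min = 1.0
    for rank in range(m - 1, -1, -1):          # from the largest p downward
        idx = order[rank]
        running_min = min(running_min, m * p[idx] / (rank + 1))
        adj[idx] = running_min
    return adj

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]     # hypothetical p-values
print("Holm:", np.round(holm_adjust(pvals), 3))
print("BH:  ", np.round(bh_adjust(pvals), 3))
```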
This chapter briefly covers many general methods for calculating p-values and confidence intervals. We discuss likelihood ratios for inferences with a one-dimensional parameter. Pivot functions are defined (e.g., the probability integral transformation). Basic results for normal and asymptotically normal inferences are given, such as some central limit theorems and the delta method. Three important likelihood-based asymptotic methods (the Wald, score, and likelihood ratio tests) are defined and compared. We describe the sandwich method for estimating variance, which requires fewer assumptions than the likelihood-based methods. General permutation tests are presented, along with implementation details including equivalent forms, the permutational central limit theorem, and Monte Carlo methods. The nonparametric bootstrap is described, as well as some bootstrap confidence interval methods such as the BCa method. We describe the melding method of combining two confidence intervals, which gives an automatically efficient way to calculate confidence intervals for differences or ratios of two parameters. Finally, we discuss within-cluster resampling.
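A Monte Carlo permutation test for a difference in means can be sketched as follows (two hypothetical samples; the ‘+1’ correction gives a valid Monte Carlo p-value):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-sample data.
x = rng.normal(0.0, 1.0, 12)
y = rng.normal(0.7, 1.0, 12)

observed = y.mean() - x.mean()
pooled = np.concatenate([x, y])
n_x = len(x)

# Monte Carlo permutation test: re-randomize group labels many times and
# compare the permuted differences in means to the observed difference.
n_perm = 10000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[n_x:].mean() - perm[:n_x].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Counting the observed statistic among the permutations keeps the
# Monte Carlo p-value valid.
p_value = (count + 1) / (n_perm + 1)
print("permutation p-value:", p_value)
```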
The chapter covers inferences with ordinal or numeric responses, with the focus on medians or means. We discuss choosing between the mean and the median for describing the central tendency. We give an exact test and an associated exact central confidence interval for the median that are applicable without making assumptions about the distribution. For the mean, we show that some restrictive assumptions on the class of distributions are needed; otherwise, valid tests on the mean are not possible. We discuss the one-sample t-test and how, under the normality assumption, it is the uniformly most powerful unbiased test. We show through some asymptotic results and simulations that with less restrictive assumptions the t-test can still be approximately valid. By simulation, we compare the t-test to some bootstrap inferential methods for the mean, suggesting that the bootstrap-t interval is slightly better for skewed data. We discuss making inferences on rate or count data after making either Poisson or overdispersed Poisson assumptions on the counts. Finally, we discuss testing the variance, standard deviation, or coefficient of variation under certain normality assumptions.
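The exact, distribution-free confidence interval for the median can be built from order statistics and the Binomial(n, 1/2) distribution; here is a hypothetical sketch in Python:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.sort(rng.exponential(scale=2.0, size=25))   # hypothetical skewed sample
n, alpha = len(x), 0.05

# Find the largest j with P(B <= j - 1) <= alpha/2 for B ~ Binomial(n, 1/2);
# then [x_(j), x_(n + 1 - j)] covers the median with probability >= 1 - alpha.
j = 1
while stats.binom.cdf(j, n, 0.5) <= alpha / 2:
    j += 1
lower, upper = x[j - 1], x[n - j]                  # 1-based order statistics
coverage = stats.binom.cdf(n - j, n, 0.5) - stats.binom.cdf(j - 1, n, 0.5)

print("sample median:", np.median(x))
print(f"exact CI for the median (coverage {coverage:.3f}): "
      f"[{lower:.3f}, {upper:.3f}]")
```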
This chapter first focuses on goodness-of-fit tests. A simple case is testing for normality (e.g., the Shapiro–Wilk test). We generally recommend against this because large sample sizes can find statistically significant differences even if those differences are not important, and vice versa. We show Q-Q plots as a graphical check of the magnitude of departures from normality. We discuss the Kolmogorov–Smirnov test for any difference between two distributions. We review goodness-of-fit tests for contingency tables (Pearson’s chi-squared test and Fisher’s exact test) and for logistic regression (the Hosmer–Lemeshow test). The rest of the chapter is devoted to equivalence and noninferiority tests. The margin of equivalence or noninferiority must be prespecified, and for noninferiority tests of a new drug against a standard, the margin should be smaller than the difference between the standard and placebo. We discuss the constancy assumption and biocreep. We note that while poor design (poor compliance, poor study population choice, poor measurement) generally decreases power in superiority designs, it can lead to inflated Type I error rates in noninferiority designs.
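As a hypothetical sketch, the code below runs a Shapiro–Wilk test, a two-sample Kolmogorov–Smirnov test, and an equivalence test of two means via two one-sided t-tests (TOST); the margin and data are purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(10.0, 2.0, 50)          # hypothetical responses, arm A
y = rng.normal(10.3, 2.0, 50)          # hypothetical responses, arm B

# Goodness of fit: Shapiro-Wilk test of normality for one sample, and the
# two-sample Kolmogorov-Smirnov test for any difference in distribution.
_, p_shapiro = stats.shapiro(x)
_, p_ks = stats.ks_2samp(x, y)
print("Shapiro-Wilk p:", p_shapiro, " two-sample KS p:", p_ks)

# Equivalence of means via two one-sided tests (TOST) with a prespecified
# margin of +/- 1.0; shifting y by the margin turns each one-sided test
# into an ordinary two-sample t-test.
margin = 1.0
_, p_upper = stats.ttest_ind(x, y + margin, alternative='less')
_, p_lower = stats.ttest_ind(x, y - margin, alternative='greater')
print("TOST equivalence p:", max(p_upper, p_lower))
```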
This chapter handles power and sample size estimation used for study design. A very flexible method to estimate power is simulation. Binomial confidence intervals can be used on the simulated power estimates to determine the number of simulations needed for the desired precision. To determine sample size by simulation, we introduce an algorithm, based on methods for dose-finding studies, that does not need a large number of replications at each different sample size tried. We present a general normal theory approximation to give approximate sample sizes for designs that use simple tests such as the two-sample t-test and the two-sample difference in binomial proportions test. We generalize to cases with unequal allocation between arms, or to more complicated tests such as the logrank test. We discuss modifications to sample sizes for nonadherence or missing data.
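A simulation-based power estimate with a binomial confidence interval for the Monte Carlo error might look like the following (the design settings and the use of a two-sample t-test are hypothetical choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def simulated_power(n_per_arm, effect, sd=1.0, alpha=0.05, n_sim=2000):
    """Estimate the power of a two-sample t-test by simulation."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, sd, n_per_arm)
        y = rng.normal(effect, sd, n_per_arm)
        if stats.ttest_ind(x, y).pvalue < alpha:
            rejections += 1
    # Binomial CI on the power estimate quantifies the simulation error.
    ci = stats.binomtest(rejections, n_sim).proportion_ci(confidence_level=0.95)
    return rejections / n_sim, (ci.low, ci.high)

power, ci = simulated_power(n_per_arm=64, effect=0.5)
print(f"estimated power: {power:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```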
The chapter begins with a cautionary story of a study on hydroxychloroquine treatment for COVID-19. We detail weaknesses in the study that led to misinterpretation of its results, emphasizing that there is more to the proper application of hypothesis tests than calculating a p-value. We then discuss reproducibility in science and the p-value controversy. Some argue that because p-values are often misunderstood and misused, they should be replaced with other statistics. We counter that point of view, arguing that frequentist hypothesis tests, when properly applied, are well suited to address reproducibility issues. This motivates the book, which provides guiding principles and tools for designing studies and properly applying hypothesis testing for many different scientific applications. We agree with many of the concerns about overreliance on p-values; hence the approach of the book is to present not just methods for hypothesis tests, but also methods for compatible confidence intervals on parameters that can accompany them. The chapter ends with an overview of the book, describing its level and intended audience.
This chapter gives a brief, mostly nonmathematical review of statistical hypothesis tests as used for making scientific inferences, with a focus on determining causality. We compare causation and association, reviewing the idea (popularized by Pearl) that probability models alone cannot determine causation, but that other aspects of the study design, such as randomization, are important for determining causality. We differentiate between observational and experimental studies, with the latter much better suited to making causal inferences. We review the example of hormone replacement therapy and its relationship to heart disease in women, where earlier observational studies suggested that the therapy reduced heart disease, while later a large randomized trial showed the opposite. We discuss general issues with designing studies, such as keeping a clear focus on the primary study question to avoid unintentionally exploring multiple hypotheses. We define and review validity, reliability, selection biases, placebo and Hawthorne effects, regression to the mean, blinding, dependence, intention-to-treat analyses, matching, and inverse propensity score weighting.