The criteria for evaluating research studies often include large sample size. It is assumed that studies with large sample sizes are more meaningful than those with fewer participants. This chapter explores biases associated with the traditional application of null hypothesis testing. Statisticians now challenge the idea that retention of the null hypothesis signifies that a treatment is not effective. A finding associated with an exact probability value of p = 0.049 is not meaningfully different from one in which p = 0.051. Yet the interpretation of these two studies can be dramatically different, including the likelihood of publication. Large studies are not necessarily more accurate or less biased. In fact, biases in sampling strategy are amplified in studies with large sample sizes. These problems are of increasing concern in the era of big data and the analysis of electronic health records. Studies that are overpowered (because of very large sample sizes) are capable of identifying statistically significant differences that are of no clinical importance.
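The overpowering problem is easy to demonstrate. The simulation below is illustrative and not from the chapter; the true effect of 0.01 standard deviations is an arbitrary, clinically negligible value, yet with a million observations per group a t-test flags it as highly significant.

```python
import numpy as np
from scipy import stats

# Illustrative simulation (not from the chapter): a true difference of
# 0.01 SD is clinically negligible, yet with a million observations per
# group a two-sample t-test declares it highly significant.
rng = np.random.default_rng(42)
n = 1_000_000
control = rng.normal(0.00, 1.0, n)
treated = rng.normal(0.01, 1.0, n)  # assumed true effect: 0.01 SD
t, p = stats.ttest_ind(control, treated)
effect = treated.mean() - control.mean()
print(f"p = {p:.1e}, observed effect = {effect:.4f} SD")
```

The p-value is tiny while the estimated effect stays near 0.01 SD, which is the chapter's point: statistical significance alone says nothing about clinical importance.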
Interval estimates of the Pearson, Kendall tau-a and Spearman correlations are reviewed and an improved standard error for the Spearman correlation is proposed. The sample size required to yield a confidence interval having the desired width is examined. A two-stage approximation to the sample size requirement is shown to give accurate results.
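The paper's improved standard error and two-stage procedure are not reproduced here, but the familiar first-stage approximation based on the Fisher z transform can be sketched as follows; the function name and planning values are illustrative.

```python
import math
from scipy import stats

def n_for_ci_width(r0, width, conf=0.95):
    """First-stage approximation (illustrative, not the paper's improved
    estimator): sample size so the Fisher-z confidence interval for a
    Pearson correlation near a planning value r0 has the desired width."""
    z_crit = stats.norm.ppf(0.5 + conf / 2)
    # On the Fisher-z scale the SE is 1/sqrt(n - 3); locally dr/dz = 1 - r0^2,
    # so a half-width of width/2 on the r scale maps to (width/2)/(1 - r0^2).
    half_z = (width / 2) / (1 - r0**2)
    return math.ceil((z_crit / half_z) ** 2 + 3)

# e.g. a 95% CI of total width 0.2 around a planning value of r0 = 0.5
print(n_for_ci_width(0.5, 0.2))
```

A second stage would recompute the requirement using the correlation estimated from the first-stage sample, which is the refinement the paper shows to be accurate.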
The analysis of covariance (ANCOVA) has notably proven to be an effective tool in a broad range of scientific applications. Despite the well-documented literature about its principal uses and statistical properties, the corresponding power analysis for the general linear hypothesis tests of treatment differences remains a less discussed issue. The frequently recommended procedure is a direct application of the ANOVA formula in combination with reduced degrees of freedom and a correlation-adjusted variance. This article aims to explicate the conceptual problems and practical limitations of the common method. An exact approach is proposed for power and sample size calculations in ANCOVA with random assignment and multinormal covariates. Both theoretical examination and numerical simulation are presented to justify the advantages of the suggested technique over the current formula. The improved solution is illustrated with an example regarding the comparative effectiveness of interventions. In order to facilitate the application of the described power and sample size calculations, accompanying computer programs are also presented.
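For context, the "frequently recommended procedure" that the article critiques can be sketched as follows: the ANOVA effect size is inflated by 1/sqrt(1 - ρ²) (equivalently, the error variance is shrunk by 1 - ρ²) and one error degree of freedom is subtracted for the covariate. The function name and inputs are illustrative, and the article's exact approach is not reproduced here.

```python
import math
from scipy import stats

def ancova_n_common(f, k_groups, rho, alpha=0.05, power=0.8):
    """The commonly recommended approximation the article critiques:
    ANOVA sample size with effect size f inflated by 1/sqrt(1 - rho^2)
    and one error df subtracted for a single covariate."""
    adj_f = f / math.sqrt(1 - rho**2)
    df1 = k_groups - 1
    n = k_groups + 2                      # smallest n with positive error df
    while True:
        df2 = n - k_groups - 1            # one df lost to the covariate
        crit = stats.f.ppf(1 - alpha, df1, df2)
        # noncentrality parameter grows linearly in total n
        if stats.ncf.sf(crit, df1, df2, adj_f**2 * n) >= power:
            return n                      # total sample size
        n += 1
```

As expected, a stronger covariate correlation reduces the nominal sample size requirement; the article's point is that this shortcut can misstate the true power under random covariates.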
This paper is concerned with supplementing statistical tests for the Rasch model so that, in addition to the probability of the error of the first kind (Type I probability), the probability of the error of the second kind (Type II probability) can be controlled at a predetermined level by basing the test on the appropriate number of observations. An approach to determining a practically meaningful extent of model deviation is proposed, and the approximate distribution of the Wald test is derived under the extent of model deviation of interest.
This paper refers to the exponential family of probability distributions and the conditional maximum likelihood (CML) theory. It is concerned with the determination of the sample size for three groups of tests of linear hypotheses, known as the fundamental trinity of Wald, score, and likelihood ratio tests. The main practical purpose concerns the special case of tests of the class of Rasch models. The theoretical background is discussed and the formal framework for sample size calculations is provided, given a predetermined deviation from the model to be tested and the probabilities of the errors of the first and second kinds.
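The paper's Rasch-specific derivations are beyond a short sketch, but the shared logic of the trinity can be illustrated generically: under a fixed model deviation, the Wald, score, and likelihood ratio statistics are asymptotically noncentral chi-square with noncentrality growing linearly in the number of observations. The per-observation noncentrality contribution `delta` below is an assumed input, not a quantity from the paper.

```python
from scipy import stats

def n_for_trinity_power(delta, df, alpha=0.05, power=0.8):
    """Generic sketch (not the paper's Rasch-specific formulas): find the
    smallest n at which a noncentral chi-square test statistic, with
    noncentrality n * delta, reaches the target power."""
    crit = stats.chi2.ppf(1 - alpha, df)  # Type I error fixed at alpha
    n = 1
    while stats.ncx2.sf(crit, df, n * delta) < power:
        n += 1
    return n
```

This captures the paper's design logic: fixing the Type I probability via the critical value, then choosing n so the Type II probability is also controlled.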
Chapter 7 introduces statistical power and effect size in hypothesis testing. Guidelines for interpreting effect size, along with ways of increasing statistical power, are provided. Point estimation and interval estimation and their relationship to population parameter estimates and the hypothesis-testing process are considered. Statistical significance is highly sensitive to large sample sizes. This means that researchers, in addition to selecting desired statistical significance p-values, need to know the magnitude of the treatment effect, or the effect size, of the behavior under consideration. Effect size determines sample size, and sample size is intimately related to statistical power, the likelihood of rejecting a false null hypothesis.
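The link between effect size and sample size can be illustrated with a standard power calculation for a two-sided two-sample t-test at the conventional α = .05 and power = .80; the effect sizes are Cohen's benchmarks, and the sketch is illustrative rather than from the chapter.

```python
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.8):
    """Smallest n per group for a two-sided two-sample t-test to reach
    the target power at standardized effect size d (illustrative)."""
    n = 2
    while True:
        df = 2 * n - 2
        crit = stats.t.ppf(1 - alpha / 2, df)
        ncp = d * (n / 2) ** 0.5          # noncentrality for equal groups
        if stats.nct.sf(crit, df, ncp) >= power:
            return n
        n += 1

for d in (0.2, 0.5, 0.8):                 # Cohen's small / medium / large
    print(f"d = {d}: {n_per_group(d)} participants per group")
```

Small effects demand many times more participants than large ones, which is why knowing the expected effect size is a prerequisite for planning sample size.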
Chapter 1 explores the link between the research process and theory and the role of statistics in scientific discovery. Discrete and continuous variables, the building blocks of methodology, take center stage, with clear and elaborate examples and their applicability to scales of measurement and measures of central tendency. Understanding statistics allows us to become better consumers of science and make better judgments and decisions about claims and facts allegedly supported by statistical results.
The proposal to improve reproducibility by lowering the significance threshold to 0.005 has been discussed, but its impact on conducting clinical trials has yet to be examined from a study design perspective. The impact on sample size and study duration was investigated using design setups from 125 phase II studies published between 2015 and 2022, assessed as the percent increase in sample size and the additional years of accrual, with medians of 110.97% and 2.65 years, respectively. The results indicate that this proposal would impose additional financial burdens and reduce the efficiency of conducting clinical trials.
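The direction of this effect can be seen with a simple two-sided z-test calculation, holding power fixed while the threshold drops from 0.05 to 0.005. This is a deliberate simplification: the surveyed phase II designs use a variety of test types and error rates, which is why their median increase exceeds the figure this sketch produces.

```python
from scipy import stats

def pct_increase_in_n(alpha_old=0.05, alpha_new=0.005, power=0.8):
    """Fractional increase in sample size for a two-sided z-test when the
    significance threshold is lowered, holding power fixed (simplified)."""
    z_beta = stats.norm.ppf(power)
    z_old = stats.norm.ppf(1 - alpha_old / 2)
    z_new = stats.norm.ppf(1 - alpha_new / 2)
    # n scales with (z_alpha + z_beta)^2, so the ratio of squares gives
    # the relative increase
    return ((z_new + z_beta) / (z_old + z_beta)) ** 2 - 1

print(f"{pct_increase_in_n():.0%} more participants")
```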
A strong participant recruitment plan is a major determinant of the success of human subjects research. The plan adopted by researchers will determine the kinds of inferences that follow from the collected data and how much it will cost to collect. Research studies with weak or non-existent recruitment plans risk recruiting too few participants or the wrong kind of participants to be able to answer the question that motivated them. This chapter outlines key considerations for researchers who are developing recruitment plans and provides suggestions for how to make recruiting more efficient.
Edited by Cait Lamberton, Wharton School, University of Pennsylvania; Derek D. Rucker, Kellogg School, Northwestern University, Illinois; and Stephen A. Spiller, Anderson School, University of California, Los Angeles
This chapter assesses how consumer research defines a “field experiment,” takes a look at trends in field experimentation in consumer research journals, explores the advantages and shortcomings of field experimentation, and assesses the status and value of open science practices for field experiments. These assessments render four insights. First, the field of consumer research does not have a consensus on the definition of field experiments, though an established taxonomy helps us determine the extent to which any given field experiment differs from traditional lab settings. Second, about 7 percent of the published papers in one of the top consumer psychology journals include some form of field experiment – a small but growing proportion. Third, although field experimentation can be useful for providing evidence of external validity and estimating real-world effect sizes, no single lab or field study offers complete generalizable insight. Instead, each well-designed, high-powered study adds to the collection of findings that converge to advance our understanding. Finally, open science practices are useful for bridging scientific findings in field experiments with real-life applications.
Our knowledge and theories about language acquisition are skewed towards urban languages, and primarily English (Kidd & Garcia, 2022). Cristia and colleagues convincingly show that studies on the acquisition of rural languages are scarce. The authors suggest that in rural settings, combining experimental and observational approaches is critical to testing and sharpening our theories about language acquisition. Nevertheless, they also acknowledge the numerous challenges that make it difficult to conduct, analyse and publish this type of work.
Randomization solves the problem of confounding bias; it addresses systematic error, which is the most important source of error, not chance. It equalizes all potential confounding factors, known and unknown, in all groups so that they equally influence the results, and thus can be ignored. Only then can the results of randomized treatment be interpreted at face value and causal inferences made. Sample size and other factors are relevant, though, and small randomized clinical trials (RCTs) can be misleading. Examples are given.
Standard techniques for assessing plumage damage to hens from feather pecking typically require capture and handling. Handling of individual birds for plumage assessment is relatively easy in experimental studies; however, close inspection of individual birds in commercial flocks is less feasible because catching birds is difficult, may compromise bird welfare and affect egg production. The aim of this study was to assess a non-intrusive method for scoring plumage damage in a commercial free-ranging flock of laying hens. Plumage damage was scored within a 2 m distance of the birds, without capture or handling, using a 5-point scale for 5 body regions. The feather scores recorded at a distance by two independent scorers were compared (distance scores), and were then compared with feather scores recorded by a scorer who caught and handled the birds to examine the plumage damage closely (capture scores). There was a significant and positive correlation between the distance scores and the capture scores, and the mean correlation coefficient for all plumage score traits was 0.89. There was also a significant and positive correlation between scorers, and the mean correlation coefficient for all plumage score traits was 0.84. The standard deviation of the residual mean difference between scorers and between methods was less than 1 point for individual body regions and less than 1.5 points for the total body score. Large variation in feather damage within a flock and small sample size increased the standard error of the mean total feather score. When feather damage variation within flocks is low (ie little observed feather damage), the current industry standard of scoring a sample of 100 birds is likely to provide a reliable estimate of flock feather damage; however, when there is large variation among birds of a flock (ie considerable observed feather damage), ≥200 birds should be inspected to accurately monitor changes in plumage condition.
The non-intrusive method of feather scoring described in this paper may be useful for commercial-scale feather pecking studies or for farmers who need to assess the plumage damage of their flocks reliably, quickly and with minimal disturbance or stress to the birds.
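A minimal sketch of the agreement statistics the study relies on, the correlation between scorers and the spread of their score differences, using hypothetical 5-point scores for ten birds (the numbers are invented for illustration, not data from the study):

```python
import numpy as np
from scipy import stats

# Hypothetical 5-point plumage scores for ten birds, each assessed by two
# scorers, illustrating the agreement statistics reported in the study.
scorer_a = np.array([1, 2, 2, 3, 4, 5, 3, 2, 4, 1])
scorer_b = np.array([1, 2, 3, 3, 4, 4, 3, 2, 5, 1])
r, p = stats.pearsonr(scorer_a, scorer_b)
diff_sd = np.std(scorer_a - scorer_b, ddof=1)  # spread of scorer disagreement
print(f"r = {r:.2f}, SD of score differences = {diff_sd:.2f}")
```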
Despite the particular relevance of statistical power to animal welfare studies, we noticed an apparent lack of sufficient information reported in papers published in Animal Welfare to facilitate post hoc calculation of statistical power for use in meta-analyses. We therefore conducted a survey of all papers published in Animal Welfare in 2009 to assess compliance with relevant instructions to authors, the level of statistical detail reported and the interpretation of results regarded as statistically non-significant. In general, we found good levels of compliance with the instructions to authors except in relation to the level of detail reported for the results of each test. Although not requested in the instructions to authors, exact P-values were reported in just over half of the tests, but effect size was not explicitly reported for any test; there was no reporting of a priori statistical analyses to determine sample size, and no formal assessment of non-significant results in relation to type II errors. As a first step towards addressing this, we recommend more reporting of a priori power analyses, more comprehensive reporting of the results of statistical analysis and the explicit consideration of possible statistical power issues when interpreting P-values. We also advocate the calculation of effect sizes and their confidence intervals and a greater emphasis on the interpretation of the biological significance of results rather than just their statistical significance. This will enhance the efforts that are currently being made to comply with the 3Rs, particularly the principle of reduction.
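The recommended effect sizes with confidence intervals can be computed from routinely reported group summaries; below is a sketch for Cohen's d with an approximate normal-theory interval (the function name and the input values are illustrative, not drawn from the surveyed papers):

```python
import math
from scipy import stats

def cohens_d_ci(m1, m2, sd1, sd2, n1, n2, conf=0.95):
    """Cohen's d from reported group means, SDs and sizes, with an
    approximate normal-theory confidence interval."""
    pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled
    # standard large-sample approximation to the SE of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(0.5 + conf / 2)
    return d, (d - z * se, d + z * se)

d, (lo, hi) = cohens_d_ci(10.0, 8.0, 2.0, 2.0, 20, 20)
print(f"d = {d:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Reporting the interval, not just d, makes clear how precisely the effect was estimated, which is the information needed for meta-analysis and for judging non-significant results.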
In-home pet food testing has the benefit of yielding data that are directly applicable to the pet population. Validated and standardised in-home test protocols need to be available, and here we investigated key protocol requirements for an in-home canine food digestibility protocol. Participants were recruited via an online survey. After meeting specific inclusion criteria, sixty dogs of various breeds and ages received, over 14 consecutive days, two complete dry extruded foods of relatively low and high digestibility, each containing titanium (Ti) dioxide. Each food was given for 7 d in a cross-over design. Owners collected faeces daily allowing daily faecal Ti concentrations and digestibility of nitrogen (N), dry matter (DM), crude ash, organic matter (OM), crude fat (Cfat), starch and gross energy (GE) to be determined. Faecal Ti and digestibility values for all nutrients were not different (P > 0·05) from the second day onwards after first consumption for both foods. One day of faecal collection yielded reliable digestibility values with additional collection days not reducing the confidence interval around the mean. Depending on the accepted margin of error, the food and the nutrient of interest, the minimal required sample size was between 9 and 43 dogs. Variation in digestibility values could in part be explained by a dog’s neuter status (N, crude ash) and age (crude ash, Cfat) but not sex and body size. Future studies should focus on further identifying and controlling sources of variation to improve the in-home digestibility protocol and reduce the number of dogs required.
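The dependence of the required number of dogs on the accepted margin of error follows the standard sample-size formula for estimating a mean. The sketch below assumes the between-dog standard deviation of a digestibility value is known, for example from pilot data; all numbers are illustrative, not the study's estimates.

```python
import math
from scipy import stats

def n_for_margin(sd, margin, conf=0.95):
    """Dogs needed so the mean digestibility estimate has the desired
    margin of error (normal approximation, SD assumed known)."""
    z = stats.norm.ppf(0.5 + conf / 2)
    return math.ceil((z * sd / margin) ** 2)

# e.g. a between-dog SD of 2 percentage points of digestibility:
print(n_for_margin(2, 1.0), n_for_margin(2, 0.5))
```

Halving the accepted margin of error roughly quadruples the required number of dogs, which is why the study's requirement ranges so widely across nutrients and foods.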
This chapter compares standard frequentist and more recent Bayesian approaches to logistic regression analyses. Starting out from a multifactorial case study of the verb help complemented by either the bare infinitive or the to-infinitive, the key components and the main conceptual differences of frequentist and Bayesian inference are discussed. Conceptually, the Bayesian rationale of directly testing hypotheses on the effects of multiple factors on an outcome variable is argued to be preferable and more sensitive than the conventional approach of testing null hypotheses. On the practical side, Bayesian statistics enables the researcher to recycle and integrate the results of previous analyses based on different datasets as informative priors, which can help improve and stabilize statistical modelling. Recourse to prior research can thus produce synergies and reduce data preparation expense. In cases of data sparsity, it can by the same token enable researchers to analyse small samples. Bayesian methods are thus put forward as powerful tools for overcoming the limitations of isolated corpus studies and for promoting synergies between data collected by individual researchers.
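The chapter's models are Bayesian logistic regressions, but the prior-recycling idea is easiest to see in a conjugate Beta-Binomial update; the token counts below are hypothetical and chosen only to illustrate how one study's posterior becomes the next study's informative prior.

```python
from scipy import stats

# Hypothetical counts for illustration only (not the chapter's data).
# Study 1: 30 of 50 help-tokens take the to-infinitive, flat Beta(1, 1) prior.
study1_posterior = (1 + 30, 1 + 20)
# Study 2 (a new dataset: 12 of 40 tokens) reuses that posterior as an
# informative prior instead of starting again from a flat one.
study2_posterior = (study1_posterior[0] + 12, study1_posterior[1] + 28)
mean = study2_posterior[0] / sum(study2_posterior)
lo, hi = stats.beta.interval(0.95, *study2_posterior)  # 95% credible interval
print(f"posterior mean = {mean:.3f}, 95% CrI = ({lo:.3f}, {hi:.3f})")
```

Because the prior carries the information from the first dataset, the second, smaller sample yields a tighter estimate than it would on its own, the synergy the chapter argues for.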
We propose that the representativeness of a corpus directly depends on its suitability for a specific research goal (including the domain and the linguistic feature(s) of interest). Creating a new corpus involves establishing linguistic research question(s); addressing domain considerations, including describing the domain, operationalizing the domain, evaluating the operational domain (relative to the full domain), designing the corpus, and evaluating the corpus (relative to the operational domain); addressing distribution considerations, including defining a linguistic variable and evaluating the required sample size; collecting the corpus; and documenting and reporting corpus design and representativeness. The steps for evaluating an existing corpus are similar: establishing linguistic research question(s); identifying and acquiring the corpus and its documentation; addressing domain considerations, including describing the domain and evaluating the operational domain relative to the full domain, and the corpus relative to the operational domain; addressing distribution considerations, including defining a linguistic variable and evaluating the required sample size; and documenting and reporting corpus design and representativeness. We conclude the book by arguing that corpus representativeness is important for both corpus designers/builders, and corpus researchers who need to evaluate whether a corpus is appropriate for their research goals.
Statistical issues are prominent in Alzheimer’s disease (AD) clinical trials due to the enormous challenges in this disease. The complexity of the disease and intervention pathways challenge drug discovery efforts, but measurement and analysis complexities and subjective outcomes also interfere with successful drug development. Variability across disease stage, disease sub-types, comorbidities and concomitant treatments (between-subject), and non-equivalent forms, good and bad days, and rater inconsistencies (within-subject) increase the chance of failure. AD-specific statistical expertise is critical for success, in contrast to most disease areas that require less disease-specific statistical knowledge. Use of global statistical tests and composites, correcting for covariates, and model selection increase the chance of a clearly positive outcome for active treatments and a clearly negative outcome for inactive or harmful treatments.
Medieval manuscripts are invaluable archives of the written history of our past. Manuscripts can be dated and localized paleographically, but this method has its limitations. The Fragmenta membranea manuscript collection at the National Library of Finland has proved difficult to date using paleographic methods. Radiocarbon dating has been applied to manuscripts of parchment before, but a systematic protocol for radiocarbon dating of parchment has not been established with a minimally destructive sampling strategy. In this work, we have established a radiocarbon dating procedure for parchments combining a clean-room based chemical pretreatment process, elemental analyzer combustion, automatic graphitization and accelerator mass spectrometry (AMS) measurements to reduce the AMS target size from a typical 1 mg of carbon. Prolonged acid treatment improved dating accuracy, consistent with the manufacturing process of medieval parchment, which involved a lime bath. Two different combustion processes were compared. The traditional closed tube combustion (CTC) method provided a well-established though labor-intensive way to produce 1 mg AMS targets. The Elemental Analyzer-based process (EA-HASE, Elemental Analyzer Helsinki Adaptive Sample prEparation line) is designed for fast combustion and smaller sample sizes. The EA-HASE process was capable of reproducing the simulated radiocarbon ages of known-age samples with AMS graphite target sizes of 0.3 mg of carbon, corresponding to a 3 mm2 area of a typical medieval parchment. The full potential of the process to go down to as little as 50 μg will be further explored in the future in parallel to studies of sample-specific contamination issues.