Your P-values are significant (or not), so what … now what?

Statistical significance, or lack thereof, is often erroneously interpreted as a measure of the magnitude of effects, correlations between variables or practical relevance of research results. However, calculated P-values do not provide any information of this sort. In contrast, effect sizes, as measured by effect size indices, complement the results of statistical hypothesis testing with information that is necessary to fully interpret data and draw meaningful conclusions. Effect size indices have been used extensively for decades in the medical, psychological and social sciences but have received scant attention in the plant sciences. This Technical Update focuses on (1) raising awareness of these important statistical tools for seed science research, (2) providing additional resources useful for incorporating effect sizes into research programmes and (3) encouraging further applications of these tools in our discipline.


Introduction
Consider the hypothetical information presented in Figure 1. Data like this are often followed by enthusiastic statements that observed responses were 'very highly significantly different' (Fig. 1A) or dispirited assertions that differences were 'not statistically significant' (Fig. 1B) when assessed against a theoretical level of statistical significance such as 0.05. Researchers then interpret these findings as evidence for large (Fig. 1A) or no (Fig. 1B) effects of the independent variables on response variables. Perhaps you have witnessed such examples at recent meetings or in publications. However, do comparisons of P-values against cut-off values (e.g., α = 0.05) grant researchers the ability to make claims about the magnitude of differences, the strength of association between variables, or the practical relevance of a study? No! They do not (Nickerson, 2000; Ellis, 2010; Aarts et al., 2014; Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016; Wasserstein et al., 2019).
Unfortunately, the problem of P-value misinterpretation is widespread and chronic. Authors link this problem to: (1) pervasive misunderstandings regarding the fundamentals of statistical hypothesis testing; (2) conflating the original intent of P-values as a test of evidence against a null with the later application of P-values in evidence-based decision-making frameworks and (3) shortcomings of statistical training programmes (Nickerson, 2000; Nuzzo, 2014; Greenland et al., 2016; Pernet, 2016; Wasserstein and Lazar, 2016). Regardless, consensus exists that the use of results from statistical hypothesis testing alone, especially when viewed through the lens of significance or non-significance, distorts conclusions (Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016; Kimmel et al., 2023). As Wasserstein and Lazar (2016) stated: 'Statistical significance is not equivalent to scientific, human, or economic significance' [p. 132]. In this brief paper, I aim to raise awareness of effect size measurements as necessary statistical tools, provide resources for further consideration and encourage more widespread use of effect sizes in the seed science literature.
A reminder of the information P-values provide
Fundamentally, a P-value represents the probability of observing a summary statistic (e.g., a mean difference between two groups) that is equal to or more extreme than the sample statistic given a specific statistical model (Wasserstein and Lazar, 2016). In more tangible terms, a P-value represents a measure of compatibility between observed data and the data expected if all assumptions of a test model (e.g., the null hypothesis) were correct. The smaller the P-value, the more unusual the observed data are relative to the test model. Conversely, the larger the P-value, the less unusual the observed data are relative to the test model. That's it! This is all the information a P-value provides the researcher, nothing else (Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016).
Crucially, notice how P-values provide no information regarding the magnitude of differences or the level of association between variables.
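To make the definition of a P-value concrete, the following minimal sketch computes an exact two-sided permutation P-value for a difference in means. The germination percentages, the group names and the function name are hypothetical and are not taken from Figure 1. The test model assumed here is that group labels are exchangeable, i.e., that the treatment has no effect.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    as_extreme = 0
    total = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabellings whose difference is equal to or more extreme
        # than the observed one (small tolerance for floating-point error).
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical final germination (%) for four replicate dishes per group.
control = [62, 65, 70, 68]
treated = [75, 80, 78, 85]

p = permutation_p_value(control, treated)
print(f"P = {p:.4f}")
```

Only 2 of the 70 possible relabellings are as extreme as the observed one, so P is about 0.029. That number describes how unusual the data are under the test model; it does not say whether a difference of roughly 13 percentage points matters in practice.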

The power of effect size indexes
It is important to note that sample size affects P-values. For instance, P-values typically decrease as sample size increases due to the impact of random error reduction. Moreover, variability decreases and measurements become more precise in large samples. Such improvements facilitate the detection of smaller differences (Cohen, 1988; Ellis, 2010; Greenland et al., 2016; Wasserstein and Lazar, 2016). This means that trivial differences or associations may be deemed statistically significant (e.g., Fig. 1A) if the sample size is large enough or measurements are highly precise. The reverse is also true. Non-trivial differences may show up as not statistically significant in studies with small sample sizes or imprecise measurements (Cohen, 1988; Ellis, 2010; Greenland et al., 2016; Wasserstein and Lazar, 2016). In contrast, effect size indices are independent of sample size (Cohen, 1988; Ellis, 2010).
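The dependence of P-values on sample size can be illustrated with a short sketch. It holds a standardized mean difference fixed at a hypothetical, trivially small value (d = 0.1) and varies only the number of replicates per group. For simplicity it assumes a two-sample z-test with equal group sizes and a known common standard deviation; a t-test would give somewhat different values at small sample sizes. The function name and sample sizes are illustrative assumptions.

```python
from math import erfc, sqrt


def two_sided_p_from_d(d, n_per_group):
    """Two-sided P-value of a two-sample z-test for standardized difference d.

    Assumes equal group sizes and a known common standard deviation, so the
    test statistic is z = d * sqrt(n / 2).
    """
    z = d * sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return erfc(abs(z) / sqrt(2))


d = 0.1  # hypothetical, trivially small standardized difference
p_values = {n: two_sided_p_from_d(d, n) for n in (20, 200, 2000, 20000)}

for n, p in p_values.items():
    print(f"n per group = {n:>6}: d = {d}, P = {p:.4g}")
```

The effect size is identical in every row, yet the P-value moves from far above 0.05 at 20 replicates per group to far below it at 2000, which is why a P-value alone cannot be read as a measure of magnitude.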
So, what are effect size indices? Effect size indices are statistics that quantify the magnitude of differences between treatment groups or experimental conditions and correlations between variables (Cohen, 1988; Ellis, 2010). Researchers may be familiar with some types of indices (Kallogjeri and Piccirillo, 2023) but may not have interpreted these as effect sizes. For example, the odds ratio, which is often calculated in connection with logistic regression, compares the odds of an event (e.g., fungal contamination) occurring in one group (e.g., seeds treated with fungicide A) with the odds in another group (e.g., seeds treated with fungicide B). Let's say a subsequent analysis yields an odds ratio equal to 1.86. This means that the odds of fungal contamination in seeds treated with fungicide A are 86% higher than the odds of fungal contamination in seeds treated with fungicide B. Similarly, the hazard ratio (HR), which is associated with regression-based time-to-event analyses in seed biology (McNair et al., 2012; Pérez and Kettner, 2013; Genna et al., 2015; Adegbola and Pérez, 2016; Genna and Pérez, 2016; Pérez and Kane, 2017; Tyler et al., 2017; Campbell-Martínez et al., 2019; Pérez and Chumana, 2020), represents the ratio of estimated hazard rates (i.e., likelihood of germination) between different covariate values (e.g., doses of a germination-stimulating chemical; treated vs. control) over a unit of time (Allison, 2010). Consider an experiment from a germination perspective where seeds received increasing doses of a germination inhibitor. In this case, the calculated HR equals 0.95. Applying the formula 100 ⋅ (HR − 1) yields the percent change in hazard for each 1-unit increase in the germination inhibitor dose: 100 ⋅ (0.95 − 1) = −5. Therefore, the likelihood of germination decreases by 5% for each 1-unit increase of inhibitor. Other types of indices, such as Hedges' g, Cramér's V, or eta squared (η²), may be less familiar, given the large number (around 70) of indices that exist (Cohen, 1988; Kirk, 2003; Ellis, 2010).
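The arithmetic behind these two interpretations can be sketched in a few lines. The ratios 1.86 and 0.95 come from the text; the function names, the 2 × 2 counts and the 10-unit dose increase are hypothetical. The last calculation additionally assumes that the log-hazard is linear in dose, as in a standard Cox-type regression, so that a per-unit HR compounds multiplicatively.

```python
def percent_change_from_ratio(ratio):
    """Percent change implied by a ratio-type effect size: 100 * (ratio - 1)."""
    return 100 * (ratio - 1)


def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Odds ratio from a 2 x 2 table: odds in group A divided by odds in group B."""
    return (events_a / non_events_a) / (events_b / non_events_b)


# Ratios quoted in the text.
or_change = percent_change_from_ratio(1.86)  # odds ratio example
hr_change = percent_change_from_ratio(0.95)  # hazard ratio example
print(f"Odds ratio 1.86   -> {or_change:+.1f}% change in odds")
print(f"Hazard ratio 0.95 -> {hr_change:+.1f}% change in hazard per unit dose")

# Hypothetical counts: contaminated vs. clean seeds under fungicides A and B.
example_or = odds_ratio(30, 70, 20, 80)
print(f"Odds ratio from hypothetical counts: {example_or:.2f}")

# Assumption: log-hazard linear in dose, so a 10-unit increase gives HR ** 10.
hr_10_units = 0.95 ** 10
print(f"HR for a 10-unit increase: {hr_10_units:.3f} "
      f"({percent_change_from_ratio(hr_10_units):+.1f}%)")
```

Note that the percent change for a multi-unit increase is not simply the per-unit change multiplied by the number of units; under the stated assumption, ten units correspond to roughly a 40% reduction in hazard rather than 50%.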
Effect size indices fall into the d or r families. Indices in the d family measure differences between groups. Indices of the r family measure associations between variables (Ellis, 2010; Kallogjeri and Piccirillo, 2023). Ellis (2010, see table 1.1) goes on to subdivide the d family into indices that compare groups on dichotomous outcomes (e.g., odds ratio) and those that compare groups on continuous outcomes (e.g., Cohen's d). Likewise, the r family is divided into indices assessing correlation (e.g., Cramér's V) or the proportion of variance (e.g., η²).
Selecting a suitable effect size index requires the consideration of several factors (Ellis, 2010; Kallogjeri and Piccirillo, 2023). For example, researchers should consider the research problem under investigation. This helps to identify study aims, target outcomes, data structure, measurement methods and study design. Next, researchers define whether outcomes or dependent variables are categorical, continuous or time-to-event in nature. Finally, researchers describe the type of analysis being conducted such as correlations, regressions, multivariate analysis or analysis of variance (ANOVA) with multiple groups. Researchers with this information in hand will find it easier to determine which index to use when referring to helpful tabulated resources (Ellis, 2010, see table 1.2) or decision trees (Kallogjeri and Piccirillo, 2023).
With the proper effect size index selected, researchers can then move on to analyses and interpretation. But first, consider these cautions. Different indices will provide different measurement scales corresponding to what constitutes small, medium or large effects (or association). For example, depending on the scientific discipline, a Pearson's correlation coefficient (r) value of 0.25 could be considered as a small association between variables of interest. However, a Cohen's d value of 0.25 can be deemed a medium effect size (Aarts et al., 2014). Therefore, it is challenging to compare indices that use different effect size criteria unless index conversion formulas are available. In some cases, it may not be possible to convert between indices. Additionally, the criteria for effect sizes of a specific index (e.g., Cohen's d) may not necessarily be applicable across disciplines. A small effect in seed science may not be the same as a small effect in medical research. Consequently, interpretations of effect sizes should be discipline-specific (Cohen, 1988; Ellis, 2010; Brydges, 2019). Finally, remember to report confidence intervals associated with the calculated effect size index. This provides a measure of precision of the effect size estimate and represents good statistical practice (Greenland et al., 2016; Wasserstein and Lazar, 2016; Wasserstein et al., 2019; Kallogjeri and Piccirillo, 2023).
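As one possible illustration of reporting an effect size together with its confidence interval, the sketch below computes Cohen's d from two groups using the pooled standard deviation and attaches an approximate 95% interval. The interval relies on a common large-sample normal approximation to the standard error of d, which is an assumption of this sketch; exact intervals based on the noncentral t distribution are preferable for small samples. The seedling lengths and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(group_a, group_b, confidence=0.95):
    """Cohen's d (pooled SD) with an approximate normal-theory confidence interval."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled standard deviation from the two sample variances.
    pooled_sd = sqrt(
        ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b))
        / (n_a + n_b - 2)
    )
    d = (mean(group_a) - mean(group_b)) / pooled_sd
    # Large-sample approximation to the standard error of d (assumption).
    se = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, d - z * se, d + z * se


# Hypothetical seedling lengths (cm), six replicates per group.
treated = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
control = [4.7, 4.9, 4.5, 5.0, 4.6, 4.8]

d, lower, upper = cohens_d_with_ci(treated, control)
print(f"Cohen's d = {d:.2f}, approximate 95% CI [{lower:.2f}, {upper:.2f}]")
```

With only six replicates per group the interval is wide, which is exactly the information about precision that a point estimate of the effect size cannot convey on its own.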
Researchers in the medical and social sciences have been applying effect size indices in their analyses for decades. Such a robust body of analyses often leads to the standardization and contextualization of small, medium and large effects within a discipline (Cohen, 1988; Ellis, 2010). In contrast, apart from ecology, the utilization of effect sizes in many plant-related disciplines including seed science has been negligible (Sileshi, 2012). An important outcome is that the standardization of small, medium and large effects for some indices will be absent. To remedy this, Cohen (1988) cautiously suggested using criteria outlined in his publication when no discipline-specific criteria exist. For example, values of Cohen's d = 0.2, 0.5, and 0.8 represent benchmarks for small, medium and large effect sizes. But the use of various general criteria offered by Cohen (1988) must be tempered with the researcher's experience and wisdom. For instance, a five-percentage point difference in a laboratory germination test (e.g., 93 vs. 98%) for lettuce seeds may turn out to be a small effect. Nonetheless, when scaled to the field level, this difference can have a substantial impact since the success of a lettuce crop may rely on each sown seed producing a harvestable head. So, context is essential when interpreting effect sizes. Otherwise, criteria such as small, medium or large may remain ambiguous (Cohen, 1988; Ellis, 2010; Carey et al., 2023).
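The lettuce example can be sketched numerically. The index used here, Cohen's h for a difference between two proportions, is my choice for illustration and is not named in the text; Cohen (1988) offers the same 0.2, 0.5 and 0.8 benchmarks for it. The number of seeds sown and the function names are hypothetical.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference between two arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def cohen_label(value, small=0.2, medium=0.5, large=0.8):
    """Label an effect size using Cohen's general benchmarks (not discipline-specific)."""
    size = abs(value)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"


# Germination proportions from the lettuce example in the text.
h = cohens_h(0.98, 0.93)
label = cohen_label(h)
print(f"Cohen's h = {h:.2f} ({label} by Cohen's general benchmarks)")

# Hypothetical field scale: number of seeds sown is an assumption.
seeds_sown = 100_000
extra_seedlings = round(seeds_sown * (0.98 - 0.93))
print(f"Difference at field scale: {extra_seedlings} seedlings per {seeds_sown} seeds")
```

The benchmark label and the field-scale count describe the same difference, which illustrates why a generic label such as 'small' needs to be read against the practical context of the study.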

More information on effect sizes
Reporting and interpretation of effect sizes in the seed science literature is rare (Sileshi, 2012), suggesting that effect sizes represent new concepts for our discipline. If the topic of effect sizes is new to you, then a good place to start is with easy-to-digest reading materials (Table 1). Fortunately, most statistical analysis programmes have the capacity to calculate many effect size indices (Table 1). These programmes also tend to provide adequate documentation explaining available indices. If statistical programmes are unavailable or inaccessible, then various websites provide applications to calculate effect sizes (Table 1). Similarly, calculations for several effect sizes are straightforward (Cohen, 1988; Ellis, 2010) and can easily be computed in a spreadsheet or by hand if necessary (Table 1).
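As an example of how little computation some indices require, the following sketch calculates eta squared (η²) for a one-way design as the ratio of the between-group sum of squares to the total sum of squares. The data and the function name are hypothetical, and the same steps could be reproduced in a spreadsheet.

```python
from statistics import mean


def eta_squared(groups):
    """Eta squared for a one-way design: SS_between / SS_total."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Total variation of every observation around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


# Hypothetical days to 50% germination under three treatments.
treatments = [
    [10, 12, 11],
    [14, 15, 16],
    [20, 19, 21],
]

eta2 = eta_squared(treatments)
print(f"eta squared = {eta2:.3f}")
```

Here the between-group sum of squares is 122 and the total is 128, so about 95% of the variance in these made-up data is associated with treatment.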

Concluding remarks
Effect size indices are powerful tools crucial for extending our results beyond mere statistical significance. Moreover, effect sizes are important to ensure that our studies are properly powered rather than underpowered (Cohen, 1988; Ellis, 2010; Nuzzo, 2014; Greenland et al., 2016; Brydges, 2019; Kimmel et al., 2023; also see Table 1). Effect size indices are simple to apply. More importantly, the information these indices yield contributes to more impactful conclusions relevant to broader audiences while moving a discipline forward. Therefore, I strongly encourage the use of effect sizes in future seed science research.
Nuzzo (2014), Greenland et al. (2016), Wasserstein and Lazar (2016), and Wasserstein et al. (2019) offer more complete descriptions of P-value misinterpretations, provide an excellent refresher on what P-values do and do not represent, and explain what not to do with P-values while offering meaningful actions researchers can take in the context of statistical analyses.

Figure 1. Hypothetical experimental results of seed biology experiments displaying responses that are considered (A) highly and (B) not significantly different according to the common statistical cut-off value of α = 0.05.

Table 1. List of additional resources related to effect sizes