Multiple hypothesis testing in experimental economics

John A. List; Azeem M. Shaikh; Yang Xu

doi:10.1007/s10683-018-09597-5

Published online by Cambridge University Press: 14 March 2025

John A. List (jlist@uchicago.edu)
Azeem M. Shaikh (amshaikh@uchicago.edu)
Yang Xu (yangxu@uchicago.edu)
Affiliation (all authors): Department of Economics, University of Chicago, 5757 S University Ave, Chicago, IL 60637, USA

Abstract

The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
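The Bonferroni and Holm corrections named above as classical benchmarks can be sketched in a few lines. This is a minimal illustration of those two baselines only, not the authors' bootstrap-based procedure (their Stata and Matlab code is referenced in the footnote). The function names and the p-values are hypothetical and chosen purely for illustration.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm (1979) stepdown adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1); rank equals k - 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted p-values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical unadjusted p-values for four outcomes (illustrative, not from the paper).
raw = [0.010, 0.040, 0.015, 0.005]
alpha = 0.05

bonf = bonferroni_adjust(raw)
holm = holm_adjust(raw)

# Reject a null hypothesis when its adjusted p-value is at most alpha.
rejected_bonf = [p <= alpha for p in bonf]
rejected_holm = [p <= alpha for p in holm]

print("Bonferroni:", [round(p, 4) for p in bonf], rejected_bonf)
print("Holm:      ", [round(p, 4) for p in holm], rejected_holm)
```

With these illustrative inputs, Holm rejects all four null hypotheses while Bonferroni rejects only two. Both corrections control the familywise error rate without using any information about the dependence among the test statistics, which is the source of the power loss the abstract describes.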

Keywords

Experiments; Multiple hypothesis testing; Multiple treatments; Multiple outcomes; Multiple subgroups; Randomized controlled trial; Bootstrap; Balance

JEL classification

C12: Hypothesis Testing: General
C14: Semiparametric and Nonparametric Methods: General

Information

Type: Original Paper
Information: Experimental Economics, Volume 22, Issue 4, December 2019, pp. 773–793

DOI: https://doi.org/10.1007/s10683-018-09597-5
Copyright: Copyright © 2019 Economic Science Association

Footnotes

Documentation of our procedures and our Stata and Matlab code can be found at https://github.com/seidelj/mht.

References

Anderson, M (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training projects. Journal of the American Statistical Association, 103(484), 1481–1495. doi:10.1198/016214508000000841

Bettis, RA (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33(1), 108–113. doi:10.1002/smj.975

Bhattacharya, J, Shaikh, AM, & Vytlacil, E (2012). Treatment effect bounds: An application to Swan-Ganz catheterization. Journal of Econometrics, 168(2), 223–243. doi:10.1016/j.jeconom.2012.01.001

Bonferroni, CE (1935). Il calcolo delle assicurazioni su gruppi di teste, Rome: Tipografia del Senato.

Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.

Camerer, CF, Dreber, A, Forsell, E, Ho, T-H, Huber, J, Johannesson, M, Kirchler, M, Almenberg, J, Altmejd, A, Chan, T, et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. doi:10.1126/science.aaf0918

Fink, G, McConnell, M, & Vollmer, S (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. Journal of Development Effectiveness, 6(1), 44–57. doi:10.1080/19439342.2013.875054

Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished Manuscript.

Flory, JA, Leibbrandt, A, & List, JA (2015). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. The Review of Economic Studies, 82(1), 122–155. doi:10.1093/restud/rdu030

Gneezy, U, Niederle, M, & Rustichini, A (2003). Performance in competitive environments: Gender differences. The Quarterly Journal of Economics, 118(3), 1049–1074. doi:10.1162/00335530360698496

Heckman, J, Moon, SH, Pinto, R, Savelyev, P, & Yavitz, A (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. Quantitative Economics, 1(1), 1–46. doi:10.3982/QE8

Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.

Holm, S (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Hossain, T, & List, JA (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. Management Science, 58(12), 2151–2167. doi:10.1287/mnsc.1120.1544

Ioannidis, J (2005). Why most published research findings are false. PLoS Med, 2(8), e124. doi:10.1371/journal.pmed.0020124

Jennions, MD, & Moller, AP (2002). Publication bias in ecology and evolution: An empirical assessment using the ‘trim and fill’ method. Biological Reviews of the Cambridge Philosophical Society, 77(02), 211–222. doi:10.1017/S1464793101005875

Karlan, D, & List, JA (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. The American Economic Review, 97(5), 1774–1793. doi:10.1257/aer.97.5.1774

Kling, J, Liebman, J, & Katz, L (2007). Experimental analysis of neighborhood effects. Econometrica, 75(1), 83–119. doi:10.1111/j.1468-0262.2007.00733.x

Lee, S, & Shaikh, AM (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of PROGRESA on school enrollment. Journal of Applied Econometrics, 29(4), 612–626. doi:10.1002/jae.2327

Lehmann, E, & Romano, J (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154. doi:10.1214/009053605000000084

Lehmann, EL, & Romano, JP (2006). Testing statistical hypotheses, Berlin: Springer.

Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research w18165. doi:10.3386/w18165

List, JA, & Samek, AS (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. Journal of Health Economics, 39, 135–146. doi:10.1016/j.jhealeco.2014.11.002

Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables, and the sign of the average treatment effect. Unpublished Manuscript, Getulio Vargas Foundation, University of Chicago, and New York University. [2049].

Maniadis, Z, Tufano, F, & List, JA (2014). One swallow doesn’t make a summer: New evidence on anchoring effects. The American Economic Review, 104(1), 277–290. doi:10.1257/aer.104.1.277

Niederle, M, & Vesterlund, L (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067–1101. doi:10.1162/qjec.122.3.1067

Nosek, BA, Spies, JR, & Motyl, M (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. doi:10.1177/1745691612459058

Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In Lecture Notes-Monograph Series (pp. 33–50). doi:10.1214/074921706000000383

Romano, JP, & Shaikh, AM (2006). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 34, 1850–1873. doi:10.1214/009053606000000461

Romano, JP, & Shaikh, AM (2012). On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6), 2798–2822. doi:10.1214/12-AOS1051

Romano, JP, Shaikh, AM, & Wolf, M (2008). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3), 417–442. doi:10.1007/s11749-008-0126-6

Romano, JP, Shaikh, AM, & Wolf, M (2008). Formalized data snooping based on generalized error rates. Econometric Theory, 24(02), 404–447. doi:10.1017/S0266466608080171

Romano, JP, & Wolf, M (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237–1282. doi:10.1111/j.1468-0262.2005.00615.x

Romano, JP, & Wolf, M (2010). Balanced control of generalized error rates. The Annals of Statistics, 38, 598–633. doi:10.1214/09-AOS734

Sutter, M, & Glätzle-Rützler, D (2014). Gender differences in the willingness to compete emerge early in life and persist. Management Science, 61(10), 2339–2354. doi:10.1287/mnsc.2014.1981

Westfall, PH, & Young, SS (1993). Resampling-based multiple testing: Examples and methods for p value adjustment, New York: Wiley.
