People often probability match: they select choices based on the probability of outcomes. For example, when predicting 10 individual results of a spinner with 7 green and 3 purple sections, many people choose green mostly but not always, even though they would be better off always choosing it (i.e., maximizing). This behavior has perplexed cognitive scientists for decades. Why do people make such an obvious error? Here, we provide evidence that this difficulty may often arise from statistical naïveté: Even when shown the optimal strategy of maximizing, many people fail to recognize that it will produce better payouts than other strategies. In 3 preregistered experiments (N = 907 Americans tested online), participants made 10 choices in a spinner game and estimated the payout for each of 3 strategies: probability matching, maximizing, and 50/50 guessing. The key finding across experiments is that while most maximizers recognize that maximizing results in higher payouts than matching, probability matchers predict similar payouts for each.
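As a rough illustration of the payout arithmetic at stake, the expected number of correct predictions under each strategy can be computed directly; the sketch below assumes a 0.7/0.3 spinner, 10 trials and one point per correct prediction (illustrative values, not the experiments' actual payoffs).

```python
# Expected number of correct predictions over 10 spins of a
# 7-green / 3-purple spinner (illustrative payout arithmetic).
p_green, n_trials = 0.7, 10

# Maximizing: always predict green.
maximizing = n_trials * p_green                              # 7.0

# Probability matching: predict green 70% of the time, purple 30%.
matching = n_trials * (p_green**2 + (1 - p_green)**2)        # 5.8

# 50/50 guessing.
guessing = n_trials * 0.5                                    # 5.0

print(f"maximizing: {maximizing:.1f}, matching: {matching:.1f}, guessing: {guessing:.1f}")
```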
In this chapter we start by reviewing the different types of inference procedures: frequentist, Bayesian, parametric and non-parametric. We introduce notation by providing a list of the probability distributions that will be used later on, together with their first two moments. We review some results on conditional moments and work through several examples. We review definitions of stochastic processes, stationary processes and Markov processes, and finish by introducing the most common discrete-time stochastic processes that show dependence in time and space.
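As an illustrative example of the conditional-moment results such a review typically covers (ours, not necessarily the chapter's), the laws of total expectation and variance state that

```latex
\mathbb{E}[Y] = \mathbb{E}\big[\mathbb{E}[Y \mid \Lambda]\big],
\qquad
\operatorname{Var}(Y) = \mathbb{E}\big[\operatorname{Var}(Y \mid \Lambda)\big]
                      + \operatorname{Var}\big(\mathbb{E}[Y \mid \Lambda]\big).
```

For instance, if Y | Λ ~ Poisson(Λ) and Λ ~ Gamma(α, β) with rate β, these give E[Y] = α/β and Var(Y) = α/β + α/β², i.e. overdispersion relative to the Poisson.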
Confidence intervals are ubiquitous in the presentation of social science models, data, and effects. When several intervals are plotted together, one natural inclination is to ask whether the estimates represented by those intervals are significantly different from each other. Unfortunately, there is no general rule or procedure that would allow us to answer this question from the confidence intervals alone. It is well known that using the overlaps in 95% confidence intervals to perform significance tests at the 0.05 level does not work. Recent scholarship has developed and refined a set of tools for inferential confidence intervals that permit inference on confidence intervals with the appropriate type I error rate in many different bivariate contexts. These are all based on the same underlying idea of identifying the multiple of the standard error (i.e., a new confidence level) such that the overlap in confidence intervals matches the desired type I error rate. These procedures remain stymied by multiple simultaneous comparisons. We propose an entirely new procedure for developing inferential confidence intervals that decouples testing from visualization and can overcome many of these problems in any visual testing scenario. We provide software in R and Stata to accomplish this goal.
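One way to read the "multiple of the standard error" idea described above, for two independent estimates, is sketched below: the half-widths are rescaled so that the intervals just fail to overlap exactly when the usual two-sample z-test rejects at level alpha. The function name and the independence assumption are ours, for illustration only.

```python
import numpy as np
from scipy.stats import norm

def inferential_cis(b1, se1, b2, se2, alpha=0.05):
    """Rescale CI half-widths so that non-overlap corresponds to a
    level-alpha z-test of b1 == b2 (independent estimates assumed)."""
    z = norm.ppf(1 - alpha / 2)
    # Choose gamma so that gamma*(se1 + se2) == z*sqrt(se1**2 + se2**2).
    gamma = z * np.sqrt(se1**2 + se2**2) / (se1 + se2)
    return ((b1 - gamma * se1, b1 + gamma * se1),
            (b2 - gamma * se2, b2 + gamma * se2))

ci1, ci2 = inferential_cis(b1=0.50, se1=0.10, b2=0.20, se2=0.12)
print(ci1, ci2)  # these intervals overlap iff |b1 - b2| < z*sqrt(se1^2 + se2^2)
```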
When analyzing data, researchers make some choices that are either arbitrary, based on subjective beliefs about the data-generating process, or for which equally justifiable alternative choices could have been made. This wide range of data-analytic choices can be abused and has been one of the underlying causes of the replication crisis in several fields. The recent introduction of multiverse analysis provides researchers with a method to evaluate the stability of results across the reasonable choices that could be made when analyzing data. However, multiverse analysis is confined to a descriptive role and lacks a proper and comprehensive inferential procedure. Specification curve analysis adds an inferential procedure to multiverse analysis, but this approach is limited to simple cases related to the linear model, and it only allows researchers to infer whether at least one specification rejects the null hypothesis, not which specifications should be selected. In this paper, we present a Post-selection Inference approach to Multiverse Analysis (PIMA), a flexible and general inferential approach that accounts for all possible models, i.e., the multiverse of reasonable analyses. The approach allows for a wide range of data specifications (i.e., preprocessing choices) and any generalized linear model; it tests the null hypothesis that a given predictor is not associated with the outcome by combining information from all reasonable models of the multiverse analysis, and it provides strong control of the family-wise error rate, allowing researchers to claim that the null hypothesis can be rejected for any specification that shows a significant effect. The inferential proposal is based on a conditional resampling procedure. We formally prove that the Type I error rate is controlled, and we compute the statistical power of the test through a simulation study. Finally, we apply the PIMA procedure to the analysis of a real dataset on self-reported hesitancy for the COronaVIrus Disease 2019 (COVID-19) vaccine before and after the 2020 lockdown in Italy. We conclude with practical recommendations to be considered when implementing the proposed procedure.
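A much-simplified sketch of the family-wise error control idea (not the authors' exact conditional resampling procedure) is to compare each specification's test statistic with the permutation distribution of the maximum statistic across the whole multiverse; everything below (the set of outcome preprocessings, the permutation scheme) is illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def multiverse_max_t(y_variants, x, n_perm=1000):
    """Max-statistic permutation test across a multiverse of preprocessed
    outcomes; a simplified stand-in for PIMA's conditional resampling."""
    def abs_t(x_vec):
        return np.array([abs(sm.OLS(y, sm.add_constant(x_vec)).fit().tvalues[1])
                         for y in y_variants])

    observed = abs_t(x)
    # Null distribution of the maximum |t| over specifications,
    # obtained by permuting the predictor.
    max_null = np.array([abs_t(rng.permutation(x)).max() for _ in range(n_perm)])
    # Adjusted p-values: each specification is compared with the max-|t| null.
    return [(1 + np.sum(max_null >= t)) / (1 + n_perm) for t in observed]
```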
Bringing together idiomatic Python programming, foundational numerical methods, and physics applications, this is an ideal standalone textbook for courses on computational physics. All the frequently used numerical methods in physics are explained, including foundational techniques and hidden gems on topics such as linear algebra, differential equations, root-finding, interpolation, and integration. The second edition of this introductory book features several new codes and 140 new problems (many on physics applications), as well as new sections on the singular-value decomposition, derivative-free optimization, Bayesian linear regression, neural networks, and partial differential equations. The last section in each chapter is an in-depth project, tackling physics problems that cannot be solved without the use of a computer. Written primarily for students studying computational physics, this textbook brings the non-specialist quickly up to speed with Python before looking in detail at the numerical methods often used in the subject.
A major concern in the social sciences is understanding and explaining the relationship between two variables. We showed in Chapter 5 how to address this issue using tabular presentations. In this chapter we show how to address the issue statistically via regression and correlation. We first cover the two concepts of regression and correlation. We then turn to the issue of statistical inference and ways of evaluating the statistical significance of our results. Since most social science research is undertaken using sample data, we need to determine whether the regression and correlation coefficients we calculate using the sample data are statistically significant in the larger population from which the sample data were drawn.
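For a concrete (hypothetical) example of estimating a regression slope and a correlation and assessing their significance from sample data:

```python
from scipy.stats import linregress

# Hypothetical sample: years of education (x) and income in $1,000s (y).
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [28, 34, 31, 40, 45, 48, 52, 60]

res = linregress(x, y)
print(f"slope = {res.slope:.2f}, r = {res.rvalue:.2f}, p = {res.pvalue:.4f}")
# A small p-value indicates the sample slope would be unlikely if the
# slope in the larger population were zero.
```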
Community detection is one of the most important methodological fields of network science, and one which has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a noticeable gap between what is arguably the state-of-the-art and the methods which are actually used in practice in a variety of fields. The Elements attempts to address this discrepancy by dividing existing methods according to whether they have a 'descriptive' or an 'inferential' goal. While descriptive methods find patterns in networks based on context-dependent notions of community structure, inferential methods articulate a precise generative model, and attempt to fit it to data. In this way, they are able to provide insights into formation mechanisms and separate structure from noise. This title is also available as open access on Cambridge Core.
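As a rough illustration of the descriptive/inferential distinction (our example, not from the Element), the snippet below finds communities descriptively by modularity maximization; an inferential alternative would fit a generative model such as a stochastic block model, for which a dedicated library is needed.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()

# Descriptive: maximize modularity, a context-dependent notion of community.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])

# Inferential: fit a generative model (e.g. a stochastic block model) and let
# model selection separate structure from noise. networkx has no SBM fitter;
# one option (assumed available) is graph-tool:
#   import graph_tool.all as gt
#   state = gt.minimize_blockmodel_dl(g)
```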
From observed data, statistical inference infers the properties of the underlying probability distribution. For hypothesis testing, the t-test and some non-parametric alternatives are covered. Ways to infer confidence intervals and estimate goodness of fit are followed by the F-test (for comparing variances) and the Mann-Kendall trend test. Bootstrap sampling and field significance are also covered.
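A minimal sketch of two of these tools on hypothetical data (a two-sample t-test and a percentile bootstrap confidence interval for the difference in means):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)   # hypothetical sample A
b = rng.normal(0.5, 1.0, 30)   # hypothetical sample B

# Two-sample t-test for a difference in means.
t, p = stats.ttest_ind(a, b)

# Percentile bootstrap CI for the difference in means.
diffs = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
         for _ in range(5000)]
ci = np.percentile(diffs, [2.5, 97.5])
print(f"t = {t:.2f}, p = {p:.3f}, 95% bootstrap CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```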
A link is made between epistemology – that is to say, the philosophy of knowledge – and statistics. Hume's criticism of induction is covered, as is Popper's. Various philosophies of statistics are described.
Katz, King, and Rosenblatt (2020, American Political Science Review 114, 164–178) introduces a theoretical framework for understanding redistricting and electoral systems, built on basic statistical and social science principles of inference. DeFord et al. (2021, Political Analysis, this issue) instead focuses solely on descriptive measures, which lead to the problems identified in our article. In this article, we illustrate the essential role of these basic principles and then offer statistical, mathematical, and substantive corrections required to apply DeFord et al.’s calculations to social science questions of interest, while also showing how to easily resolve all claimed paradoxes and problems. We are grateful to the authors for their interest in our work and for this opportunity to clarify these principles and our theoretical framework.
This chapter focuses on critical infrastructures in the power grid, which often rely on Industrial Control Systems (ICS) to operate and are exposed to vulnerabilities ranging from physical damage to the injection of information that appears to be consistent with industrial control protocols. In this way, infiltration of the firewalls protecting the perimeter of the control network becomes a significant threat. The goal of this chapter is to review identification and intrusion detection algorithms for protecting the power grid, based on knowledge of the expected behavior of the system.
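A generic sketch of behavior-based detection of the kind referred to above (ours, not one of the specific algorithms reviewed in the chapter): flag a measurement vector whose residuals against the expected system behavior exceed a chi-square threshold.

```python
import numpy as np
from scipy.stats import chi2

def flag_anomaly(measured, expected, sigma, alpha=0.01):
    """Flag measurements whose normalized squared residuals against the
    model-predicted behavior exceed a chi-square threshold (generic sketch)."""
    r = (np.asarray(measured) - np.asarray(expected)) / np.asarray(sigma)
    stat = float(np.sum(r ** 2))
    threshold = chi2.ppf(1 - alpha, df=len(r))
    return stat > threshold, stat, threshold

print(flag_anomaly(measured=[1.02, 0.97, 2.40],
                   expected=[1.00, 1.00, 1.00],
                   sigma=[0.05, 0.05, 0.05]))
```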
We develop a model that successfully learns social and organizational human network structure using ambient sensing data from distributed plug load energy sensors in commercial buildings. A key goal for the design and operation of commercial buildings is to support the success of organizations within them. In modern workspaces, a particularly important goal is collaboration, which relies on physical interactions among individuals. Learning the true socio-organizational relational ties among workers can therefore help managers of buildings and organizations make decisions that improve collaboration. In this paper, we introduce the Interaction Model, a method for inferring human network structure that leverages data from distributed plug load energy sensors. In a case study, we benchmark our method against network data obtained through a survey and compare its performance to other data-driven tools. We find that, unlike previous methods, our method infers a network that is correlated with the survey network to a statistically significant degree (graph correlation of 0.46, significant at the 0.01 level). We additionally find that our method requires only 10 weeks of sensing data, enabling dynamic network measurement. Learning human network structure through data-driven means can enable the design and operation of spaces that encourage, rather than inhibit, the success of organizations.
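The reported graph correlation can be computed in more than one way; one simple version (our illustration, not necessarily the paper's exact procedure) correlates the off-diagonal entries of the inferred and survey adjacency matrices and assesses significance with a QAP-style permutation of node labels.

```python
import numpy as np

rng = np.random.default_rng(2)

def graph_correlation(A, B):
    """Pearson correlation between off-diagonal adjacency entries."""
    mask = ~np.eye(A.shape[0], dtype=bool)
    return np.corrcoef(A[mask], B[mask])[0, 1]

def qap_test(A, B, n_perm=2000):
    """QAP-style test: permute node labels of B and recompute the correlation."""
    obs = graph_correlation(A, B)
    exceed = sum(graph_correlation(A, B[np.ix_(p, p)]) >= obs
                 for p in (rng.permutation(A.shape[0]) for _ in range(n_perm)))
    return obs, (1 + exceed) / (1 + n_perm)
```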
Quantitative comparative social scientists have long worried about the performance of multilevel models when the number of upper-level units is small. Adding to these concerns, an influential Monte Carlo study by Stegmueller (2013) suggests that standard maximum-likelihood (ML) methods yield biased point estimates and severely anti-conservative inference with few upper-level units. In this article, the authors seek to rectify this negative assessment. First, they show that ML estimators of coefficients are unbiased in linear multilevel models. The apparent bias in coefficient estimates found by Stegmueller can be attributed to Monte Carlo error and a flaw in the design of his simulation study. Second, they demonstrate how inferential problems can be overcome by using restricted ML estimators for variance parameters and a t-distribution with appropriate degrees of freedom for statistical inference. Thus, accurate multilevel analysis is possible within the framework that most practitioners are familiar with, even if there are only a few upper-level units.
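A minimal sketch of that recipe on synthetic two-level data, assuming statsmodels for the REML fit and a t reference distribution whose degrees of freedom equal the number of groups minus the number of level-2 predictors minus one; the data, column names, and the specific degrees-of-freedom rule as coded here are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import t as t_dist

# Synthetic two-level data: 15 countries, 50 respondents each (illustrative).
rng = np.random.default_rng(3)
n_groups, n_per = 15, 50
x_country = rng.normal(size=n_groups)            # level-2 predictor
u = rng.normal(scale=0.5, size=n_groups)         # random intercepts
data = pd.DataFrame({
    "country": np.repeat(np.arange(n_groups), n_per),
    "x_country": np.repeat(x_country, n_per),
})
data["y"] = 0.3 * data["x_country"] + u[data["country"]] + rng.normal(size=len(data))

# Random-intercept model fit by REML for the variance parameters.
fit = smf.mixedlm("y ~ x_country", data=data, groups=data["country"]).fit(reml=True)
coef, se = fit.params["x_country"], fit.bse["x_country"]
dof = n_groups - 1 - 1                           # groups - level-2 predictors - 1
p = 2 * t_dist.sf(abs(coef / se), dof)
print(f"coef = {coef:.3f}, se = {se:.3f}, t-based p (df = {dof}) = {p:.3f}")
```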
Intensive week-long Summer Schools in Statistics for Astronomers were initiated at Penn State in 2005 and have been continued annually. Due to their popularity and high demand, additional full summer schools have been organized in India, in Brazil, and at the Space Telescope Science Institute.
The Summer Schools seek to give a broad exposure to fundamental concepts and a wide range of resulting methods across many fields of statistics. The Summer Schools in statistics and data analysis for young astronomers present concepts and methodologies, with hands-on tutorials using data from astronomical surveys.
In this paper, we use queuing theory to model the number of insured households in an insurance portfolio. The model is based on an idea from Boucher and Couture-Piché (2015), who use a queuing theory model to estimate the number of insured cars on an insurance contract. Similarly, the proposed model includes households already insured, but the modeling approach is modified to include new households that could be added to the portfolio. For each household, we also use the queuing theory model to estimate the number of insured cars. We analyze an insurance portfolio from a Canadian insurance company to support this discussion. Statistical inference techniques serve to estimate each parameter of the model, even in cases where some explanatory variables are included in each of these parameters. We show that the proposed model offers a reasonable approximation of what is observed, but we also highlight the situations where the model should be improved. By assuming that the insurance company makes a $1 profit for each one-year car exposure, the proposed approach allows us to determine a global value of the insurance portfolio of an insurer based on the customer equity concept.
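A toy birth-death simulation (ours) conveys the queuing idea: cars are added to a household at one rate and each insured car leaves at another, and the accumulated car-years of exposure translate into portfolio value under the $1-per-car-year assumption. All rates below are arbitrary placeholders, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def household_exposure(years=10.0, add_rate=0.3, drop_rate=0.2, start=1):
    """Simulate a birth-death process for the number of insured cars in one
    household and return the total car-years of exposure (toy parameters)."""
    t, n, exposure = 0.0, start, 0.0
    while t < years:
        total_rate = add_rate + n * drop_rate
        dt = min(rng.exponential(1.0 / total_rate), years - t)
        exposure += n * dt
        t += dt
        if t < years:                      # an event occurred before the horizon
            n += 1 if rng.random() < add_rate / total_rate else -1
    return exposure

# Portfolio value at $1 profit per car-year, over 10,000 simulated households.
value = sum(household_exposure() for _ in range(10_000))
print(f"estimated portfolio value ≈ ${value:,.0f}")
```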
Visual displays of data in the parasitology literature are often presented in a way that is not very informative about the distribution of the data. One example is the simple bar chart with half an error bar on top, used to display the distribution of parasitaemia and biomarkers of host immunity. Such displays obscure the shape of the data distribution by showing too few statistical measures to convey the spread of the data and by relying on measures that are influenced by skewness and outliers. We describe more informative, yet simple, visual representations of the data distribution commonly used in statistics and provide guidance on the display of estimates of population parameters (e.g. the population mean) and measures of precision (e.g. the 95% confidence interval) for statistical inference. In this article we focus on visual displays for numerical data and demonstrate such displays using an example dataset consisting of total IgG titres in response to three Plasmodium blood antigens measured in pregnant women, together with parasitaemia measurements from the same study. This tutorial aims to highlight the importance of displaying the data distribution appropriately and the role such displays have in selecting statistics to summarize the distribution and to perform statistical inference.
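A minimal matplotlib sketch of the kind of display recommended here, using simulated (not real) titre data: plot every observation, a boxplot for the distribution, and the mean with a 95% confidence interval for inference, rather than a bar with half an error bar.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
# Simulated, skewed antibody-titre data for three antigens (illustrative only).
groups = {f"Antigen {i}": rng.lognormal(mean=0.3 * i, sigma=0.8, size=60)
          for i in (1, 2, 3)}

fig, ax = plt.subplots()
for k, vals in enumerate(groups.values(), start=1):
    # Raw observations, jittered horizontally so spread and shape are visible.
    ax.scatter(k + rng.uniform(-0.15, 0.15, vals.size), vals, s=10, alpha=0.4)
    # Sample mean with an approximate 95% confidence interval for inference.
    mean, half = vals.mean(), 1.96 * vals.std(ddof=1) / np.sqrt(vals.size)
    ax.errorbar(k + 0.3, mean, yerr=half, fmt="o", color="black", capsize=3)
ax.boxplot(list(groups.values()), positions=range(1, len(groups) + 1),
           widths=0.5, showfliers=False)
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Total IgG titre (arbitrary units)")
plt.show()
```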
There is burgeoning interest in predicting road development because of the wide-ranging and important socioeconomic and environmental issues that roads present, including the close links between road development, deforestation and biodiversity loss. This is especially the case in developing nations rich in natural resources, where road development is rapid and often not centrally managed. Characterization of large-scale spatio-temporal patterns in road network development has been greatly overlooked to date. This paper examines the spatio-temporal dynamics of road density across the Brazilian Amazon and assesses the relative contributions of local versus neighbourhood effects to temporal changes in road density at regional scales. To achieve this, a combination of statistical analyses and model-data fusion techniques inspired by studies of the spatio-temporal dynamics of populations in ecology and epidemiology was used. The emergent development may be approximated by local growth that is logistic through time combined with directional dispersal. The current rates and dominant direction of development may be inferred by assuming that roads develop at a rate of 55 km per year. Large areas of the Amazon will be subject to extensive anthropogenic change should the observed patterns of road development continue.
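A schematic grid update (ours) shows the model structure being described: logistic local growth in road density plus dispersal biased in a dominant direction. Parameter values are arbitrary placeholders, not the paper's estimates.

```python
import numpy as np

def annual_step(density, r=0.1, K=1.0, d=0.05):
    """One annual update of road density on a grid: logistic local growth plus
    dispersal biased toward the east (schematic; parameters are placeholders)."""
    growth = r * density * (1 - density / K)
    from_west = np.roll(density, shift=1, axis=1)   # density arriving from the west
    dispersal = d * (from_west - density)
    return np.clip(density + growth + dispersal, 0.0, K)

density = np.zeros((50, 50))
density[25, 0] = 0.5            # seed of high road density on the western edge
for _ in range(30):
    density = annual_step(density)
print(density[25, :10].round(2))  # the front advances eastward along this row
```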