This chapter introduces probability. We begin with an informal definition which enables us to build intuition about the properties of probability. Then, we present a more rigorous definition, based on the mathematical framework of probability spaces. Next, we describe conditional probability, a concept that makes it possible to update probabilities when additional information is revealed. In our first encounter with statistics, we explain how to estimate probabilities and conditional probabilities from data, as illustrated by an analysis of votes in the United States Congress. Building upon the concept of conditional probability, we define independence and conditional independence, which are critical concepts in probabilistic modeling. The chapter ends with a surprising twist: In practice, probabilities are often impossible to compute analytically! Fortunately, the Monte Carlo method provides a pragmatic solution to this challenge, allowing us to approximate probabilities very accurately using computer simulations. We apply this method to the 3 × 3 basketball tournament from the 2020 Tokyo Olympics.
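The Monte Carlo idea the chapter closes with fits in a few lines. The sketch below is not the chapter's own code (the dice event is a hypothetical stand-in); it approximates a probability as the fraction of simulated trials in which the event occurs:

```python
import random

def monte_carlo_probability(event, n_trials=100_000):
    """Approximate P(event) as the fraction of simulated trials where it occurs."""
    hits = sum(event() for _ in range(n_trials))
    return hits / n_trials

# Hypothetical example: probability that two fair dice sum to at least 10.
# Exact answer: 6/36 = 1/6 ≈ 0.1667, so the estimate should land nearby.
def dice_sum_at_least_10():
    return random.randint(1, 6) + random.randint(1, 6) >= 10

print(monte_carlo_probability(dice_sum_at_least_10))  # ≈ 0.167
```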
This chapter provides a comprehensive overview of the foundational concepts essential for scalable Bayesian learning and Monte Carlo methods. It introduces Monte Carlo integration and its relevance to Bayesian statistics, focusing on techniques such as importance sampling and control variates. The chapter outlines key applications, including logistic regression, Bayesian matrix factorization, and Bayesian neural networks, which serve as illustrative examples throughout the book. It also offers a primer on Markov chains and stochastic differential equations, which are critical for understanding the advanced methods discussed in later chapters. Additionally, the chapter introduces kernel methods in preparation for their application in scalable Markov Chain Monte Carlo (MCMC) diagnostics.
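As a hedged illustration of importance sampling, one of the techniques this chapter introduces (all parameters below are illustrative, not from the book), the sketch estimates a rare-event probability under a standard normal by sampling from a shifted proposal and reweighting by the density ratio:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Estimate P(X > 4) for X ~ N(0, 1). Plain Monte Carlo almost never
# samples the tail; a proposal centered near the tail does, and the
# importance weights p(x)/q(x) correct for the mismatch.
n = 100_000
proposal_mean = 4.5
x = rng.normal(proposal_mean, 1.0, size=n)               # draws from proposal q
weights = norm.pdf(x) / norm.pdf(x, loc=proposal_mean)   # p(x) / q(x)
estimate = np.mean((x > 4) * weights)

print(estimate)           # ≈ 3.17e-5
print(1 - norm.cdf(4))    # exact tail probability, for comparison
```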
This chapter explores the pivotal role of modeling as a conduit between diverse data representations and applications in real, complex systems. The emphasis is on portraying modeling in terms of multivariate probabilities, laying the foundation for the probabilistic data-driven modeling framework.
This chapter provides an introduction to the main themes of the book and why this is a book about the misuse of language, just as much as the misuse of numbers. Statistics are never just numbers, for the numbers have to be labelled. Because politicians are so distrusted at present, people expect politicians to manipulate statistics. This opening chapter introduces readers to a number of excellent recent books about statistics, most of which have been addressed to non-specialist readers. The topic of statistics is a broad one and can sustain a variety of books with different slants. Unlike other books on statistics, this one looks directly at manipulation and how it occurs. A recurring theme of the book is that the political manipulation of statistics is not typically a single act: politicians will often manipulate their statisticians into manipulating the official statistics on their behalf. This opening chapter also comments on the book’s style of writing. The authors aim to write in a clear and non-technical way, and to give special emphasis to the ways that politicians manipulate language when manipulating numbers.
This chapter delves into the complexities and challenges of data science, emphasizing the potential pitfalls and ethical considerations inherent in decision-making based on data. It explores the intricate nature of data, which can be multifaceted, noisy, temporally and spatially disjointed, and often a result of the interplay among numerous interconnected components. This complexity poses significant difficulties in drawing causal inferences and making informed decisions.
A central theme of the chapter is the compromise of privacy that individuals may face in the quest for data-driven insights, which raises ethical concerns regarding the use of personal data. The discussion extends to the concept of algorithmic fairness, particularly in the context of racial bias, shedding light on the need for mitigating biases in data-driven decision-making processes.
Through a series of examples, the chapter illustrates the challenges and potential pitfalls associated with data science, underscoring the importance of robust methodologies and ethical considerations. It concludes with a thought-provoking examination of income inequality as a controversial example of data science in practice. The example highlights the nuanced interplay between data, decisions, and societal impacts.
In this chapter we start by reviewing the different types of inference procedures: frequentist, Bayesian, parametric and non-parametric. We introduce notation by providing a list of the probability distributions that will be used later on, together with their first two moments. We review some results on conditional moments and work through several examples. We review definitions of stochastic processes, stationary processes and Markov processes, and finish by introducing the most common discrete-time stochastic processes that show dependence in time and space.
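As a minimal illustration of a discrete-time process with temporal dependence, the sketch below (parameters chosen purely for illustration) simulates a stationary AR(1) process and checks its first two moments against the theoretical values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate x_t = phi * x_{t-1} + e_t and compare sample moments with
# theory: E[x_t] = 0 and Var(x_t) = sigma^2 / (1 - phi^2).
phi, sigma, T = 0.8, 1.0, 200_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)

print(x.mean())                           # ≈ 0
print(x.var(), sigma**2 / (1 - phi**2))   # both ≈ 2.78
```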
Limit theory is developed for least squares regression estimation of a model involving time trend polynomials and a moving average error process with a unit root. Models with these features can arise from data manipulation such as overdifferencing and model features such as the presence of multicointegration. The impact of such features on the asymptotic equivalence of least squares and generalized least squares is considered. Problems of rank deficiency that are induced asymptotically by the presence of time polynomials in the regression are also studied, focusing on the impact that singularities have on hypothesis testing using Wald statistics and matrix normalization. The chapter is largely pedagogical but contains new results, notational innovations, and procedures for dealing with rank deficiency that are useful in cases of wider applicability.
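The overdifferencing setup can be made concrete with a small simulation. In this hedged sketch (coefficients and sample size are hypothetical), least squares is fit to a linear time trend when the error is an overdifferenced MA(1) with a unit root:

```python
import numpy as np

rng = np.random.default_rng(1)

# Error u_t = e_t - e_{t-1} arises from overdifferencing white noise:
# an MA(1) with coefficient -1, i.e. a moving average unit root.
T = 1_000
t = np.arange(1, T + 1)
e = rng.normal(size=T + 1)
u = e[1:] - e[:-1]
y = 2.0 + 0.5 * t + u

X = np.column_stack([np.ones(T), t])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # ≈ [2.0, 0.5]
```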
This chapter formally defines a financial market and associated constructs, and lays the foundations for arbitrage pricing and dynamic replication (or hedging) through trading strategies.
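A one-period binomial model gives the simplest picture of dynamic replication and arbitrage pricing. The sketch below (all numbers illustrative, not taken from the chapter) replicates a call payoff with a stock position and a bank account, so that the cost of the replicating portfolio is the arbitrage-free price:

```python
# One-period binomial market: stock moves up or down, cash earns rate r.
s0, up, down, r, strike = 100.0, 1.2, 0.8, 0.05, 100.0

payoff_up = max(s0 * up - strike, 0.0)      # option payoff if stock rises
payoff_down = max(s0 * down - strike, 0.0)  # option payoff if stock falls

# Choose stock holding (delta) and cash so the portfolio matches the
# payoff in both states; its cost today is the arbitrage-free price.
delta = (payoff_up - payoff_down) / (s0 * (up - down))
cash = (payoff_up - delta * s0 * up) / (1 + r)
price = delta * s0 + cash

print(delta, cash, price)   # 0.5, ≈ -38.10, ≈ 11.90
```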
Observed choices are random in psychological experiments on perception and in economics experiments on choice. I discuss a number of possible explanations and introduce the random utility model.
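A minimal sketch of the random utility model follows (utilities and the noise distribution are chosen for illustration): each option's utility is a deterministic part plus i.i.d. Gumbel noise, the chooser picks the maximum, and the resulting choice frequencies match the logit (softmax) formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random utility: U_i = v_i + e_i with e_i i.i.d. Gumbel, choose argmax.
v = np.array([1.0, 0.5, 0.0])            # deterministic utilities
n = 200_000
noise = rng.gumbel(size=(n, v.size))
choices = np.argmax(v + noise, axis=1)

simulated = np.bincount(choices) / n
logit = np.exp(v) / np.exp(v).sum()
print(simulated, logit)   # both ≈ [0.51, 0.31, 0.19]
```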
Section 1.1 calls attention to the prevalent research practice that studies planning with incredible certitude. Section 1.2 contrasts the conceptions of uncertainty in consequentialist and axiomatic decision theory. Section 1.3 presents the formal structure of consequentialist theory, which is used throughout the book. Section 1.4 explains the prevalent econometric characterization of uncertainty, which distinguishes identification problems and statistical imprecision. Section 1.5 discusses the distinct perspectives on social welfare expressed in various strands of research on planning.
This chapter provides an overview on the use and validity of student samples in the behavioral and social sciences. In some instances, data collected from students can be of limited value or even inappropriate; however, in other cases, this approach provides useful data. I offer three general ways to evaluate the use of student samples. First, consider the research design. Descriptive studies that rely on students to draw inferences about the overall population are likely problematic. Second, statistical controls such as multivariate analyses that adjust for other factors may reduce some of the biases that may be introduced through sampling. Third, consider the theorized mechanism – a clear theoretical mechanism that does not vary based on the demographics of the sample allows us to put more faith in constrained samples. Despite these approaches, and regardless of our methods, statistics, and theoretical mechanism, we should be cautious with generalizability claims.
An empirical social networks study is concerned with what a well-defined social network is like, and whether and how it matters in some context of interest. Designing a successful one requires serious thinking on the front end about what the network is and what it does in theory. This book aims to help researchers do just that. To begin, this chapter motivates this research area with examples from political science, explains why the topic is unique enough to warrant a whole book, and offers guidance on how to know if your research should incorporate networks.
Chapter 1 discusses the motivation for the book and the rationale for its organization into four parts: preliminary considerations, evaluation for classification, evaluation in other settings, and evaluation from a practical perspective. In more detail, the first part provides the statistical tools necessary for evaluation and reviews the main machine learning principles as well as frequently used evaluation practices. The second part discusses the most common setting in which machine learning evaluation has been applied: classification. The third part extends the discussion to other paradigms such as multi-label classification, regression analysis, data stream mining, and unsupervised learning. The fourth part broadens the conversation by moving it from the laboratory setting to the practical setting, specifically discussing issues of robustness and responsible deployment.
This chapter provides an overview of the purpose of the book, namely to help the user of public opinion data develop a systematic analytical approach for understanding, predicting, and engaging public opinion. This includes helping the reader understand how public opinion can be employed as a decision-making input, meaning a factor, or variable, to assess, predict, or influence an outcome. The chapter outlines how information from different disciplines, including cognitive psychology, behavioral economics, and political science, comes together to inform the pollster’s work.
Network science has exploded in popularity since the late 1990s. But it flows from a long and rich tradition of mathematical and scientific understanding of complex systems. We can no longer imagine the world without evoking networks. And network data is at the heart of it. In this chapter, we set the stage by highlighting network science's ancestry and the exciting scientific approaches that networks have enabled, followed by a tour of the basic concepts and properties of networks.
We begin by illustrating the interplay between questions of scientific interest and the use of data in seeking answers. Graphs provide a window through which meaning can often be extracted from data. Numeric summary statistics and probability distributions provide a form of quantitative scaffolding for models of random as well as nonrandom variation. Simple regression models foreshadow the issues that arise in the more complex models considered later in the book. Frequentist and Bayesian approaches to statistical inference are contrasted, the latter primarily using the Bayes Factor to complement the limited perspective that p-values offer. Akaike Information Criterion (AIC) and related "information" statistics provide a further perspective. Resampling methods, where the one available dataset is used to provide an empirical substitute for a theoretical distribution, are introduced. Remaining topics are of a more general nature. RStudio is one of several tools that can help in organizing and managing work. The checks provided by independent replication at another time and place are an indispensable complement to statistical analysis. Questions of data quality, of relevance to the questions asked, of the processes that generated the data, and of generalization, remain just as important for machine learning and other new analysis approaches as for more classical methods.
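The resampling idea mentioned above can be shown in a few lines. In this hedged sketch (the dataset is simulated here as a stand-in for real observations), the bootstrap resamples the one available dataset with replacement to approximate the sampling distribution of a statistic, here the mean:

```python
import numpy as np

rng = np.random.default_rng(7)

# One observed dataset stands in for the unknown theoretical distribution;
# resampling it with replacement yields an empirical sampling distribution.
data = rng.exponential(scale=2.0, size=50)
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

print(data.mean(), boot_means.std())   # point estimate and its bootstrap SE
```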