To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Improving public policies, creating the next generation of AI systems, reducing crime, making hospitals more efficient, addressing climate change, controlling pandemics, and reducing disruption in supply chains are all problems where big picture ideas from analytics science have had large-scale impact. What are those ideas? Who came up with them? Will insights from analytics science help solve even more daunting societal challenges? This book takes readers on an engaging tour of the evolution of analytics science and how it brought together ideas and tools from many different fields – AI, machine learning, data science, OR, optimization, statistics, economics, and more – to make the world a better place. Using these ideas and tools, big picture insights emerge from simplified settings that get at the essence of a problem, leading to superior approaches to complex societal issues. A fascinating read for anyone interested in how problems can be solved by leveraging analytics.
We are entering a new phase in the information revolution driven by the introduction of innovative artificial intelligence (AI) technologies. From the rise of mass media, mass communications and the expansion of the internet, to mobile computing, social networks and Generative AI, this important and authoritative book outlines the key changes over the last thirty years that have led to this moment.
Drawing on established frameworks, theories, historical research and empirical evidence, this book argues that the current wave of AI-driven innovations represents a step-change in how organisations can extract value from data and that this will have significant implications for business innovation and how companies compete. Individual chapters explore (a) the history of the information industry and key milestones in artificial intelligence, (b) an overview of the data and AI landscape, (c) the opportunities and challenges of the AI revolution, (d) the ethical, policy and legal issues of data-driven AI, (e) and scenarios for where the data revolution is heading up to 2030.
In this chapter, we describe how to jointly model continuous quantities, by representing them as multiple continuous random variables within the same probability space. We define the joint cumulative distribution function and the joint probability density function and explain how to estimate the latter from data using a multivariate generalization of kernel density estimation. Next, we introduce marginal and conditional distributions of continuous variables and also discuss independence and conditional independence. Throughout, we model real-world temperature data as a running example. Then, we explain how to jointly simulate multiple random variables, in order to correctly account for the dependence between them. Finally, we define Gaussian random vectors which are the most popular multidimensional parametric model for continuous data, and apply them to model anthropometric data.
This chapter focuses on correlation, a key metric in data science that quantifies to what extent two quantities are linearly related. We begin by defining correlation between normalized and centered random variables. Then, we generalize the definition to all random variables and introduce the concept of covariance, which measures the average joint variation of two random variables. Next, we explain how to estimate correlation from data and analyze the correlation between the height of NBA players and different basketball stats.In addition, we study the connection between correlation and simple linear regression. We then discuss the differences between uncorrelation and independence. In order to gain better intuition about the properties of correlation, we provide a geometric interpretation of correlation, where the covariance is an inner product between random variables. Finally, we show that correlation does not imply causation, as illustrated by the spurious correlation between temperature and unemployment in Spain.
This chapter presents hypothesis testing which is used to evaluate whether the available data provide sufficient evidence to support a certain hypothesis. The main idea is to play devil's advocate and assume a null hypothesis, which contradicts our hypothesis of interest. We explain how to use parametric modeling to implement this idea, and define the p-value. We prove that thresholding the p-value controls the probability of false positives. In addition, we define the power of a test, which quantifies the test's ability to identify positive findings. Next, we show how to perform hypothesis testing without a parametric model, focusing on the permutation test. Then, we discuss multiple testing, a setting where many tests are performed simultaneously. Finally, we provide three reasons why hypothesis testing should not be used as the only stamp of approval for scientific discoveries. First, hypothesis testing does not necessarily identify causal effects; it is complementary to causal inference. Second, small p-values do not imply practical significance. Third, relying on p-values to validate findings produces a strong incentive to cherry-pick results.
This chapter explains how to estimate population parameters from data. We introduce random sampling, an approach that yields accurate estimates from limited data. We then define the bias and the standard error, which quantify the average error of an estimator and how much it varies, respectively. In addition, we derive deviation bounds and use them to prove the law of large numbers, which states that averaging many independent samples from a distribution yields an accurate estimate of its mean. An important consequence is that random sampling provides a precise estimate of means and proportions. However, we caution that this is not necessarily the case, if the data contain extreme values. Next, we discuss the central limit theorem (CLT), according to which averages of independent quantities tend to be Gaussian. We again provide a cautionary tale, warning that this does not hold in the absence of independence. Then, we explain how to use the CLT to build confidence intervals which quantify the uncertainty of estimates obtained from finite data. Finally, we introduce the bootstrap, a popular computational technique to estimate standard errors and build confidence intervals.
This chapter covers regression and classification, where the goal is to estimate a quantity of interest (the response) from observed features. In regression, the response is a numerical variable. In classification, it belongs to a finite set of predetermined classes. We begin with a comprehensive description of linear regression and discuss how to leverage it to perform causal inference. Then, we explain under what conditions linear models tend to overfit or to generalize robustly to held-out data. Motivated by the threat of overfitting, we introduce regularization and ridge regression, and discuss sparse regression, where the goal is to fit a linear model that only depends on a small subset of the available features. Then, we introduce two popular linear models for binary and multiclass classification: Logistic and softmax regression. At this point, we turn our attention to nonlinear models. First, we present regression and classification trees and explain how to combine them via bagging, random forests, and boosting. Second, we explain how to train neural networks to perform regression and classification. Finally, we discuss how to evaluate classification models.
This chapter describes how to model multiple discrete quantities as discrete random variables within the same probability space and manipulate them using their joint pmf. We explain how to estimate the joint pmf from data, and use it to model precipitation in Oregon. Then, we introduce marginal distributions, which describe the individual behavior of each variable in a model, and conditional distributions, which describe the behavior of a variable when other variables are fixed. Next, we generalize the concepts of independence and conditional independence to random variables. In addition, we discuss the problem of causal inference, which seeks to identify causal relationships between variables. We then turn our attention to a fundamental challenge: It is impossible to completely characterize the dependence between all variables in a model, unless they are very few. This phenomenon, known as the curse of dimensionality, is the reason why independence assumptions are needed to make probabilistic models tractable. We conclude the chapter by describing two popular models based on such assumptions: Naive Bayes and Markov chains.
This chapter discusses how to build probabilistic models that include both discrete and continuous variables. Mathematically, this is achieved by defining them as random variables within the same probability space. In practice, the variables are manipulated using their marginal and conditional distributions. We define the conditional pmf of a discrete random variable given a continuous variable, and the conditional probability density of a continuous random variable given a discrete variable. We use these objects to build mixture models and apply them to model height in a population. Next, we describe Gaussian discriminant analysis, a classification method based on mixture models with Gaussian conditional distributions, and apply it to diagnose Alzheimer's disease. Then, we explain how to perform clustering using Gaussian mixture models and leverage the approach to cluster NBA players. Finally, we introduce the framework of Bayesian statistics which enables us to explicitly encode our uncertainty about model parameters, and use it to analyze poll data from the 2020 United States presidential election.
This chapter introduces random variables and explains how to use them to model uncertain numerical quantities that are discrete. We first provide a mathematical definition of random variables, building upon the framework of probability spaces. Then, we explain how to manipulate discrete random variables in practice, using their probability mass function (pmf), and describe the main properties of the pmf. Motivated by an example where we analyze Kevin Durant's free-throw shooting, we define the empirical pmf, a nonparametric estimator of the pmf that does not make strong assumptions about the data. Next, we define several popular discrete parametric distributions (Bernoulli, binomial, geometric, and Poisson), which yield parametric estimators of the pmf, and explain how to fit them to data via maximum-likelihood estimation. We conclude the chapter by comparing the advantages and disadvantages of nonparametric and parametric models, illustrated by a real-data example, where we model the number of calls arriving at a call center.
This chapter begins by defining an averaging procedure for random variables, known as the mean. We show that the mean is linear, and also that the mean of the product of independent variables equals the product of their means. Then, we derive the mean of popular parametric distributions. Next, we caution that the mean can be severely distorted by extreme values, as illustrated by an analysis of NBA salaries. In addition, we define the mean square, which is the average squared value of a random variable, and the variance, which is the mean square deviation from the mean. We explain how to estimate the variance from data and use it to describe temperature variability at different geographic locations. Then, we define the conditional mean, a quantity that represents the average of a variable when other variables are fixed. We prove that the conditional mean is an optimal solution to the problem of regression, where the goal is to estimate a quantity of interest as a function of other variables. We end the chapter by studying how to estimate average causal effects.