This chapter focuses on using Python for statistical analysis in data science. It begins with statistics essentials, teaching how to calculate descriptive statistics like mean, median, variance, and standard deviation using NumPy. The chapter covers data visualization techniques using Matplotlib to create histograms, bar charts, and scatterplots for exploring data patterns. Key topics include importing data using Pandas DataFrames, performing correlation analysis to measure relationships between variables, and conducting statistical inference through hypothesis testing. Students learn to implement t-tests for comparing means between two groups and ANOVA for comparing multiple groups. The chapter emphasizes practical applications through hands-on examples, from analyzing family age data to comparing exam scores across different classes. These statistical techniques form the foundation for more advanced data science work, enabling students to extract meaningful insights from datasets and make data-driven decisions.
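As a minimal sketch of the descriptive statistics named above, the following uses Python's standard-library `statistics` module rather than NumPy so that it runs without extra packages. The `ages` list is made-up illustration data, not the chapter's family age dataset. Like `numpy.var` and `numpy.std` with their default `ddof=0`, `pvariance` and `pstdev` compute population values.

```python
from statistics import mean, median, pvariance, pstdev

# Hypothetical ages, chosen only so the arithmetic is easy to check by hand.
ages = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": mean(ages),          # arithmetic average
    "median": median(ages),      # middle value of the sorted data
    "variance": pvariance(ages), # population variance (divides by n)
    "std_dev": pstdev(ages),     # square root of the population variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.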
This chapter provides a comprehensive introduction to supervised learning techniques for classification problems. It begins with logistic regression for binary classification, explaining the sigmoid function and gradient ascent optimization. The chapter then covers softmax regression for multi-class problems, followed by k-nearest neighbors (kNN) as an intuitive distance-based classifier.
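The sigmoid function and gradient ascent mentioned here can be sketched as follows. The chapter's own examples are in R; this illustration is in Python, and the data, learning rate, and function names are assumptions made for the example. It fits a one-feature logistic regression by repeatedly moving the weights in the direction that increases the log-likelihood.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.05, epochs=2000):
    """Gradient ascent on the log-likelihood of a one-feature model."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Residuals y - p are the per-sample gradient of the log-likelihood.
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs))
        grad_b = sum(residuals)
        w += lr * grad_w  # ascent: step along the gradient
        b += lr * grad_b
    return w, b

# Hypothetical data: larger x values tend to belong to class 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.5 + b)   # predicted probability of class 1 at x = 0.5
p_high = sigmoid(w * 4.0 + b)  # predicted probability of class 1 at x = 4.0
print(f"w={w:.3f}, b={b:.3f}, p(0.5)={p_low:.3f}, p(4.0)={p_high:.3f}")
```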
Decision trees are explored in detail, including entropy, information gain, and the ID3 algorithm, along with derived decision rules and association rules. Random forests are presented as an ensemble method that addresses overfitting by combining multiple decision trees.
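Entropy and information gain, the quantities ID3 uses to choose a splitting attribute, can be illustrated with a short Python sketch. The toy weather-style rows and the function names are assumptions for illustration only and are not taken from the chapter, whose examples are in R.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
print(gain_outlook, gain_windy)
```

At each node ID3 splits on the attribute with the highest gain, which in this toy data is `outlook`.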
The chapter covers Naive Bayes classification, which applies Bayes’ theorem under the "naive" assumption that features are independent of one another given the class. Finally, Support Vector Machines (SVMs) are introduced for both linear and non-linear classification using maximum margin hyperplanes.
Each technique includes hands-on R programming examples with real datasets, practical applications, and exercises to reinforce learning concepts.
This chapter explores fundamental analytical techniques in data science, distinguishing between data analysis (backward-looking) and data analytics (forward-looking prediction).
Six key analysis categories are covered:
Descriptive Analysis examines current data through statistical measures (mean, median, mode) and visualizations to understand "what is happening."
Diagnostic Analytics investigates "why something happened" using correlation analysis, emphasizing the distinction between correlation and causation.
Predictive Analytics forecasts future outcomes using historical data and regression analysis.
Prescriptive Analytics determines optimal courses of action by analyzing potential decisions.
Exploratory Analysis discovers unknown relationships through visualization when questions aren’t predetermined.
Mechanistic Analysis examines exact variable changes and their effects.
The chapter emphasizes statistical literacy as essential for data scientists, covering key concepts like variable types, frequency distributions, measures of centrality and dispersion, and regression modeling. Hands-on examples demonstrate applications across business, healthcare, and social sciences.
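The correlation analysis used in diagnostic analytics can be sketched with a hand-written Pearson coefficient. This is an illustrative Python example with made-up numbers; the variable names are hypothetical and do not come from the chapter.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y rises in lockstep with x, z falls as x rises.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 8, 6, 4, 2]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
print(r_xy, r_xz)
```

A coefficient near +1 or -1 indicates a strong linear association, but on its own it says nothing about which variable, if either, causes the other.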
This chapter focuses on applying data science and machine learning techniques to real-world problems using Python. It covers four main applications: clinical data analysis, Reddit post collection and analysis, YouTube video statistics analysis, and large-scale processing of Yelp review data.
The chapter begins with exploring clinical data from a dermatology study, demonstrating visual exploration, gradient descent regression, random forest classification, and k-means clustering techniques. It then transitions to social media analysis, specifically working with Reddit APIs to collect and analyze posts, examining relationships between variables like post length, scores, and upvotes.
The YouTube section covers API authentication and data collection for video statistics analysis. Finally, the Yelp analysis demonstrates big data processing techniques, exploring user behavior patterns through correlation analysis, regression modeling, and clustering of review data.
The chapter emphasizes practical API usage, data visualization, statistical testing, and the importance of understanding both the problem and data before analysis.
This chapter explores the fundamentals of data in data science, covering data types (structured vs. unstructured), collection sources (open data, social media APIs, multimodal data, synthetic data), and storage formats (CSV, TSV, XML, RSS, JSON). It emphasizes the critical importance of data pre-processing, including data cleaning (handling missing values, smoothing noisy data, data munging), integration, transformation, reduction, and discretization. Through hands-on examples, the chapter demonstrates how to systematically prepare "dirty" real-world data for analysis by addressing inconsistencies, outliers, and missing information. The chapter highlights that data preparation is often half the battle in data science, requiring both technical skills and careful attention to data quality and bias.
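One of the cleaning steps listed above, handling missing values, can be sketched as mean imputation on a small CSV. This is a minimal Python illustration using only the standard library; the records, column names, and the helper names `load_rows` and `impute_mean` are hypothetical, and mean imputation is just one of several possible strategies.

```python
import csv
import io
from statistics import mean

# Hypothetical "dirty" CSV: Bob has no age and Cy has no income.
RAW = """name,age,income
Ann,34,52000
Bob,,48000
Cy,29,
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def impute_mean(rows, column):
    """Convert a column to float, filling blanks with the observed mean."""
    observed = [float(r[column]) for r in rows if r[column].strip()]
    fill = mean(observed)
    return [
        {**r, column: float(r[column]) if r[column].strip() else fill}
        for r in rows
    ]

rows = load_rows(RAW)
rows = impute_mean(rows, "age")
rows = impute_mean(rows, "income")

for r in rows:
    print(r)
```

Filling with the mean keeps every record but shrinks the column's variance, so whether it is appropriate depends on why the values are missing.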
This introductory chapter defines data science as a field focused on collecting, storing, and processing data to derive meaningful insights for decision-making. It explores data science applications across diverse sectors including finance, healthcare, politics, public policy, urban planning, education, and libraries. The chapter examines how data science relates to statistics, computer science, engineering, business analytics, and information science, while introducing computational thinking as a fundamental skill. It discusses the explosive growth of data (the 3Vs: velocity, volume, variety) and essential skills for data scientists, including statistical knowledge, programming abilities, and data literacy. The chapter concludes by addressing critical ethical concerns around privacy, bias, and fairness in data science practice.
This chapter focuses on data collection methods, analysis approaches, and evaluation techniques in data science. It covers various data collection methods including surveys (with different question types like multiple-choice, Likert scales, and open-ended questions), interviews, focus groups, diary studies, and user studies in lab and field settings.
The chapter distinguishes between quantitative methods (using numerical measurements and statistical analysis) and qualitative methods (observing behaviors, attitudes, and opinions through techniques like grounded theory and constant comparison). It also discusses mixed-method approaches that combine both methodologies.
For evaluation, the chapter explains model comparison metrics including precision, recall, F-measure, ROC curves, AIC, and BIC. It covers validation techniques like training-testing splits, A/B testing, and cross-validation methods. The chapter emphasizes that data science involves planning before data collection and evaluation after analysis, not just data processing.
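Precision, recall, and the F-measure can be computed directly from true and predicted labels. The following Python sketch uses made-up binary labels and a hypothetical helper name; it computes the balanced F1 variant of the F-measure.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```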
This chapter introduces cloud computing platforms essential for modern data science work. It covers three major cloud services: Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS). Students learn to create virtual machines, configure storage, and access cloud resources through SSH connections. The chapter demonstrates hands-on Python development using browser-based IDEs like Google Colab, Azure Machine Learning notebooks, and AWS Cloud9. Key topics include setting up accounts, managing costs through free tiers, and leveraging cloud resources for data science projects. The chapter also covers Hadoop for big data processing and discusses platform migration strategies. Practical exercises guide students through currency conversion programs, interactive calculations, and Olympic year predictions. The chapter emphasizes that cloud computing skills are now essential for data science professionals because cloud platforms provide scalable processing power and storage.
This chapter introduces machine learning as a subset of artificial intelligence that enables computers to learn from data and make predictions without explicit programming. It defines machine learning through Tom Mitchell’s formal framework and explores real-world applications like self-driving cars, optical character recognition, and recommendation systems. The chapter focuses on regression as a fundamental machine learning technique, covering both linear modeling approaches and gradient descent algorithms for parameter optimization. Through hands-on examples using R, students learn to implement linear regression and gradient descent from scratch, understanding how models minimize error functions to find optimal parameters. The chapter emphasizes practical application over theoretical derivations.
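Gradient descent for linear regression, as described above, can be sketched in a few lines. The chapter's examples use R; this illustration is in Python, and the data, learning rate, and epoch count are assumptions chosen so that the fit converges on a line whose true slope and intercept are known.

```python
def fit_line(xs, ys, lr=0.05, epochs=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        w -= lr * grad_w  # descent: step against the gradient
        b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

w, b = fit_line(xs, ys)
print(f"slope={w:.4f}, intercept={b:.4f}")
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more epochs, so in practice the step size is tuned to the scale of the data.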
This chapter introduces cloud computing platforms essential for modern data science work. It covers the three major providers: Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS).
Key topics include setting up virtual machines, configuring SSH access, and running RStudio Server in browser-based environments on each platform. The chapter demonstrates how to migrate data science workflows from local machines to cloud infrastructure, providing scalable computing resources and storage.
Practical examples show installing R and RStudio on cloud VMs, accessing them through web browsers, and managing costs. The chapter emphasizes that cloud computing skills are now essential for data science practitioners, offering dynamic scaling, redundancy, and pay-as-you-use pricing models for computational resources.