This chapter introduces unsupervised learning, where algorithms analyze data without predefined labels or target outcomes. It covers three main techniques, beginning with two clustering approaches: agglomerative clustering (a bottom-up approach that merges similar data points) and divisive clustering (a top-down approach, exemplified by the k-means algorithm, which partitions data into k groups by minimizing distances to centroids).
The third technique is the Expectation Maximization (EM) algorithm, which handles incomplete data and finds maximum likelihood parameters in statistical models. The chapter also includes a section on reinforcement learning, where agents learn optimal actions through trial-and-error interactions with environments to maximize rewards.
Key topics include distance matrices, dendrograms, cluster evaluation metrics (AIC, BIC), and practical applications. The chapter emphasizes the artistic nature of unsupervised learning, requiring careful design decisions about thresholds, cluster numbers, and technique selection. Hands-on R examples demonstrate each method using real datasets.
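As a rough illustration of the k-means idea summarized above (assign each point to its nearest centroid, then move each centroid to the mean of its points), here is a minimal Python sketch. It is not the chapter's own example: the function name `kmeans`, the sample points, and the starting centroids are all assumptions chosen for illustration.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Illustrative k-means; `centroids` are caller-chosen starting centers."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visually separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids and on the choice of k, which is one of the design decisions the chapter describes as part of the "art" of unsupervised learning.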
This chapter explores the fundamentals of data in data science, covering data types (structured vs. unstructured), collection sources (open data, social media APIs, multimodal data, synthetic data), and storage formats (CSV, TSV, XML, RSS, JSON). It emphasizes the critical importance of data pre-processing, including data cleaning (handling missing values, smoothing noisy data, data munging), integration, transformation, reduction, and discretization. Through hands-on examples, the chapter demonstrates how to systematically prepare "dirty" real-world data for analysis by addressing inconsistencies, outliers, and missing information. The chapter highlights that data preparation is often half the battle in data science, requiring both technical skills and careful attention to data quality and bias.
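One of the cleaning steps mentioned above, handling missing values, can be sketched with Python's standard library. This is only one possible strategy (filling blanks with the column mean), and the CSV content, column names, and helper names `load_rows` and `impute_mean` are invented for illustration.

```python
import csv
import io
from statistics import mean

# Hypothetical "dirty" CSV: two rows have a blank field.
RAW = """name,age,income
Ana,34,52000
Ben,,48000
Cy,29,
Dee,41,61000
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def impute_mean(rows, column):
    """Replace blank entries in `column` with the mean of the observed values."""
    observed = [float(r[column]) for r in rows if r[column].strip()]
    fill = mean(observed)
    for r in rows:
        r[column] = float(r[column]) if r[column].strip() else fill
    return rows

rows = load_rows(RAW)
impute_mean(rows, "age")
impute_mean(rows, "income")
for r in rows:
    print(r)
```

Mean imputation is simple but can hide bias when values are not missing at random, so whether it is appropriate depends on the dataset.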
This chapter introduces machine learning as a subset of artificial intelligence that enables computers to learn from data without explicit programming. It defines machine learning using Tom Mitchell’s formal framework and explores practical applications like self-driving cars, optical character recognition, and recommendation systems. The chapter focuses on regression as a fundamental machine learning technique, explaining linear regression for modeling relationships between variables. A key section covers gradient descent, an optimization algorithm that iteratively finds the best model parameters by minimizing error functions. Through hands-on Python examples, students learn to implement both linear regression and gradient descent algorithms, visualizing how models improve over iterations. The chapter emphasizes practical considerations for choosing appropriate algorithms, including accuracy, training time, linearity assumptions, and the number of parameters, preparing students for more advanced supervised and unsupervised learning techniques.
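To make the gradient descent description concrete, the following is a minimal sketch of fitting a line by repeatedly stepping against the gradient of the mean squared error. It is not taken from the chapter; the function name `fit_line`, the learning rate, the iteration count, and the toy data are assumptions.

```python
def fit_line(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = slope * x + intercept by batch gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every point under the current parameters.
        errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step downhill, scaled by the learning rate.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Toy data generated from y = 2x + 1, so the fit should approach slope 2, intercept 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more iterations; the values here were picked to suit this small dataset only.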
This chapter focuses on data collection methods, analysis approaches, and evaluation techniques in data science. It covers various data collection methods including surveys (with different question types like multiple-choice, Likert scales, and open-ended questions), interviews, focus groups, diary studies, and user studies in lab and field settings.
The chapter distinguishes between quantitative methods (using numerical measurements and statistical analysis) and qualitative methods (observing behaviors, attitudes, and opinions through techniques like grounded theory and constant comparison). It also discusses mixed-method approaches that combine both methodologies.
For evaluation, the chapter explains model comparison metrics including precision, recall, F-measure, ROC curves, AIC, and BIC. It covers validation techniques like training-testing splits, A/B testing, and cross-validation methods. The chapter emphasizes that data science involves pre-data collection planning and post-analysis evaluation, not just data processing.
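The first three metrics named above can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels and uses made-up predictions; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a count is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth and model predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here F1 is the F-measure with equal weight on precision and recall; other weightings exist.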
This chapter introduces Python as a powerful yet beginner-friendly programming language essential for data science. It covers getting access to Python through direct installation or integrated development environments like Anaconda and Spyder. The chapter teaches fundamental programming concepts including basic operations, data types, and key data structures (lists, tuples, dictionaries, sets, and DataFrames). Students learn to write control structures using if-else statements and while/for loops, create reusable functions, and make programs interactive through user input. The chapter also explains how to install and use Python packages, which extend the language’s capabilities for specialized tasks. Throughout, practical examples demonstrate concepts like leap year calculations, temperature categorization, and sales data analysis. The chapter emphasizes Python’s accessibility, extensive package ecosystem, and suitability for data science applications, positioning it as an ideal tool for solving computational and data analysis problems.
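The leap year calculation mentioned above is a compact way to see conditionals and functions together. The version below is a sketch using the Gregorian calendar rule, not necessarily the form used in the chapter; the name `is_leap_year` is an assumption.

```python
def is_leap_year(year):
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

for year in (1900, 2000, 2023, 2024):
    label = "leap year" if is_leap_year(year) else "not a leap year"
    print(year, label)
```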
This chapter covers unsupervised learning, where algorithms analyze data without known true labels or outcomes. Unlike supervised learning, the goal is to discover hidden patterns and structures in data.
The chapter explores three main techniques. Agglomerative clustering works bottom-up, starting with individual data points and merging similar ones into larger clusters. Divisive clustering (including k-means) takes a top-down approach, splitting data into smaller groups. Both methods use distance matrices and dendrograms to visualize cluster relationships.
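The bottom-up merging described above can be sketched in a few lines. This example assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several linkage choices; the function names and sample points are invented. The recorded merge list holds the information a dendrogram would display.

```python
import math
from itertools import combinations

def linkage_distance(cluster_a, cluster_b):
    """Single linkage: distance between the two closest members."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

def agglomerate(points, num_clusters):
    """Merge the two closest clusters repeatedly until `num_clusters` remain."""
    clusters = [[p] for p in points]  # start with every point on its own
    merges = []                       # (cluster, cluster, distance) per merge
    while len(clusters) > num_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j],
                       linkage_distance(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical 2-D points: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerate(points, num_clusters=3)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Stopping at a fixed number of clusters is one option; cutting at a distance threshold is another, and the choice is a design decision rather than something the algorithm determines.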
Expectation Maximization (EM) handles incomplete data by iteratively estimating missing parameters using maximum likelihood estimation. Model quality is assessed using AIC and BIC criteria.
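One common setting for EM is a mixture of two Gaussians, where the "missing" information is which component produced each observation. The sketch below alternates an E-step (soft assignments) with an M-step (maximum likelihood updates). It is an illustrative assumption, not the chapter's example: the function names, starting means, floor on the standard deviation, and data are all made up.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """EM for a 1-D mixture of two Gaussians, starting from guessed means."""
    std_a = std_b = 1.0
    weight_a = 0.5
    n = len(data)
    for _ in range(iterations):
        # E-step: probability that each point came from component A.
        resp_a = []
        for x in data:
            pa = weight_a * normal_pdf(x, mean_a, std_a)
            pb = (1 - weight_a) * normal_pdf(x, mean_b, std_b)
            resp_a.append(pa / (pa + pb))
        resp_b = [1 - r for r in resp_a]
        # M-step: re-estimate parameters from the weighted points.
        total_a, total_b = sum(resp_a), sum(resp_b)
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum(r * x for r, x in zip(resp_b, data)) / total_b
        var_a = sum(r * (x - mean_a) ** 2 for r, x in zip(resp_a, data)) / total_a
        var_b = sum(r * (x - mean_b) ** 2 for r, x in zip(resp_b, data)) / total_b
        std_a = max(math.sqrt(var_a), 1e-3)  # floor avoids a collapsing component
        std_b = max(math.sqrt(var_b), 1e-3)
        weight_a = total_a / n
    return (mean_a, std_a), (mean_b, std_b), weight_a

# Hypothetical measurements drawn from two groups centered near 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
component_a, component_b, weight_a = em_two_gaussians(data, mean_a=0.0, mean_b=6.0)
print(component_a, component_b, weight_a)
```

EM converges to a local optimum, so different starting values can lead to different fits; comparing fits with different numbers of components is where criteria such as AIC and BIC come in.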
The chapter also introduces reinforcement learning, where agents learn optimal actions through trial-and-error interactions with environments, receiving rewards or penalties. Applications include robotics, gaming, and autonomous systems. Throughout, the chapter emphasizes the creative, interpretive nature of unsupervised learning compared to more structured supervised approaches.
This preface introduces “A Hands-On Introduction to Data Science with Python,” designed for advanced undergraduates and graduate students across diverse fields including information science, business, psychology, and sociology. The book requires minimal programming experience but expects computational thinking and basic statistics knowledge. It’s structured in four parts: foundations of data science, tools and platforms (Python programming and cloud computing), machine learning techniques, and real-world applications. The second edition adds "DS in Practice" boxes, cloud computing coverage, AI/ethics discussions, and downloadable datasets. The book emphasizes practical, hands-on learning with 39 solved exercises, 40 try-it-yourself problems, and 57 end-of-chapter problems, making data science accessible to non-technical students.