This chapter starts with basic definitions: types of machine learning (supervised vs. unsupervised learning, classifiers vs. regressors), types of features (binary, categorical, discrete, continuous), metrics (precision, recall, F-measure, accuracy), overfitting, and raw data. It then defines the machine learning cycle and the feature engineering cycle. The feature engineering cycle hinges on two types of analysis: exploratory data analysis at the beginning of the cycle and error analysis at the end of each feature engineering cycle. Domain modelling and feature construction conclude the chapter, with particular emphasis on feature ideation techniques.
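As a minimal illustration of the metrics listed above (not taken from the book), the following sketch computes precision, recall, F-measure, and accuracy on made-up binary predictions using scikit-learn:

```python
# Minimal sketch: evaluation metrics on hypothetical binary predictions.
# The labels and predictions below are made up for illustration only.
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical classifier output

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```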
Building on the dataset presented in the previous chapter, this chapter explores using historical information (as represented by previous DBpedia versions) to perform feature engineering with historical features. Working with historical data not only makes all Feature Engineering more complex; the whole concept of “truth,” the immutability of the target class, is also challenged. Great features for a particular class have to become acceptable features for a different class. Topics covered include imputing timestamped data, lagged features, and moving-window averaging of the data. Because population data is unavailable for cities, a second dataset revolving around countries is introduced to perform population prediction using time series ARIMA models over 50 years of data, as provided by the World Bank. The chapter exemplifies different methods of blending machine learning with time series models, including using their output as another feature or training a model to predict their errors.
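A minimal sketch of the historical-feature techniques named here (imputing timestamped data, lagged features, moving-window averages), using pandas on a made-up yearly series; the actual dataset and column names in the chapter differ:

```python
# Minimal sketch: imputation, lagged features, and moving-window averages
# on a made-up yearly population series (hypothetical values and years).
import pandas as pd

df = pd.DataFrame(
    {"population_millions": [1.2, None, 1.5, 1.7, None, 2.1]},
    index=range(2010, 2016),  # hypothetical years
)
df["imputed"] = df["population_millions"].interpolate()        # fill gaps in the time series
df["lag_1"] = df["imputed"].shift(1)                           # value from the previous year
df["rolling_mean_3"] = df["imputed"].rolling(window=3).mean()  # 3-year moving-window average
print(df)
```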
This is the last chapter using the dataset from Chapter 6; this time the dataset is expanded with 80 thousand satellite images obtained from NASA, covering relative humidity density as it relates to vegetation. This image information lies outside the visible spectrum, and thus pre-trained models are not useful for this problem. Instead, traditional computer vision techniques are showcased, involving histograms, local feature extraction, gradients, and histograms of gradients. This domain exemplifies how to deal with high levels of nuisance noise and how to normalize your features so that small changes in the way the information was acquired do not completely throw off the underlying machine learning mechanism. These takeaways are valuable for anybody working with sensor data where the acquisition process has a high level of uncertainty. The domain also exemplifies working with a large volume of low-level sensor data.
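As a hedged illustration of a histogram-of-gradients feature with normalization (synthetic data, not the chapter's actual pipeline):

```python
# Minimal sketch: a histogram-of-gradients feature on a synthetic image tile,
# normalized so the feature is robust to overall intensity scaling.
import numpy as np

img = np.random.rand(64, 64)                # stand-in for a satellite image tile
gy, gx = np.gradient(img.astype(float))     # per-pixel gradients along rows and columns
magnitude = np.hypot(gx, gy)
orientation = np.arctan2(gy, gx)            # gradient direction in [-pi, pi]

# Histogram of gradient orientations, weighted by gradient magnitude.
hist, _ = np.histogram(orientation, bins=8, range=(-np.pi, np.pi), weights=magnitude)
feature = hist / (hist.sum() + 1e-9)        # normalize to reduce acquisition-dependent scale
print(feature)
```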
This chapter presents a staple of Feature Engineering: the automatic reduction of features, either by direct selection or by projection onto a smaller feature space. Central to Feature Engineering are efforts to reduce the number of features, as uninformative features bloat the ML model with unnecessary parameters. In turn, too many parameters either produce suboptimal results, as they are easy to overfit, or require large amounts of training data. These efforts either drop certain features explicitly (feature selection) or map the feature vector, if it is sparse, into a lower, denser dimension (dimensionality reduction). The chapter also covers some algorithms that perform feature selection as part of their inner computation (embedded feature selection or regularization). Feature selection takes the spotlight within Feature Engineering due to its intrinsic utility for Error Analysis. Some techniques, such as feature ablation using wrapper methods, are used as the starting step before a feature drill-down. Moreover, as feature selection helps build understandable models, it intertwines with Error Analysis, as the analysis profits from such understandable models.
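A minimal sketch contrasting feature selection, dimensionality reduction, and embedded selection via L1 regularization on synthetic data (illustrative parameters, not the chapter's exact setup):

```python
# Minimal sketch: feature selection (dropping features), dimensionality
# reduction (projecting to a denser space), and embedded selection via L1.
# The data below is synthetic; only the first two features are informative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                       # 50 features, most uninformative
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)   # keep the 5 best-scoring features
X_proj = PCA(n_components=5).fit_transform(X)             # project onto 5 components
print("selected shape:", X_sel.shape, "projected shape:", X_proj.shape)

# Embedded selection: L1 regularization drives uninformative coefficients to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```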
This chapter showcases three domains different from those in previous chapters, to highlight particular techniques and problems missing from them. Each domain uses a small dataset assembled just for the case study and undergoes only one featurization process. The first domain is video, where a mouse-pointer tracking problem on a video is studied to showcase two key issues in feature engineering for video: speed of processing and reusability of results from a previous feature extraction. The second domain is geographical information systems, where bird migration data is studied to showcase the value of key points as physical summaries of paths and bi-dimensional clustering of points using R-trees. Finally, preference data is studied via movie preferences to showcase large-scale imputation of missing features.
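As an illustrative sketch of imputing missing preference values (made-up ratings matrix, not the chapter's dataset):

```python
# Minimal sketch: filling missing movie preferences with per-movie means.
# Rows are users, columns are movies; NaN marks an unseen movie (made-up data).
import numpy as np
from sklearn.impute import SimpleImputer

ratings = np.array([[5.0, np.nan, 3.0],
                    [4.0, 2.0,    np.nan],
                    [np.nan, 1.0, 4.0]])

# Replace each missing rating with that movie's mean rating across users.
imputed = SimpleImputer(strategy="mean").fit_transform(ratings)
print(imputed)
```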