The purpose of this chapter is to set the stage for the book and for the upcoming chapters. We first overview classical information-theoretic problems and solutions. We then discuss emerging applications of information-theoretic methods in various data-science problems and, where applicable, refer the reader to related chapters in the book. Throughout this chapter, we highlight the perspectives, tools, and methods that play important roles in classic information-theoretic paradigms and in emerging areas of data science. Table 1.1 provides a summary of the different topics covered in this chapter and highlights the different chapters that can be read as a follow-up to these topics.
This introductory chapter lays out the key challenges of statistical inference in general and regression modeling in particular. We present a series of applied examples to show how complex and subtle regression can be, and why a book-length treatment is needed, not just on the mathematics of regression modeling but on how to apply and understand these methods.
To be considered good, a program needs to do what it is supposed to do. The next most important property is that it should be clearly understandable by a human reader, because that is necessary when you want to improve it in any way. More controversial is the question of whether a good program must be concise. As a student you will naturally want to get high marks. Finally, anyone who writes programs needs to be aware of ethics.
This chapter discusses Feature Engineering techniques that look holistically at the feature set, replacing or enhancing features based on their relation to the whole set of instances and features. It covers normalization, scaling, dealing with outliers, and generating descriptive features. Scaling and normalization are the most common; they involve finding the maximum and minimum and changing the values to ensure they lie in a given interval (e.g., [0, 1] or [−1, 1]). Discretization and binning involve, for example, analyzing an integer feature (any number from −1 trillion to +1 trillion) and realizing that it only takes the values 0, 1, and 10, so it can be simplified into a symbolic feature with three values (value0, value1, and value10). Descriptive features gather information about the shape of the data; the discussion centres on tables of counts (histograms) and general descriptive features such as maximum, minimum, and averages. Outlier detection and treatment refers to looking at the feature values across many instances and realizing that some values lie very far from the rest.
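The four techniques named in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's own code: the function names, the `max_symbols` cut-off, and the choice of a median-absolute-deviation rule for outliers are all hypothetical, since the abstract does not say which outlier method the chapter uses.

```python
from statistics import mean, median


def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so they lie in the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant feature has no spread; map everything to lo.
        return [lo for _ in values]
    span = vmax - vmin
    return [lo + (v - vmin) * (hi - lo) / span for v in values]


def discretize_if_few_values(values, max_symbols=5):
    """Turn a numeric feature into symbols when it takes few distinct values.

    max_symbols is an assumed threshold chosen for this sketch.
    """
    if len(set(values)) > max_symbols:
        return list(values)  # too many distinct values; leave numeric
    return [f"value{v}" for v in values]


def describe(values):
    """General descriptive features: minimum, maximum, and average."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def mad_outliers(values, k=3.0):
    """Flag values far from the rest.

    Assumption: 'far' means more than k median absolute deviations
    from the median; other rules (z-scores, IQR) are equally plausible.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure distance against
    return [v for v in values if abs(v - med) > k * mad]


feature = [0, 1, 10, 1, 0]
print(min_max_scale(feature))             # values now lie in [0, 1]
print(min_max_scale(feature, -1.0, 1.0))  # values now lie in [-1, 1]
print(discretize_if_few_values(feature))  # symbolic feature with three values
print(describe(feature))
print(mad_outliers([10, 11, 9, 10, 12, 95]))
```

The example feature mirrors the one in the abstract: an integer column that only takes the values 0, 1, and 10 becomes the symbols `value0`, `value1`, and `value10`.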
Viruses are amazing creatures. They are the most common, the most diverse, and the fastest-evolving biological entities on Earth. They infect every form of life known, “hijacking” the complex machinery of cells and forcing them into submission. Being much smaller and less complex than cells, they have a unique, tiny kit of “tools” able to regulate the essential elements of cells and to “fool” their defense mechanisms. It should be noted that viruses do not exhibit any of the life properties we usually attribute to cells (such as metabolism, development, or sensitivity) other than reproduction. What viruses practically “do” is to enter cells, their “hosts,” and use the cellular machinery to produce new virus particles. It is not surprising that many important discoveries in biology during the last 100 years have been made from, and through, viruses. Viruses have provided fundamental clues to the principles of molecular biology, such as how cells replicate and handle their information and the mechanisms that cause cancers, among many others.
In order to understand what an algorithm is, let’s begin by taking a trip back a few millennia in the past to imagine one of our distant ancestors who had seen his late grandmother bake bread and then tries it himself. He doesn’t really know what to do. He hesitates, starts by boiling grains of wheat in water, then realizes it might be a bad idea. He does what we all do when confronted with a problem that we don’t know how to resolve: we think of solutions, we try them out, we feel our way, counting on a touch of serendipity until we succeed, or not.
In 2012, an employee working on Bing, Microsoft’s search engine, suggested changing how ad headlines display (Kohavi and Thomke 2017). The idea was to lengthen the title line of ads by combining it with the text from the first line below the title, as shown in Figure 1.1.
Qiang Yang, Hong Kong University of Science and Technology; Yu Zhang, Hong Kong University of Science and Technology; Wenyuan Dai; Sinno Jialin Pan, Nanyang Technological University, Singapore