In this chapter, we introduce attacks/threats against machine learning. A primary aim of an attack is to cause the neural network to make errors. An attack may target the training dataset (its integrity or privacy), the training process (deep learning), or the parameters of the DNN once trained. Alternatively, an attack may target vulnerabilities by discovering test samples that produce erroneous output. The attacks include: (i) test-time evasion attacks (TTEs), which make subtle changes to a test pattern, causing the classifier’s decision to change; (ii) data poisoning attacks, which corrupt the training set to degrade the accuracy of the trained model; (iii) backdoor attacks, a special case of data poisoning in which a subtle (backdoor) pattern is embedded into some training samples, with their supervising labels altered, so that the classifier learns to misclassify to a target class when the backdoor pattern is present; (iv) reverse-engineering attacks, which query a classifier to learn its decision-making rule; and (v) membership inference attacks, which seek information about the training set from queries to the classifier. Defenses aim to detect attacks and/or to proactively improve the robustness of machine learning. An overview is given of the three main types of attacks (TTEs, data poisoning, and backdoors) investigated in subsequent chapters.
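To make the notion of a TTE concrete, the following is a minimal sketch in the style of the fast gradient sign method (FGSM), assuming a PyTorch classifier `model` that outputs logits, an input tensor `x` with pixel values in [0, 1], and its true label `y`; the function name and its parameters are illustrative, not taken from this book.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """Return x plus a small adversarial perturbation, bounded by eps in max-norm."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # loss of the true label on the current input
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid pixel range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```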
In this chapter, we focus on before/during training backdoor defense, where the defender is also the training authority, with control of the training process and responsibility for providing an accurate, backdoor-free DNN classifier. Deployment of a backdoor defense during training is supported by the fact that the training authority is usually more resourceful in both computation and storage than a downstream user of the trained classifier. Moreover, before/during training detection could be easier than post-training detection because the defender has access to the (possibly poisoned) training set and, thus, to samples that contain the backdoor pattern. However, before/during training detection is still highly challenging because it is unknown whether there is poisoning and, if so, which subset of samples (among many possible subsets) is poisoned. A detailed review of backdoor attacks (Trojans) is given, and an optimization-based reverse-engineering defense for training-set cleansing, deployed before/during classifier training, is described. The defense is designed to detect backdoor attacks on samples with a human-imperceptible backdoor pattern, as widely considered in existing attacks and defenses. Detection of training set poisoning is achieved by reverse engineering (estimating) the pattern of a putative backdoor attack, considering each class as the possible target class of an attack.
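As a rough illustration of reverse engineering a putative backdoor pattern for one candidate target class, the hedged PyTorch sketch below estimates a small additive pattern that pushes samples from the other classes toward that class; `model`, `loader` (batches drawn from the non-target classes), and the hyperparameters are placeholders rather than the book's actual defense. A putative target class for which an unusually small estimated pattern induces a high misclassification rate would then be treated as suspicious.

```python
import torch
import torch.nn.functional as F

def estimate_backdoor_pattern(model, loader, target_class, pattern_shape,
                              epochs=5, lr=0.1, lam=1e-2):
    """Estimate a small additive pattern that pushes non-target samples to target_class."""
    delta = torch.zeros(pattern_shape, requires_grad=True)   # candidate backdoor pattern
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:   # batches of training samples from classes other than target_class
            y_target = torch.full((x.size(0),), target_class, dtype=torch.long)
            # Encourage misclassification to the target class while keeping the pattern small.
            loss = F.cross_entropy(model(x + delta), y_target) + lam * delta.norm()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return delta.detach()
```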
Previous chapters exclusively considered attacks against classifiers. In this chapter, we devise a backdoor attack and defense for deep regression or prediction models. Such models may be used, for example, to predict housing prices in an area given measured features, to estimate a city’s power consumption on a given day, or to price financial derivatives (where they replace complex equation solvers and vastly improve the speed of inference). The developed attack is made most effective by surrounding poisoned samples (with their mis-supervised target values) with clean samples, in order to localize the attack and thus make it evasive to detection. The developed defense involves the use of a kind of query-by-synthesis active learning that trades off depth (local error maximizers) and breadth of search. Both the developed attack and defense are evaluated for an application domain that involves the pricing of a simple (single-barrier) financial option.
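The following toy NumPy sketch, an illustration under simplified assumptions rather than the construction developed in the chapter, shows the basic idea of a localized regression-poisoning attack: training points falling inside a small feature-space ball receive mis-supervised target values, while the surrounding points remain clean.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 4))               # clean feature vectors
y = X.sum(axis=1) + 0.01 * rng.standard_normal(1000)    # clean regression targets

center, radius, shift = np.full(4, 0.5), 0.25, 5.0      # attacker-chosen region and target offset
dist = np.linalg.norm(X - center, axis=1)
inside = dist < radius                                   # points to poison
nearby_clean = (dist >= radius) & (dist < 2 * radius)    # surrounding points left clean

y_poisoned = y.copy()
y_poisoned[inside] += shift                              # mis-supervised target values
print(f"poisoned {inside.sum()} samples, surrounded by {nearby_clean.sum()} clean samples")
```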
In this chapter we describe unsupervised post-training defenses that do not make explicit assumptions regarding the backdoor pattern or how it was incorporated into clean samples. These backdoor defenses aim to be “universal.” They do not produce an estimate of the backdoor pattern (which may be valuable information as the basis for detecting backdoor triggers at test time, the subject of Chapter 10). We start by describing a universal backdoor detector that does not require any clean labeled data. This approach optimizes over the input image to the DNN, seeking the input that yields the maximum margin (for each putative target class of an attack). The premise here, under a winner-take-all decision rule, is that backdoors produce much larger classifier margins than those of un-attacked examples. Then a universal backdoor mitigation strategy is described that does leverage a small clean dataset. This optimizes a threshold (tamping down unusually large ReLU activations) for each neuron in the network. In each backdoor attack scenario described, different detection and mitigation strategies are compared, where some mitigation strategies are also known as “unlearning” defenses. Some universal backdoor defenses modify or augment the DNN itself, while others do not.
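A hedged sketch of the max-margin detection idea follows, assuming a PyTorch classifier `model` that outputs logits and inputs scaled to [0, 1]: for each putative target class, the achievable classifier margin is maximized over the input, and a class with an outsized achievable margin is flagged. The function and its defaults are illustrative.

```python
import torch

def max_achievable_margin(model, target_class, input_shape, steps=300, lr=0.05):
    """Maximize, over the input, the margin of target_class over the best other class."""
    x = torch.rand((1,) + tuple(input_shape), requires_grad=True)   # start from a random image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = model(x.clamp(0.0, 1.0))
        mask = torch.ones_like(logits, dtype=torch.bool)
        mask[0, target_class] = False                                # exclude the target logit
        margin = logits[0, target_class] - logits[mask].max()
        opt.zero_grad()
        (-margin).backward()                                         # gradient ascent on the margin
        opt.step()
    return margin.item()   # compare across putative target classes; an outsized value is suspicious
```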
In this chapter we focus on post-training defense against backdoor data poisoning (Trojans). The defender has access to the trained DNN but not to the training set. The following are examples. (i) Proprietary: a customized DNN model purchased by a government or a company without data rights and without training set access. (ii) Legacy: the data is long forgotten or not maintained. (iii) Cell phone apps: the user has no access to the training set for the app classifier. It is also assumed that a clean labeled dataset (no backdoor poisoning) is available, with a small number of examples from each of the classes of the domain. This clean labeled dataset is insufficient for retraining, and its small size makes its availability a reasonable assumption. Reverse-engineering defenses (REDs) are described, including one that estimates putative backdoor patterns for each candidate (source class, target class) backdoor pair and then assesses an order-statistic p-value on the sizes of these perturbations. This is successful at detecting subtle backdoor patterns, including sparse patterns involving few pixels and global patterns where many pixels are modified subtly. A computationally efficient variant is presented. The method addresses additive backdoor embeddings as well as other embedding functions.
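To illustrate the detection-inference step, the sketch below computes an order-statistic p-value from the estimated perturbation sizes over all class pairs, using reciprocal sizes and an illustrative gamma null; the specific null density and statistic here are assumptions, not necessarily those used by the RED in this chapter.

```python
import numpy as np
from scipy import stats

def red_order_statistic_pvalue(pert_sizes):
    """pert_sizes: estimated perturbation size for every (source, target) class pair."""
    r = 1.0 / np.asarray(pert_sizes, dtype=float)     # small perturbation => large statistic
    r_max, r_rest = r.max(), np.sort(r)[:-1]          # hold out the most extreme statistic
    shape, loc, scale = stats.gamma.fit(r_rest, floc=0.0)
    p_single = stats.gamma.sf(r_max, shape, loc=loc, scale=scale)
    # Probability that the maximum of len(r) null draws would exceed the observed maximum.
    return 1.0 - (1.0 - p_single) ** len(r)

# Example: one class pair needed a far smaller perturbation than the rest => small p-value.
print(red_order_statistic_pvalue([2.1, 1.9, 2.3, 2.0, 1.8, 0.2]))
```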
In this chapter we focus on post-training detection of backdoor attacks that replace a patch of pixels with a common backdoor pattern. We focus on scene-plausible perceptible backdoor patterns. Scene-plausibility is important for a perceptible attack to be evasive to human and machine-based detection, whether the attack is physical or digital. Though the focus is on image classification, the methodology could be applied to audio, where for “scene-plausibility” the backdoor pattern should not sound artificial or incongruous amongst the other sounds in the audio clip. For the Neural Cleanse method, the common backdoor pattern may be scene plausible or incongruous. In the latter case, backdoor trigger images (at test time) might be noticed by a human, thus thwarting the attack. The focus here is on defending against patch attacks that are scene plausible, meaning that the backdoor pattern cannot in general be embedded into the same location in every (poisoned) image. For example, a rainbow (one of the attack patterns) must be embedded in the sky (and this location may vary). The main method described builds on RED. It exploits the need for scene-plausibility and for attack “durability,” that is, that the backdoor trigger remains effective in the presence of noise and occlusion.
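For concreteness, a patch embedding of this kind simply overwrites a block of pixels at a per-image location; the toy NumPy helper below assumes the scene-plausible location (e.g., a sky region for a rainbow) has already been chosen for each image, and its name and arguments are illustrative.

```python
import numpy as np

def embed_patch(image, patch, top, left):
    """Overwrite a block of an (H, W, C) image with an (h, w, C) patch at (top, left)."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w, :] = patch
    return out
```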
In this chapter we provide an introduction to deep learning. This includes introducing pattern recognition concepts, neural network architectures, basic optimization techniques (as used by gradient-based deep learning algorithms), and various deep learning paradigms, for example for coping with limited available labeled training data and for improving embedded deep feature representation power. Some of the former include semi-supervised learning, transfer learning, and contrastive learning. Some of the latter include mainstays of deep learning such as convolutional layers, pooling layers, ReLU activations, dropout layers, attention mechanisms, and transformers. Gated recurrent neural networks (such as LSTMs) are not discussed in depth because they are not used in subsequent chapters. Some topics introduced in this chapter, such as neural network inversion and robust classifier training strategies, will be revisited frequently in subsequent chapters, as they form the basis both for attacks against deep learning and for defenses against such attacks.
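As a small illustration of how several of these building blocks fit together, the following PyTorch sketch assembles convolutional, ReLU, pooling, dropout, and fully connected layers into a toy image classifier; the architecture assumes 32 x 32 RGB inputs and 10 classes and is illustrative, not one used later in the book.

```python
import torch.nn as nn

small_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),                                    # ReLU activation
    nn.MaxPool2d(2),                              # pooling layer (32x32 -> 16x16)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Dropout(0.5),                              # dropout layer
    nn.Linear(32 * 8 * 8, 10),                    # class logits
)
```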
In this chapter, we focus on post-training backdoor defense for classification problems involving only a few classes, particularly just two classes (K = 2), and involving arbitrary numbers of backdoor attacks, including different backdoor patterns with the same source and/or target classes. In Chapter 6, null models were estimated using (K – 1)² statistics. For K = 2, only one such statistic is available, which is insufficient for estimating a null density model. Thus, the detection inference approach cannot be directly applied in the two-class case. Other detection statistics, such as the median absolute deviation (MAD) statistic used by Neural Cleanse, are also unsuitable for the two-class case. The developed method relies on high transferability of putative backdoor patterns that are estimated sample-wise, that is, a perturbation specifically designed to cause one sample to be misclassified also induces other (neighboring) samples to be misclassified. Intriguingly, the proposed method works effectively with a common (theoretically derived) detection threshold, irrespective of the classification domain and the particular attack. This is significant, as it may be difficult to set the detection threshold for any method in practice. The proposed method can be applied for various attack embedding functions (additive, patch, multiplicative, etc.).
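The transferability idea can be sketched as follows, assuming a PyTorch classifier `model` and a handful of clean samples: a perturbation is optimized to flip the decision on a single source sample, and the fraction of other samples from the same class whose decisions also flip is used as the detection statistic. The helper and its hyperparameters are illustrative, not the chapter's exact estimator.

```python
import torch
import torch.nn.functional as F

def transfer_rate(model, x_src, y_src, x_same_class, steps=100, lr=0.05, lam=0.1):
    """Estimate a perturbation flipping one sample and measure how well it transfers."""
    delta = torch.zeros_like(x_src, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Push the single source sample away from its class, keeping the perturbation small.
        loss = -F.cross_entropy(model(x_src + delta), y_src) + lam * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        flipped = (model(x_same_class + delta).argmax(dim=1) != y_src).float().mean()
    return flipped.item()   # high transferability suggests a learned backdoor mapping
```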
Previous chapters considered detection of backdoors before/during training and post-training. Here, our objective is to detect use of a backdoor trigger operationally, that is, at test time. Such detection may prevent potentially catastrophic decisions, as well as potentially catch culprits in the act of exploiting a learned backdoor mapping. We also refer to such detection as “in-flight.” A likelihood-based backdoor trigger detector is developed and compared against other detectors.
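One generic way to build a likelihood-based in-flight detector, given here only as a hedged sketch and not necessarily the detector developed in this chapter, is to fit class-conditional Gaussian densities to internal-layer features of clean samples and flag test inputs whose feature log-likelihood under the predicted class is anomalously low.

```python
import numpy as np
from scipy import stats

def fit_class_gaussians(features, labels):
    """Fit one Gaussian per class to penultimate-layer features of clean samples."""
    models = {}
    for c in np.unique(labels):
        f = features[labels == c]
        cov = np.cov(f, rowvar=False) + 1e-3 * np.eye(f.shape[1])   # regularized covariance
        models[c] = stats.multivariate_normal(mean=f.mean(axis=0), cov=cov)
    return models

def in_flight_score(models, feature, predicted_class):
    # A low log-likelihood under the predicted class suggests a possible backdoor trigger.
    return models[predicted_class].logpdf(feature)
```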
Backdoor attacks have been considered in non-image data domains, including speech and audio, text, and regression applications (Chapter 12). In this chapter, we consider classification of point cloud data, for example, LiDAR data used by autonomous vehicles. Point cloud data differs significantly from images, with the former representing a given scene/object by a collection of points in 3D (or a higher-dimensional) space. Accordingly, point cloud DNN classifiers (such as PointNet) deviate significantly from the DNN architectures commonly used for image classification. So, backdoor (as well as test-time evasion) attacks also need to be customized to the nature of the (point cloud) data. Such attacks typically involve adding points, deleting points, or modifying (transforming) the points representing a given scene/object. While test-time evasion attacks against point cloud classifiers were previously proposed, in this chapter we develop backdoor attacks against point cloud classifiers (based on insertion of points designed to defeat the classifier, as well as to defeat anomaly detectors that identify point outliers and remove them). We also devise a post-training detector designed to defeat this attack, as well as other point cloud backdoor attacks.
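A toy NumPy sketch of the point-insertion idea follows: a small cluster of inserted points is placed close to an existing point of the cloud so that simple outlier-removal preprocessing is less likely to strip it out. The function and its parameters are illustrative, not the attack construction developed in the chapter.

```python
import numpy as np

def insert_backdoor_points(points, n_insert=16, jitter=0.02, seed=0):
    """points: (N, 3) array; return an (N + n_insert, 3) poisoned point cloud."""
    rng = np.random.default_rng(seed)
    anchor = points[rng.integers(len(points))]              # attach near an existing point
    cluster = anchor + jitter * rng.standard_normal((n_insert, 3))
    return np.vstack([points, cluster])                     # small, nearby cluster as the trigger
```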
In this chapter we describe reverse-engineering attacks (REAs) on classifiers and defenses against them. REAs involve querying (probing) a classifier to discover its decision rules. One primary application of REAs is to enable TTEs. Another is to reveal a private (e.g., proprietary) classifier’s decision-making. For example, an adversary may seek to discover the workings of a military automated target-recognition system. Early work demonstrates that, with a modest number of (random) queries, which do not rely on any knowledge of the nominal data distribution, one can learn a surrogate classifier on a given domain that closely mimics an unknown classifier. However, a critical weakness of this attack is that random querying makes the attack easily detectable – randomly selected query patterns will typically look nothing like legitimate examples. They are likely to be extreme outliers of all the classes. Each such query is thus individually highly suspicious, let alone thousands or millions of such queries (required for accurate reverse-engineering). However, more recent REAs, which are akin to active learning strategies, are stealthier. Here, we use the ADA method (developed in Chapter 4 for TTE detection) to detect REAs. This method is demonstrated to provide significant detection power against stealthy REAs.
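The basic query-based reverse-engineering loop can be sketched as follows, assuming PyTorch models for the victim and the surrogate; random query patterns are used here for simplicity, which, as noted above, makes the attack easy to detect, whereas the stealthier REAs considered in this chapter choose queries more carefully.

```python
import torch
import torch.nn.functional as F

def train_surrogate(victim, surrogate, input_shape, n_queries=2000, epochs=20, lr=1e-3):
    """Query the victim on random inputs and fit the surrogate to its decisions."""
    queries = torch.rand((n_queries,) + tuple(input_shape))     # random query patterns
    with torch.no_grad():
        labels = victim(queries).argmax(dim=1)                   # victim's predicted classes
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(surrogate(queries), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return surrogate
```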