Assurance monitoring of learning-enabled cyber-physical systems using inductive conformal prediction based on distance learning

Abstract
Machine learning components such as deep neural networks are used extensively in cyber-physical systems (CPS). However, such components may introduce new types of hazards that can have disastrous consequences and need to be addressed for engineering trustworthy systems. Although deep neural networks offer advanced capabilities, they must be complemented by engineering methods and practices that allow effective integration in CPS. In this paper, we propose an approach for assurance monitoring of learning-enabled CPS based on the conformal prediction framework. In order to allow real-time assurance monitoring, the approach employs distance learning to transform high-dimensional inputs into lower-dimensional embedding representations. By leveraging conformal prediction, the approach provides well-calibrated confidence and ensures a bounded small error rate while limiting the number of inputs for which an accurate prediction cannot be made. We demonstrate the approach using three datasets: a mobile robot following a wall, speaker recognition, and traffic sign recognition. The experimental results demonstrate that the error rates are well-calibrated while the number of alarms is very small. Furthermore, the method is computationally efficient and allows real-time assurance monitoring of CPS.


Introduction
Cyber-physical systems (CPS) can benefit from incorporating machine learning components that can handle the uncertainty and variability of the real world. Typical components such as deep neural networks (DNNs) can be used for performing various tasks such as the perception of the environment. In autonomous vehicles, for example, perception components aim at making sense of the surroundings, such as correctly recognizing traffic signs. However, such DNNs introduce new types of hazards that can have disastrous consequences and need to be addressed for engineering trustworthy systems. Although DNNs offer advanced capabilities, they must be complemented by engineering methods and practices that allow effective integration in CPS.
A DNN is designed using learning techniques that require specification of the task, a measure for evaluating how well the task is performed, and experience which typically includes training and testing data. Using the DNN during system operation presents challenges that must be addressed using innovative engineering methods. The perception of the environment is a functionality that is difficult to specify, and typically, specifications are based on examples. DNNs exhibit some nonzero error rate, the true error rate is unknown, and only an estimate from a design-time statistical process is known. Furthermore, DNNs encode information in a complex manner and it is hard to reason about the encoding. Nontransparency is an obstacle to monitoring because it is more difficult to have confidence that the model is operating as intended.
Our objective in this paper is to complement the prediction of DNNs with a computation of confidence that can be used for decision making. We consider DNNs used for classification in CPS. In addition to the class prediction, we compute set predictors with a given confidence using the conformal prediction framework (Balasubramanian et al., 2014). We focus on computationally efficient algorithms that can be used for real-time monitoring. An efficient and robust approach must ensure a small and well-calibrated error rate while limiting the number of alarms. This enables the design of monitors which can ensure a bounded small error rate while limiting the number of inputs for which an accurate prediction cannot be made.
The proposed approach is based on conformal prediction (CP; Vovk et al., 2005; Balasubramanian et al., 2014). CP aims at associating reliable measures of confidence with set predictions for problems that include classification and regression. An important feature of the CP framework is the calibration of the obtained confidence values in an online setting, which is very promising for real-time monitoring in CPS applications. These methods can be applied to a variety of machine learning algorithms that include DNNs. The main idea is to test if a new input example conforms to the training dataset by utilizing a nonconformity measure (NCM) which assigns a numerical score indicating how different the input example is from the training dataset. The next step is to define a p-value as the fraction of training examples whose nonconformity (NC) scores are greater than or equal to the NC score of the new input, which is then used for estimating the confidence of the prediction for the test input. In order to use the approach online, inductive conformal prediction (ICP) has been developed for computational efficiency (Papadopoulos et al., 2007; Balasubramanian et al., 2014). In ICP, the training dataset is split into the proper training dataset that is used for learning and a calibration dataset that is used to compute the predictions for given confidence levels. Existing methods rely on NCMs computed using techniques such as k-nearest neighbors (k-NN) and Kernel Density Estimation and do not scale for high-dimensional inputs in CPS.
DNNs have the ability to compute layers of representations of the input data which can then be used to distinguish between available classes (Hinton, 2007; Bengio, 2009). In our previous work, we developed an approach for mapping high-dimensional inputs into lower-dimensional representations to make the application of ICP possible for assurance monitoring of CPS in real time (Boursinos and Koutsoukos, 2020a). The approach utilizes the vector of the neuron activations in the penultimate layer of the DNN for a particular input. This low-dimensional representation can be used to compute NC scores efficiently for high-dimensional inputs. In problems where the input data are high-dimensional, such as the classification of traffic sign images in autonomous vehicles, ICP based on these learned embedding representations produces confident predictions. Moreover, the execution time and the required memory are significantly lower than using the original inputs, and the approach can be used for real-time assurance monitoring of the DNN. The use of low-dimensional learned embedding representations results in improved performance compared with ICP based on the original inputs. However, the underlying DNN is still trained to perform classification and does not necessarily learn optimal representations for computing NC scores.
The main challenge addressed in this paper is the efficient computation of embedding representations that allows assurance monitoring based on conformal prediction in real time. The novelty of the approach lies in using distance metric learning to generate representations of the input data and using the Euclidean distance as a measure of similarity. Unlike training a classifier, where each training input is assigned a ground truth label and the objective is to minimize a loss function so that the prediction of the classifier will be the same as the label, in distance metric learning the inputs are considered in pairs. The associated loss function is defined using pairwise constraints such that its minimization will make representations of inputs that belong to the same class be close to each other and representations of inputs belonging to different classes be far from each other. Preliminary results on using appropriate representations for a robotic navigation benchmark with low-dimensional inputs are presented in Boursinos and Koutsoukos (2020c).
The main contribution of the paper is the use of distance metric learning for assurance monitoring of learning-enabled CPS. The proposed approach based on ICP can be used in real time for high-dimensional data that are typically used in CPS. Different NC functions can be used in ICP to evaluate whether new unknown inputs are similar to the data that have been used for training a learning-enabled component such as a DNN. An NC function assigns a score to a labeled input reflecting how well it conforms to the training dataset. Because the choice of the NC function is very important, the proposed approach utilizes neural network architectures for distance metric learning based on siamese (Koch et al., 2015) and triplet networks (Hoffer and Ailon, 2015) to learn representations and define NC functions based on the Euclidean distance. Specifically, the proposed functions compute the NC scores of a new labeled input using (1) the labels of its closest neighbors, (2) how far the closest neighbor of the same class is compared with any other neighbor, and (3) how far the label's centroid is compared with the centroids of the other labels. The main benefit of the approach is that by utilizing distance metric learning in ICP, we reduce the computational requirements without sacrificing accuracy or efficiency.
An important advantage of the approach is that it allows the computation of the optimal significance level that can be used by the assurance monitor to ensure a bounded error rate while limiting the number of inputs for which an accurate prediction cannot be made. Unlike most common machine learning classifiers that assign a single label to an input, ICP computes a set of candidate labels that contains the correct class given a selected significance level. Small significance level values reduce the classification errors but may result in set predictors with multiple candidate labels. In autonomous systems, it is not only important to have predictions with well-calibrated confidence but also to be able to choose the desired significance level based on the application requirements. Even though reducing the number of possible classes may be helpful when the information is provided to a human, in an autonomous system, it is desirable that the prediction is unique. Therefore, we assume that set predictions that contain multiple classes lead to a rejection of the input and require human intervention. For this reason, it is desirable to minimize the number of test inputs with multiple predictions. If the prediction is unique, then the monitor ensures a confident prediction with a well-calibrated error rate defined by the significance level. If the predicted set contains multiple predictions, the monitor rejects the prediction and raises an alarm. Finally, if the predicted set is empty, the monitor indicates that no label is probable. We distinguish between multiple and no predictions, because they may lead to different actions in the system. For example, no prediction may be the result of out-of-distribution inputs while multiple possible predictions may be an indication that the significance level is smaller than the error rate of the underlying DNN.
The paper presents a comprehensive empirical evaluation of the approach using three datasets for classification problems in CPS of increasing complexity. The first dataset is the SCITOS-G5 robot navigation dataset (Dua and Graff, 2017) for which we use a fully connected feedforward network architecture. The second is a speech recognition dataset which contains audio files of human speech (Kiplagat, n.d.). For this problem, we learn the embedding representations using a DNN with 1D convolutional layers. The third dataset is the German Traffic Sign Recognition Benchmark (GTSRB; Stallkamp et al., 2012). For this dataset, we use a modified version of the VGG16 architecture (Simonyan and Zisserman, 2014) to learn and generate the embedding representations. We use different combinations of NC functions and distance metric learning architectures and compare them with ICP without distance metric learning. The results demonstrate that the selected or computed significance levels bound the error rate in all cases. Moreover, the representations learned by the siamese or triplet networks result in well-formed clusters for different classes, and individual training data typically can be captured by their class centroid. Such representations reduce the memory requirements and the execution time overhead while still ensuring a bounded small error rate with a limited number of prediction sets containing multiple candidate labels.
Related work on confidence estimation and well-calibrated models for different kinds of machine learning methods is presented in the "Related work" section. In the section "Problem formulation", we define the problem and present the proposed architecture. Sections "Distance learning", "ICP based on distance learning", and "Assurance monitoring" present the details of ICP based on distance learning and assurance monitoring. Finally, we evaluate the performance of our suggested approach on three different applications in the section "Evaluation".

Related work
Machine learning components tend to be poorly calibrated. Modern, commonly used DNN architectures typically have a softmax layer to produce a probability-like output for each class. The chosen class is the one with the highest probability; however, this generated probability measure is often higher than the actual posterior probability that the prediction is correct. Other factors that affect the calibration in DNNs are the depth, width, weight decay, and Batch Normalization (Guo et al., 2017). The estimation of accurate error-rate bounds is important as it provides assurance guarantees in safety-critical applications but also makes the decision confidence interpretable by humans. Several approaches have been proposed that compute well-calibrated confidence metrics in different ways, like scaling the DNN softmax outputs or other post-processing algorithms.
The calibration methods generally belong to two categories: parametric and nonparametric. The parametric methods assume that the probabilities follow certain well-known distributions whose parameters are to be estimated from the training data. Platt scaling (Platt, 1999) was proposed for the calibration of Support Vector Machine (SVM) outputs. After the training of an SVM, the method computes the parameters of a sigmoid function to map the outputs into probabilities. Piecewise logistic regression is an extension of Platt scaling and assumes that the log-odds of calibrated probabilities follow a piecewise linear function (Zhang and Yang, 2004). Another variant of Platt scaling is temperature scaling (Guo et al., 2017) which can be applied in DNNs with a softmax output layer. After training of a DNN, a temperature scaling factor T is computed on a validation set to scale the softmax outputs. However, while temperature scaling achieves good calibration when the data in the validation dataset are independent and identically distributed (IID), there is no calibration guarantee under distribution shifts (Ovadia et al., 2019). Experiments in Kumar et al. (2019) show that Platt scaling and temperature scaling are not as well calibrated as reported and that it is difficult to know how miscalibrated they are.
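To make the temperature scaling step concrete, the following Python sketch divides the logits by a factor T before the softmax and selects T by minimizing the negative log-likelihood on a validation set. The grid search over candidate temperatures and the function names `softmax`, `nll`, and `fit_temperature` are assumptions made for illustration; they stand in for whatever optimizer an actual implementation would use.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T (T > 1 softens the output)."""
    scaled = [z / T for z in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    losses = [-math.log(softmax(row, T)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL.

    A grid search is an illustrative simplification, not a prescribed method.
    """
    return min(candidates, key=lambda T: nll(logit_rows, labels, T))


# Hypothetical validation set: an overconfident model that is wrong once in four.
val_logits = [[10.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 5.0, 10.0])
calibrated = softmax([10.0, 0.0], T)
```

Because scaling all logits by the same T does not change their order, the predicted class is unchanged and only the reported confidence moves.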
Histogram binning or quantile binning is a commonly used nonparametric approach with either equal-width or equal-frequency bins. It divides the outputs of a classifier into bins and computes the calibrated probability as the ratio of correct classifications in each bin (Zadrozny and Elkan, 2001). Isotonic Regression is a generalization of histogram binning that jointly optimizes the bin boundaries and bin predictions (Zadrozny and Elkan, 2002). An extension of isotonic regression is a method called ensemble of near-isotonic regression (ENIR) that uses selective Bayesian averaging to ensemble the near-isotonic regression models (Naeini and Cooper, 2018). Adaptive calibration of predictions (ACP) also uses the ratio of correct classifications as the posterior probability in each bin, but it obtains bins from a 95% confidence interval around each individual prediction (Jiang et al., 2012). Estimating calibrated probability is a more significant issue in class imbalance and class overlap problems. Receiver Operating Characteristics (ROC) Binning uses the ROC curves to construct equal-width bins that provide accurate calibrated probabilities that are robust to changes in the prevalence of the positive class (Sun and Cho, 2018). Bayesian binning into quantiles (BBQ) extends the simple histogram-binning calibration method by considering multiple equal-frequency Histogram Binning models and their combination as the calibration result (Naeini et al., 2015).
Another framework developed to produce well-calibrated confidence values is CP (Vovk et al., 2005; Shafer and Vovk, 2008; Balasubramanian et al., 2014). The conformal prediction framework can be applied to produce calibrated confidence values with a variety of machine learning algorithms with slight modifications. Using CP together with machine learning models such as DNNs is computationally inefficient. In Papadopoulos et al. (2007), the authors suggest a modified version of the CP framework, ICP, that has less computational overhead, and they evaluate the results using DNNs as the underlying model. Deep k-nearest neighbors (DkNN) is an approach based on ICP for classification problems that uses the activations from all the hidden layers of a neural network as features (Papernot and McDaniel, 2018). The method is based on the assumption that when a DNN makes a wrong prediction, there is a specific hidden layer that generated intermediate results that lead to the wrong prediction. Taking into account all the hidden layers can lead to better interpretability of the predictions. In Johansson et al. (2013), the authors present an empirical investigation of decision trees as conformal predictors and analyze the effects of different split criteria, such as the Gini index and the entropy, on ICP. There are similar evaluations using ICP with random forests (Devetyarov and Nouretdinov, 2010; Bhattacharyya, 2013) as well as SVMs (Makili et al., 2011). The above methods are applied to datasets and show good results when the input data are IID. In Boursinos and Koutsoukos (2020b), we showed that ICP underperforms when the input data are sequential. Individual frames of a sequence might contain partial information regarding the input and more frames might be needed for ICP to reach a confident prediction. The performance of ICP in this case can be improved by designing a feedback-loop configuration that queries the sensors until a single confident decision can be reached.
Confidence bounds can also be generated for regression problems. In this case, instead of sets of multiple candidate labels, we have intervals around a point prediction that include the correct prediction with a desired confidence. There are ICP methods for regression problems with different underlying machine learning algorithms. In Papadopoulos et al. (2011), the authors use the k-nearest neighbors regression (k-NNR) as a predictor and evaluate the effects of different nonconformity functions. Random forests can also be used in regression problems. In Johansson et al. (2014), there is a comparison on the generated confidence bounds using k-NNR and DNNs (Papadopoulos and Haralambous, 2011). An alternative framework used to compute confidence bounds on regression problems is the Simultaneous Confidence Bands. The method presented in Sun and Loader (1994) generates linear confidence bounds centered around the point prediction of a regression model. In this approach, the model used for predictions has to be estimated by a sum of linear models. Models that satisfy this condition are the least squares polynomial models, kernel methods, and smoothing splines. Functional principal components (FPC) analysis can be used for the decomposition of an arbitrary regression model to a combination of linear models (Goldsmith et al., 2013).
The findings of the state-of-the-art methods described above illustrate the significance of computing well-calibrated and accurate confidence measures. Typically, the main objective is to complement existing machine learning models that are generally unable to produce an accurate estimation of confidence for their predictions with post-processing techniques in order to compute well-calibrated probabilities. An important advantage of such approaches is that they are independent of the underlying predictive machine learning models. Therefore, there is no need to redesign and optimize the objective functions used for training, which could lead to optimization tasks with high computational complexity.
Computing well-calibrated confidence is extremely important for designing autonomous systems because accurate measures of confidence are necessary to estimate the risk associated with each decision. The main limitation of existing methods comes from the fact that it is very difficult to select desired confidence values according to the application requirements and ensure a bounded error rate. This is especially important in autonomous CPS applications where decisions can be safety critical. Another important challenge is to investigate how the computed confidence measures can be used for decision making by autonomous systems and how to handle data for which a confident decision cannot be taken.
The proposed work based on ICP produces prediction sets and computes a significance level that will bound the expected error rate. Similar to existing methods, since the approach is based on ICP, it can be used with any machine learning component without the need for retraining. ICP methods provide very promising results especially when the input data are not very high-dimensional and there are no stringent time constraints. However, ICP can be impractical when the inputs are, for example, images because of the excessive memory requirements and high execution times. The proposed approach aims to learn appropriate lower-dimensional representations of high-dimensional inputs that make the task of computing confidence measures based on similarities much easier.

Problem formulation
A perception component in a CPS aims to observe and interpret the environment in order to provide information for decision making. For example, in autonomous vehicles, a DNN can be used to classify traffic signs. The problem is to complement the prediction of the DNN with a computation of confidence. An efficient and robust approach must ensure a small and well-calibrated error rate while limiting the number of alarms to enable real-time monitoring. That is, it must maximize the autonomous operation time while keeping the error rate bounded according to the application requirements. Finally, the computation of well-calibrated predictions must be computationally efficient for applications with high-dimensional inputs that require fast decisions as, for example, in autonomous vehicles.
During the system operation of a CPS, inputs arrive one by one. After receiving each input, the objective is to compute a valid measure of the confidence of the prediction. The objective is twofold: (1) provide guarantees for the error rate of the prediction and (2) design a monitor which limits the number of input examples for which a confident prediction cannot be made. Such a monitor can be used, for example, by generating warnings that require human intervention.
The conformal prediction framework allows computing set predictors for a given confidence expressed as a significance value (Balasubramanian et al., 2014). The confidence is generated by comparing how similar a test input is to the training data using different nonconformity functions. In our previous work (Boursinos and Koutsoukos, 2020c) we used DNNs to produce embedding representations for more efficient application of ICP. The additional problem we are solving is the computation of appropriate embedding representations that will lead to more confident decisions. The proposed approach is illustrated in Figure 1. The main idea is to use distance learning and enable DNNs to learn a lower-dimensional representation for each input on an embedding space where the Euclidean distance between the input representations is a measure of similarity between the original inputs themselves. The ICP approach is applied using the low-dimensional embedding representations and estimates the similarity between a new input and the available data in the training set using an NC function. Using such a representation not only reduces the execution time and the memory requirements but is also more efficient in producing useful predictions. Based on a chosen significance level, ICP generates a set of possible predictions. If the computed set contains a single prediction, the confidence is a well-calibrated and valid indication of the expected error. If the computed set contains multiple predictions or no predictions, an alarm can be raised to indicate the need for additional information.
In CPS, it is desirable to minimize the number of alarms while performing the required computations in real time. An evaluation of the method must be based on metrics that quantify the error rate, the number of alarms, and the computational efficiency. For real-time operation, the time and memory requirements of the monitoring approach must be similar to the computational requirements of the DNNs used in the CPS architecture. Figure 1 illustrates the proposed architecture for assurance monitoring. At design time, a DNN is trained to produce embedding representations using distance metric learning techniques. Then, NC scores are computed for a labeled calibration set that is not used for training of the DNN. During system operation, the assurance monitor employs the trained DNN to map new sensor inputs to lower-dimensional representations. Using the NC scores of the calibration data, the method produces prediction sets including well-calibrated confidence of the predictions. Ideally, a prediction set should include exactly one class to enable decision making. Alarms can be raised if the prediction set either includes multiple possible classes or does not contain any.
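The decision rule of the monitor can be sketched in a few lines of Python. The function name `monitor` and the string outcomes are hypothetical names chosen for illustration. The sketch only encodes the three cases described above: a single candidate is accepted, while multiple candidates or an empty set raise distinct alarms.

```python
def monitor(prediction_set):
    """Map an ICP prediction set to a monitor decision.

    Returns a (decision, label) pair; label is None whenever an alarm is raised.
    """
    if len(prediction_set) == 1:
        (label,) = prediction_set  # unique candidate: confident prediction
        return ("accept", label)
    if len(prediction_set) > 1:
        # Several labels remain plausible at the chosen significance level.
        return ("alarm_multiple", None)
    # No label is plausible, for example for an out-of-distribution input.
    return ("alarm_empty", None)


decisions = [monitor(s) for s in ({"stop"}, {"stop", "yield"}, set())]
```

Keeping the two alarm types separate lets the surrounding system react differently to ambiguous inputs and to inputs for which no label is probable.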

Distance learning
The ICP framework requires computing the similarity between the training data and a test input. This can be done efficiently by learning representations of the inputs for which the Euclidean distance is a metric of similarity, meaning that similar inputs will be close to each other as illustrated in Figure 2. There are different approaches based on DNN architectures that generate embedding representations for distance metric learning.
A siamese network is composed of two copies of the same neural network with shared parameters (Koch et al., 2015) as shown in Figure 3a. During training, each identical copy of the siamese network is fed with different training samples $x_1$ and $x_2$ belonging to classes $y_1$ and $y_2$. The embedding representations produced by each network copy are $r_1 = \mathrm{Net}(x_1)$ and $r_2 = \mathrm{Net}(x_2)$. The learning goal is to minimize the Euclidean distance between the embedding representations of inputs belonging to the same class and maximize it for inputs belonging to different classes as described below:
$$\begin{cases} \min d(r_1, r_2), & \text{if } y_1 = y_2 \\ \max d(r_1, r_2), & \text{if } y_1 \neq y_2 \end{cases} \tag{1}$$
This optimization problem can be solved using the contrastive loss function (Melekhov et al., 2016):
$$L(r_1, r_2, y) = (1 - y)\, d(r_1, r_2) + y \max\bigl(0,\; m - d(r_1, r_2)\bigr)$$
where $y$ is a binary flag equal to 0 if $y_1 = y_2$ and to 1 if $y_1 \neq y_2$, and $m$ is a margin parameter. In particular, when $y_1 \neq y_2$, $L = 0$ if $d(r_1, r_2) \geq m$; otherwise the parameters of the network are updated to produce more distant representations for those two elements. The reason behind the use of the margin is that when the distance between a pair of inputs from different classes is already at least $m$, there is no reason to update the network to push their representations even further apart.
Another architecture trained to produce embedding representations for distance learning is the triplet network (Hoffer and Ailon, 2015). A triplet network is composed of three copies of the same neural network with shared parameters as shown in Figure 3b. The training examples consist of three samples: the anchor sample $x$, the positive sample $x^+$, and the negative sample $x^-$. The samples $x$ and $x^+$ belong to the same class, while $x^-$ belongs to a different class. The embedding representations produced by each network copy are $r = \mathrm{Net}(x)$, $r^+ = \mathrm{Net}(x^+)$, and $r^- = \mathrm{Net}(x^-)$. The optimization problem described by Eq. (1) is solved by training the triplet network copies using the triplet loss function:
$$L(r, r^+, r^-) = \max\bigl(d(r, r^+) - d(r, r^-) + m,\; 0\bigr)$$
The margin parameter $m$ caps the separation that the loss enforces: the network parameters are not updated to push a pair even further apart when a positive sample is already at least $m$ closer to the anchor than a negative sample. Instead, the training is more efficient when harder triplets are used. The input triplets to the network copies can be sampled randomly from the training data. However, as training progresses, it becomes harder to randomly find triplets with $L(r, r^+, r^-) > 0$ that will update the triplet network parameters. This leads to slow training and underfitted models. The training can be improved by carefully mining the training data that produce a large loss (Xuan et al., 2019). For each training iteration, first, the anchor training data are randomly chosen. For each anchor, the hardest positive sample is chosen, meaning the sample from the same class that is located furthest away from the anchor. Then, the triplets are formed by mining hard negative samples that satisfy $d(r, r^-) < d(r, r^+)$ or semi-hard negatives that satisfy $d(r, r^-) < d(r, r^+) + m$. This way, the formed triplet batches will produce gradients to update the shared weights between the DNN copies.
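The two loss functions and the negative-mining conditions can be illustrated with a short Python sketch operating on already computed embeddings. It assumes the commonly used forms of the losses: an unsquared contrastive loss with flag y = 0 for same-class pairs and y = 1 for different-class pairs, and a triplet loss max(d(r, r+) - d(r, r-) + m, 0). The function names and the "easy" category are hypothetical, and a real training loop would compute these quantities inside a deep learning framework so that gradients can flow.

```python
import math


def contrastive_loss(r1, r2, y, m=1.0):
    """Contrastive loss for one pair; y = 0 if same class, 1 if different."""
    d = math.dist(r1, r2)
    # Same class: penalize distance. Different class: penalize only if d < m.
    return (1 - y) * d + y * max(0.0, m - d)


def triplet_loss(r, r_pos, r_neg, m=1.0):
    """Triplet loss: zero once the positive is at least m closer than the negative."""
    return max(math.dist(r, r_pos) - math.dist(r, r_neg) + m, 0.0)


def negative_type(r, r_pos, r_neg, m=1.0):
    """Classify a negative sample using the mining conditions from the text."""
    d_pos, d_neg = math.dist(r, r_pos), math.dist(r, r_neg)
    if d_neg < d_pos:
        return "hard"        # negative is closer to the anchor than the positive
    if d_neg < d_pos + m:
        return "semi-hard"   # negative is farther, but still within the margin
    return "easy"            # zero loss, so it contributes no gradient


anchor, positive = [0.0, 0.0], [1.0, 0.0]
loss_far = triplet_loss(anchor, positive, [3.0, 0.0])   # negative well separated
loss_near = triplet_loss(anchor, positive, [1.5, 0.0])  # negative inside the margin
```

Only triplets labeled "hard" or "semi-hard" yield a positive loss, which is why mining them keeps the gradients informative as training progresses.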

ICP based on distance learning
We consider a training set $\{z_1, \ldots, z_l\}$ of examples, where each $z_i \in Z$ is a pair $(x_i, y_i)$ with $x_i$ the feature vector and $y_i$ the label of that example. For a given unlabeled input $x_{l+1}$ and a chosen significance level $\epsilon$, the task is to compute a prediction set $\Gamma^{\epsilon}$ for which $P(y_{l+1} \notin \Gamma^{\epsilon}) < \epsilon$, where $y_{l+1}$ is the ground truth label of the input $x_{l+1}$. ICP computes well-calibrated prediction sets with the underlying assumption that all examples $(x_i, y_i)$, $i = 1, 2, \ldots$ are IID generated from the same but typically unknown probability distribution.
Central to the application of ICP is a nonconformity function or NCM which shows how different a labeled input is from the examples in the training set. For a given test example $z_i$ with candidate label $\tilde{y}_i$, an NC function assigns a numerical score indicating how different the example $z_i$ is from the examples in $\{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n\}$. There are many possible NC functions that can be used (Vovk et al., 2005; Shafer and Vovk, 2008; Balasubramanian et al., 2014; Johansson et al., 2017; Boursinos and Koutsoukos, 2020a). For example, an NC function can be defined as the number of the k nearest neighbors of $z_{l+1}$ in the training set whose label differs from the candidate label $\tilde{y}_{l+1}$ (k-NN NCM). The input space is often high-dimensional, which makes storing the whole training set impractical and the computation of the NC scores inefficient. To address this challenge, the proposed approach leverages distance metric learning methods to learn representations that enable applying ICP in real time.
Nonconformity functions that can be defined in the embedding space learned by siamese and triplet networks are (1) the k-NN (Papernot and McDaniel, 2018), (2) the one Nearest Neighbor (1-NN; Vovk et al., 2005), and (3) the Nearest Centroid (Balasubramanian et al., 2014). The k-NN NCM finds the k most similar examples of a test input $x$ in the training data and counts how many of those are labeled differently from the candidate label $y$. We denote by $f : X \to V$ the mapping from the input space $X$ to the embedding space $V$ defined by either a siamese or a triplet network. Using the trained neural network, the encodings $v_i = f(x_i)$ are computed and stored for all the training data $x_i$. Given a test input $x$ with encoding $v = f(x)$, we compute the k-NN in $V$ and store their labels in a multi-set $\Omega$. The k-NN NCM of input $x$ with a candidate label $y$ is defined as
$$\alpha(x, y) = |\{i \in \Omega : i \neq y\}|.$$
The 1-NN NCM requires finding the most similar example of a test input $x$ in the training set that is labeled the same as the candidate label $y$ as well as the most similar example in the training set that belongs to any class other than $y$, and is defined as
$$\alpha(x, y) = \frac{\min_{i:\, y_i = y} d(v, v_i)}{\min_{i:\, y_i \neq y} d(v, v_i)}$$
where $v = f(x)$ and $d$ is the Euclidean distance metric in the $V$ space.
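A minimal Python sketch of the k-NN and 1-NN nonconformity measures on stored embeddings follows. It assumes the k-NN score counts the neighbors whose label differs from the candidate label, and the 1-NN score is the ratio of the distance to the nearest same-class example over the distance to the nearest example of any other class. The function names and the toy data are hypothetical, and the brute-force neighbor search is for illustration only.

```python
import math


def knn_ncm(v, y, train_embeddings, train_labels, k=3):
    """Number of the k nearest training embeddings whose label differs from y."""
    order = sorted(range(len(train_embeddings)),
                   key=lambda i: math.dist(v, train_embeddings[i]))
    omega = [train_labels[i] for i in order[:k]]  # labels of the k neighbors
    return sum(1 for label in omega if label != y)


def one_nn_ncm(v, y, train_embeddings, train_labels):
    """Nearest same-class distance divided by nearest other-class distance."""
    pairs = list(zip(train_embeddings, train_labels))
    same = min((math.dist(v, e) for e, lab in pairs if lab == y), default=math.inf)
    other = min((math.dist(v, e) for e, lab in pairs if lab != y), default=math.inf)
    if other == 0.0:
        return math.inf  # test point coincides with an example of another class
    return same / other


# Hypothetical 2D embeddings of two well-separated classes.
embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]]
labels = ["a", "a", "b", "b"]
v = [0.5, 0.0]
scores = {y: (knn_ncm(v, y, embeddings, labels), one_nn_ncm(v, y, embeddings, labels))
          for y in ("a", "b")}
```

For the test embedding near class "a", both measures give a low score to the candidate "a" and a high score to the candidate "b", which is the behavior ICP relies on.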
The nearest centroid NCM avoids searching for individual training examples that are similar to a test input, which is costly when there is a large amount of training data. We expect examples that belong to a particular class to be close to each other in the embedding space, so for each class $y_i$, we compute its centroid
$$\mu_{y_i} = \frac{\sum_{j=1}^{n_i} v_j^i}{n_i},$$
where $v_j^i$ is the embedding representation of the jth training example from class $y_i$ and $n_i$ is the number of training examples in class $y_i$. The NC function is then defined as
$$\alpha(x, y) = \frac{d(\mu_y, v)}{\min_{i = 1, \ldots, n : y_i \neq y} d(\mu_{y_i}, v)},$$
where $v = f(x)$. It should be noted that for computing the nearest centroid NCM, we need to store only the centroid for each class.
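A minimal sketch of the nearest centroid NCM follows. The helper names `class_centroids` and `nearest_centroid_ncm` are hypothetical, and the score is assumed to be the distance to the candidate class centroid divided by the distance to the nearest centroid of any other class.

```python
import math


def class_centroids(embeddings, labels):
    """Mean embedding of each class, computed once at design time."""
    sums, counts = {}, {}
    for v, y in zip(embeddings, labels):
        if y not in sums:
            sums[y] = [0.0] * len(v)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], v)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def nearest_centroid_ncm(v, candidate, centroids):
    """Distance to the candidate centroid relative to the closest other centroid."""
    own = math.dist(v, centroids[candidate])
    other = min(math.dist(v, c) for y, c in centroids.items() if y != candidate)
    # Zero-denominator guard is a choice of this sketch.
    return own / other if other > 0 else math.inf
```

Only the dictionary returned by `class_centroids` has to be kept at runtime, which is the source of the memory savings discussed in the text.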
The NC score is an indication of how uncommon a test input is compared with the training data. Input data that come from the same distribution as the training data will produce low NC scores and are expected to lead to more confident classifications, while unusual inputs will have higher NC scores. However, this measure does not provide clear confidence information by itself; it becomes informative when compared with the NC scores computed on a calibration set of labeled data. Consider the training set $\{z_1, \ldots, z_l\}$. This set is split into two parts: the proper training set $\{z_1, \ldots, z_m\}$ of size m < l, which is also used for training the siamese or triplet network, and the calibration set $\{z_{m+1}, \ldots, z_l\}$ of size l − m. The NC scores $\alpha(x_i, y_i)$, i = m + 1, …, l, of the examples in the calibration set are computed and stored before applying the online monitoring algorithm. Given a test input x with an unknown label y, the method generates a set $\Gamma^{\epsilon}$ of possible labels $\tilde{y}$ so that $P(y \notin \Gamma^{\epsilon}) < \epsilon$. For all the candidate labels $\tilde{y}$, ICP computes the empirical p-value defined as
$$p_j(x) = \frac{|\{\alpha \in A : \alpha \geq \alpha(x, j)\}|}{|A|},$$
where A is the set of calibration NC scores. The p-value is the fraction of NC scores of the calibration data that are equal to or larger than the NC score of the test input with candidate label j. A candidate label is added to $\Gamma^{\epsilon}$ if $p_j(x) > \epsilon$. It is shown in Balasubramanian et al. (2014) that the prediction sets computed by ICP are valid, that is, the probability of error will not exceed ϵ for any ϵ ∈ [0, 1] and for any choice of NC function. Our approach focuses on computing small prediction sets efficiently so that assurance monitoring is possible in real time.
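The p-value computation and the construction of the prediction set can be illustrated as follows. This is a sketch under the assumption that the calibration NC scores are available as a plain list and that the NC score of the test input has already been computed for every candidate label; the names `p_value` and `prediction_set` are hypothetical.

```python
def p_value(calibration_scores, test_score):
    """Fraction of calibration NC scores that are equal to or larger than the test score."""
    count = sum(1 for s in calibration_scores if s >= test_score)
    return count / len(calibration_scores)


def prediction_set(calibration_scores, test_scores_by_label, epsilon):
    """Labels whose p-value exceeds the significance level epsilon."""
    return {
        label
        for label, score in test_scores_by_label.items()
        if p_value(calibration_scores, score) > epsilon
    }
```

A label with an unusually high NC score receives a small p-value and drops out of the set as ϵ increases.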

Assurance monitoring
In CPS, it is not only important to have predictions with well-calibrated confidence but also to be able to choose the desired significance level based on the application requirements. ICP computes a prediction set $\Gamma^{\epsilon}$ with a chosen significance level ϵ, and $\Gamma^{\epsilon}$ may include any subset of all possible classes. Even though reducing the number of possible classes may be helpful when the information is provided to a human, in an autonomous system it is desirable that the prediction is unique, that is, $|\Gamma^{\epsilon}| = 1$. Therefore, we assume that set predictions that contain multiple classes, that is, $|\Gamma^{\epsilon}| > 1$, lead to a rejection of the input and require human intervention. For this reason, it is desirable to minimize the number of test inputs with multiple predictions, and we define a monitor with output
$$out = \begin{cases} 0, & \text{if } |\Gamma^{\epsilon}| = 0 \\ 1, & \text{if } |\Gamma^{\epsilon}| = 1 \\ \text{reject}, & \text{if } |\Gamma^{\epsilon}| > 1 \end{cases}$$
If the set $\Gamma^{\epsilon}$ contains a single prediction, the monitor outputs out = 1 to indicate a confident prediction with well-calibrated error rate ϵ. If the predicted set contains multiple predictions, the monitor rejects the prediction and raises an alarm. Finally, if the predicted set is empty, the monitor outputs out = 0 to indicate that no label is probable. We distinguish between multiple and no predictions because they may lead to a different action in the system. For example, no prediction may be the result of out-of-distribution inputs, while multiple possible predictions may be an indication that the significance level is smaller than the error rate of the underlying DNN. Choosing a relatively small significance level that can consistently produce prediction sets with only one class is important. To do this, we apply ICP on the data in the calibration/validation set and compute the smallest significance level ϵ that does not produce any prediction set with $|\Gamma^{\epsilon}| > 1$. Assuming that the distribution of the test set is the same as that of the calibration/validation set, we expect the same value of ϵ to minimize the prediction sets with multiple classes on the test data.
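The selection of the significance level can be sketched as follows, assuming that the p-values of every calibration/validation example have already been computed for all candidate labels and that a label enters the prediction set when its p-value is strictly larger than ϵ. Under these assumptions, a set has more than one label exactly when the second-largest p-value exceeds ϵ, so the smallest admissible ϵ is the largest second-largest p-value. The function name is hypothetical.

```python
def smallest_epsilon_without_multiples(p_values_per_example):
    """Smallest epsilon for which no example yields a prediction set with more than one label.

    p_values_per_example: list of lists, one list of per-label p-values per example
    (at least two labels per example are assumed).
    """
    # The second-largest p-value of an example is the threshold at which
    # its prediction set shrinks to at most one label.
    return max(sorted(p, reverse=True)[1] for p in p_values_per_example)
```

With the returned value, every calibration/validation example produces either a single label or an empty set.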
The assurance monitoring approach is illustrated in Algorithms 1 and 2. Algorithm 1 shows the tasks that need to be performed at design time where, first, a distance metric learning network f is trained using the proper training set (X, Y) so that the computed embedding representations will form clusters for each class. Then, using the calibration data, both the NC scores A and the optimal significance level ϵ are computed and stored. Algorithm 2 shows the tasks that are performed at runtime for a sensor input $x_t$. The input first needs to be mapped to its embedding representation $v_t$. Then, using the same NC function that is used for the calibration data, we compute the NC scores and the p-values assuming every label j as a candidate label. Then, the p-values $p_j$ and ϵ are used to compute the set of candidate labels $\Gamma^{\epsilon}$.
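The runtime steps can be sketched as a single function. This is an illustration under stated assumptions, not a transcription of Algorithm 2: `embed` stands for the trained network f, `nc_function(v, j)` for whichever NC function was used on the calibration data, and all names are hypothetical.

```python
def monitor_output(gamma):
    """Map a prediction set to the monitor output: 0 (empty), 1 (single label), or 'reject'."""
    if len(gamma) == 0:
        return 0
    if len(gamma) == 1:
        return 1
    return "reject"


def run_monitor(x_t, embed, nc_function, labels, calibration_scores, epsilon):
    """Compute the prediction set and the monitor output for one sensor input x_t."""
    v_t = embed(x_t)  # embedding representation of the input
    gamma = set()
    for j in labels:  # treat every label as a candidate
        score = nc_function(v_t, j)
        p_j = sum(1 for s in calibration_scores if s >= score) / len(calibration_scores)
        if p_j > epsilon:
            gamma.add(j)
    return monitor_output(gamma), gamma
```

The calibration scores and ϵ are computed once at design time, so the per-input cost is one forward pass of the network plus one NC score per label.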

Evaluation
Our assurance monitor design leverages distance metric learning techniques to compress the input data to lower dimensions in order to make the application of ICP more efficient and reduce its memory requirements. The objective of the evaluation is to compare how the suggested architecture performs against the baseline ICP approaches as well as investigate the validity/calibration and efficiency (size of set predictions).

Experimental setup
For the evaluation, we experiment with three datasets of variable complexity and input size. First, we use a dataset generated by the SCITOS-G5 mobile robot (Dua and Graff, 2017). This robot is equipped with 24 ultrasound sensors around it that are sampled at a rate of 9 samples per second. Its task is to navigate around a room counter-clockwise in close proximity to the walls. The possible actions the robot can take to accomplish this are "Move-Forward", "Sharp-Right-Turn", "Slight-Left-Turn", and "Slight-Right-Turn". The SCITOS-G5 dataset contains 5456 raw values of the ultrasound sensor measurements as well as the decision the robot took for each sample. Because of the small number of sensors, the inputs are one-dimensional and their size is relatively small. Second, we use a speech recognition dataset that contains 7501 audio samples from speeches of five prominent leaders: Benjamin Netanyahu, Jens Stoltenberg, Julia Gillard, Margaret Thatcher, and Nelson Mandela, made available by the American Rhetoric (Kiplagat, n.d.). Each audio sample has a duration of 1 s, a sampling rate of 16 kHz, and uses pulse-code modulation (PCM) encoding. Third, the German Traffic Sign Recognition Benchmark (GTSRB) dataset is a collection of traffic sign images to be classified into 43 classes, where each class corresponds to a type of traffic sign (Stallkamp et al., 2012). The dataset has 26,640 labeled images of various sizes between 15 × 15 and 250 × 250, depending on the distance of the traffic sign to the vehicle. For all datasets, we split the available data so that 10% of the samples are used for testing. From the remaining 90% of the data, 80% is used for training and 20% for calibration and/or validation. In the ICP implementations that use the k-NN NC function, the number of neighbors k is chosen to be 20, 15, and 40, respectively, for the three datasets, values that make the scores robust to outlier data points. The DNN architectures were chosen according to the complexity of each application so that they are simple enough to reduce the computational requirements but at the same time achieve good accuracy and data clustering without overfitting. All the experiments were run on a desktop computer equipped with an Intel(R) Core(TM) i9-9900K CPU, 32 GB RAM, and a GeForce RTX 2080 GPU with 8 GB memory.
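The data split described above can be sketched as follows. The shuffling, the seed, and the rounding of the split sizes are assumptions of this illustration and are not specified in the text; the function name is hypothetical.

```python
import random


def split_dataset(samples, seed=0):
    """Return (proper_training, calibration, test) with a 10% test split and an 80/20 split of the rest."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility (assumption)
    n_test = round(0.10 * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_train = round(0.80 * len(rest))
    return rest[:n_train], rest[n_train:], test
```

The proper training set is used to train the siamese or triplet network, and the calibration set is kept aside for computing the NC scores and selecting ϵ.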

Baseline
The proposed approach maps the original inputs to embedding representations for which the Euclidean distance is a measure of similarity between the inputs themselves. In order to understand the effect of distance metric learning in ICP, we compare it with the approaches that we used in our previous work (Boursinos and Koutsoukos, 2020a). First, the most basic way of applying ICP is using only the original inputs. Then, we compare it with the approach presented in our previous work that uses embedding representations without distance metric learning; this is the baseline in the following experiments.
The baseline approach computes the embedding representations using the activations of the penultimate layer of a DNN. A DNN is trained as a classifier to predict the class of the input data. The vector of activations of the neurons in the penultimate layer is taken as the embedding representation of the input. Figure 4 illustrates how the embedding representations are generated in the baseline using a DNN with four input neurons that classifies inputs into two possible classes. The embedding representations are generated in the penultimate layer and are typically reduced in size compared with the inputs. For an accurate comparison between the baseline and the proposed improvements using either the triplet or the siamese network, all of these approaches use the same DNN architecture, meaning that the embedding representations are also of the same size.

Preprocessing and distance learning
The difficulty of computing the NC functions and the memory demands increase as the input size increases. Here, we see how the original high-dimensional inputs are mapped to lower-dimensional representations so that the application of ICP is more efficient and the Euclidean distance between two embedding representations is a metric of similarity, a property useful in the computation of the NC scores. We evaluate how the use of the embedding representations affects ICP when it is applied to datasets of increasing complexity. First, the input data to the SCITOS-G5 mobile robot is a vector of 24 values. We use a fully connected feedforward DNN to generate embedding representations of size 8. The DNN is trained in either a siamese or triplet network for distance metric learning. The triplet network is trained without mining since this is a small dataset. Second, the speech recognition dataset contains audio samples with a duration of 1 s. To each audio sample, we add different kinds of noise, such as dishwasher, running tap, and exercise bike, at half the volume of the speech sample. Then, we use the FFT to convert the audio samples to the frequency domain. The sampling rate of the speech files is 16 kHz, so in the frequency domain each sample has 8000 components according to the Nyquist-Shannon sampling theorem (Shannon, 1949). A convolutional DNN is used to generate embedding representations of size 32 from each audio wave in the frequency domain. When the triplet architecture is used for training the DNN, semi-hard negative mining produces the best results. Finally, the GTSRB dataset contains traffic sign images of variable sizes. In order to be able to use a single DNN to produce embedding representations for the image data, every image is either up-sampled by interpolation or down-sampled to 96 × 96 × 3. A convolutional DNN is used to generate embedding representations of size 128. In the triplet case, the training produced better results when mining for hard negatives was used.
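The count of 8000 frequency components for the audio samples follows from a short calculation. It is sketched here under the assumption that the FFT is taken over the full 1 s sample and that only the non-redundant half of the spectrum of the real-valued signal is kept; the symbols N, T, $f_s$, $\Delta f$, and $f_{\max}$ are introduced only for this illustration.

```latex
\begin{align*}
N &= f_s \, T = 16000\ \text{Hz} \times 1\ \text{s} = 16000\ \text{samples}, \\
\Delta f &= \frac{f_s}{N} = 1\ \text{Hz}, \\
f_{\max} &= \frac{f_s}{2} = 8000\ \text{Hz}, \\
\frac{f_{\max}}{\Delta f} &= \frac{N}{2} = 8000\ \text{components}.
\end{align*}
```

A one-sided transform of N real samples has N/2 + 1 bins when both the zero-frequency and the Nyquist bins are counted, so the figure of 8000 corresponds to leaving one of them out; which one is dropped is not stated in the text.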
We first look at how well the distance metric learning methods cluster data of each class. A commonly used metric of the separation between classes is the silhouette (Rousseeuw, 1987). For each sample i, we first compute the mean distance between i and all other data points in the same cluster $C_i$ in the embedding space
$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, j \neq i} d(i, j).$$
Then, we compute the smallest mean distance from i to all the data points in any other cluster
$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j).$$
The silhouette value is defined as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$
Each sample i in the embedding space is assigned a silhouette value −1 ≤ s(i) ≤ 1 depending on how close it is to samples belonging to the same class and how far it is from samples belonging to different classes. The closer s(i) is to 1, the closer the sample is to samples of the same class and the further it is from samples belonging to other classes.
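The silhouette computation can be sketched in plain Python as follows. This is a minimal illustration assuming Euclidean distance, at least two classes, and the usual convention that a sample in a single-element cluster gets the value 0; the function name is hypothetical, and the quadratic cost makes it suitable only for small examples.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value s(i) of every sample, using the class labels as clusters."""
    clusters = {}
    for idx, y in enumerate(labels):
        clusters.setdefault(y, []).append(idx)
    values = []
    for i, y in enumerate(labels):
        own = [j for j in clusters[y] if j != i]
        if not own:
            values.append(0.0)  # singleton cluster convention
            continue
        # a(i): mean distance to the other members of the same cluster
        a_i = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b_i = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != y
        )
        values.append((b_i - a_i) / max(a_i, b_i))
    return values
```

Averaging the returned values over a dataset gives the mean silhouette used to compare the different representations.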
To compare the representations learned using the different methods as well as compute how much the clustering improves over the original inputs, we compute the mean silhouette over the training data and the validation data separately. In Table 1, we see that the representations learned by either the siamese or the triplet network form well-defined clusters and are improved over the baseline clusters. On the other hand, the original inputs are not arranged in clusters.

Selecting the significance level
First, we illustrate the assurance monitoring algorithm with a test example from the GTSRB dataset. The left side of Figure 5 shows the image of a 60 km/h speed limit sign. Using nearest centroid as the NC function and the siamese network, Algorithm 2 can be used to generate sets of possible predicted labels. In the following, we vary the significance level ϵ and report the set predictions. When ϵ ∈ [0.001, 0.004), the possible labels are "Speed limit 50 km/h", "Speed limit 60 km/h", "Speed limit 80 km/h"; when ϵ ∈ [0.004, 0.006), the possible labels are "Speed limit 50 km/h", "Speed limit 60 km/h"; and finally, when ϵ ∈ [0.006, 0.0124], the algorithm produces the single prediction "Speed limit 60 km/h", which is correct. For monitoring of CPS, one can either choose ϵ to be small enough given the system requirements or compute ϵ to minimize the number of multiple predictions. Since the number of multiple predictions decreases when ϵ increases, we can select ϵ as the smallest value that eliminates multiple predictions for a calibration/validation set. This can be seen in Figure 6 where, for each dataset, the optimal ϵ is selected as the significance level at which the performance curve goes to 0. The nearest centroid NC function is used for the plots in this figure.
Table 2 shows the results for the different datasets and the various NC functions. First, using the calibration/validation dataset, we select ϵ to eliminate sets of multiple predictions and we report the errors in the predictions for the testing dataset. The algorithm did not generate any set with multiple predictions for the testing datasets for any of the NC functions, with one exception: the 1-NN when it was used on the SCITOS-G5 dataset with representations computed with the triplet network. In this particular case, there was no ϵ that could eliminate the prediction sets with multiple classes, and even when ϵ = 1, 38.6% of the test inputs produced prediction sets with multiple classes. The error rates are well-calibrated and bounded by the computed or the chosen significance level. One way to compare the different NCMs is by looking at the significance level that is required for ICP to make single predictions. With embedding representations, single predictions could always be produced using significance levels much lower than when the original inputs are used. The significance of the distance metric learning techniques is apparent in the case of the nearest centroid NCM on all the datasets. This is an appealing NCM for its simplicity and its reduced memory requirements. When used as part of the baseline, its performance was not as good as that of the more expensive NCMs. However, leveraging the better clustering that distance metric learning methods achieve, the nearest centroid NCM performs as well as or better than the rest of the NCMs in making predictions with a low significance level while retaining the computational efficiency. We also evaluate how well the different approaches bound the error rate for two different values of the significance level. The errors are bounded in most cases regardless of whether embedding representations are used. The percentage of set predictions on the test data that have multiple candidate classes tends to increase as the chosen ϵ falls further below the estimated optimal ϵ.

Computational efficiency
In order to evaluate whether the approach can be used for real-time monitoring of CPS, we measure the execution times and the memory requirements. Different NC functions lead to different execution times and memory requirements. Table 3 reports, for the different NC functions, the average execution time over the testing datasets required for generating a prediction set after the model receives a new test input, together with the required memory space. The 1-NN NC function on the input space of the GTSRB dataset has excessive memory requirements. Below we present the computational requirements for each NC function and explain the higher requirements of the 1-NN function in more detail. Datasets with high-dimensional inputs are challenging for applying ICP in real time, and the results demonstrate the impact of using embedding representations on the execution times. All the NC functions require storing the calibration NC scores, which are used for computing the test p-values online. The DNN weights need to be stored when embedding representations need to be calculated for every new test input. Furthermore, each NCM has a different memory overhead. In the k-NN case, the encodings of the training data are stored in a k-d tree (Bentley, 1975) that is used to compute the k-NN efficiently. This data structure is used for both the k-NN and 1-NN NC functions. In the 1-NN case, it is required to find the nearest neighbor in the training data for each possible class, which is computationally expensive and results in a larger execution time. The nearest centroid NC function requires storing only the centroids for each class, and the additional memory required is minimal.
In conclusion, the evaluation results demonstrate that monitoring based on ICP has well-calibrated error rates in all configurations. Furthermore, the use of embedding representations reduces the computational requirements and can lead to decisions with an improved significance level. Using distance metric learning methods, the training data form well-defined clusters, which is essential in the case of the nearest centroid NCM. This improvement makes it a good NCM option for all of the datasets used, as it performs as well as the other NCMs but with significantly lower computational requirements.
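The difference in memory overhead between the neighbor-based and the nearest centroid NC functions can be estimated with simple arithmetic. The sketch below counts only the stored embedding values for the GTSRB configuration (embedding size 128, 43 classes) and assumes 4-byte floats and a proper training set of roughly 72% of the 26,640 images. It ignores the tree structure, the DNN weights, and the calibration scores, so the numbers are illustrative and are not the values reported in Table 3.

```python
def embedding_storage_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes needed to store n_vectors embeddings of size dim (float32 assumed)."""
    return n_vectors * dim * bytes_per_value


n_images = 26640
n_train = int(n_images * 0.9 * 0.8)  # approximate proper training set size
n_classes, embedding_size = 43, 128

knn_bytes = embedding_storage_bytes(n_train, embedding_size)         # all training encodings
centroid_bytes = embedding_storage_bytes(n_classes, embedding_size)  # one centroid per class

print(f"k-NN / 1-NN encodings: {knn_bytes / 1e6:.2f} MB")
print(f"nearest centroid:      {centroid_bytes / 1e3:.2f} kB")
```

Under these assumptions, the centroids take several hundred times less space than the full set of training encodings.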

Concluding remarks
CPS incorporate machine learning components such as DNNs for performing various tasks such as the perception of the environment. When used for safety-critical applications, they need to satisfy specific requirements that are defined taking into account the acceptable risk and its cost for incorrect decisions. Although DNNs offer advanced capabilities in the decision-making process, they cannot provide guarantees on the estimated error rate. To achieve this, they must be complemented by engineering methods and practices that allow effective integration in CPS where an accurate estimate of confidence is needed.
The paper considers the problem of complementing the prediction of DNNs with a well-calibrated confidence. For classification tasks, the inductive conformal prediction framework allows selecting the significance level according to the requirements of each application. This is a parameter that defines the acceptable error rate and is a trade-off between errors and alarms. We presented computationally efficient algorithms based on representations learned by underlying DNN models that make it possible for ICP to be used for real-time monitoring. The proposed approach was evaluated on three different benchmarks of increasing complexity, from a mobile robot with ultrasound sensors to speaker recognition and traffic sign recognition. The evaluation results demonstrate that monitoring based on the inductive conformal prediction framework using embedding representations instead of the original inputs has well-calibrated error rates and can minimize the number of alarms raised when a confident decision cannot be made. When appropriate embedding representations are computed using distance metric learning methods, input data that belong to the same class form well-defined clusters. This property is very important when the similarity of a test input to the training data is estimated. That way, the training set can be efficiently represented by the centroids of each class, which reduces the computational requirements without any loss in performance when compared with the more computationally expensive approaches.
During the experiments, we identified a number of challenges that can lead to poor performance of the proposed method. First, when the datasets are imbalanced, both the siamese and the triplet architectures may not learn embedding representations that cluster the under-represented classes well. This affects the efficiency of the NC functions. Second, the training of the triplet networks requires mining of training data that will form triplets that lead to large gradients for minimizing the triplet loss function. There is ongoing research on mining algorithms for faster training. One open question for future research is how to utilize all the candidate decisions in the prediction set to deal with the cases when a confident decision that satisfies the significance-level requirements cannot be made.

Fig. 2. Embedding representations of input images from the traffic sign recognition dataset.

Fig. 6. Performance and calibration curves formed using the validation data from the different datasets using the nearest centroid NC function.
Algorithm 1 (design time), final step: Compute ϵ such that for each i ∈ [1, …, l − m] no more than one of the p-values satisfies $p_{ij} \geq \epsilon$.
Algorithm 2 (runtime), inputs: Require: trained siamese or triplet neural network f for distance metric learning. Require: test input $z_t = (x_t, y_t)$.

Table 1. Clustering comparison using the silhouette coefficient

Table 2. ICP performance for the different configurations

Table 3. Execution times and memory requirements