Integrating medical rules to assist attention for sleep apnea detection

Abstract Sleep apnea is one of the most common sleep disorders. Left undiagnosed, it can have serious long-term consequences, increasing the risk of high blood pressure, heart disease, stroke, and Alzheimer’s disease. However, many people are unaware of their condition. The gold standard for diagnosing sleep apnea is nighttime polysomnography monitoring in a specialized sleep laboratory. However, this diagnosis is expensive, the number of beds is limited, and monitoring over time is insufficient. Existing methods for automated detection use no more than three physiological signals, although the other signals are also associated with the patient’s sleep. In addition, the limited amount of real annotated medical data, especially abnormal samples, leads to weak model generalization, so a gap remains between model generalization capability and the needs of the medical field. In this paper, we propose a method for integrating medical interpretation rules into a self-attention-based long short-term memory neural network with multichannel respiratory signals as input. We obtain attention weights through a token-level attention mechanism and then extract key rules of medical interpretation to assist the weights, improving model generalization and reducing the dependence on data volume. Compared with the best prediction performance of existing methods, the average improvements of our method in accuracy, precision, and f1-score are 3.26%, 7.03%, and 1.78%, respectively. We evaluated our model on the Sleep Heart Health Study data set and found that it outperformed existing methods and could help physicians make decisions in their practices.


Introduction
Sleep apnea (SA) is a sleep-related disease characterized by difficulty in breathing during sleep [1,2]. The disease can be divided into two categories by its etiology: (1) obstructive sleep apnea (OSA), which is caused by obstruction of the airway by the throat muscles [3], and (2) central sleep apnea (CSA), which is caused by a disturbance in the brain center that controls breathing [4]. People of all ages are at risk of SA. Approximately 200 million people (4% of adult men and 2% of adult women) [5] in the world suffer from sleep-disordered breathing [6,7]. According to reports [8,9], in the United States, 93% of middle-aged women with SA and 82% of patients with moderate to severe SA are undiagnosed. Studies [10] have also shown that the prevalence among preschool children is 3%. Moreover, SA is associated with ischemic heart disease, cardiovascular dysfunction and stroke [11], and daytime sleepiness [12], and could be related to the development of diabetes mellitus type 2 (T2DM) [13].
Currently [14], the gold standard for diagnosing sleep apnea is all-night polysomnography (PSG) in the sleep laboratory [1]. To enable doctors to obtain accurate results [15], PSG records involve at least 11 channels of various physiological signals collected from different sensors, including electroencephalogram (EEG), electrooculogram (EOG), electromyography (EMG), and electrocardiogram (ECG), etc. [16]. Because a large number of sensors are mounted on the body, patients tend to feel uncomfortable [17]. In addition, the PSG service is normally expensive and unavailable for most people [12]. The analysis process is time-consuming and laborious [18], and the number of qualified professionals who can diagnose sleep apnea in medical institutions is very limited [13]. Therefore, there is an urgent need for automatic SA detection [19] to help technicians achieve high accuracy and throughput in SA diagnosis [20].
Deep learning has a wide range of applications in the medical field [21,22]. For example, Zhou et al. [23,24] demonstrated a robust deep learning framework for needle detection and localization in subretinal injection using microscope-integrated optical coherence tomography. Park et al. [25] proposed a frequency-aware attention-based LSTM (long short-term memory) for cardiovascular disease that weights important medical features using an attention mechanism that considers the frequency of each medical feature. Various automatic methods have been proposed to help diagnose SA. Steenkiste et al. [26] proposed an automatic SA detection method based on LSTM neural networks, which uses the original physiological respiratory signals to automatically learn and extract related characteristics and to detect possible sleep apnea events. The authors use balanced bootstrapping so that each experiment is conducted using the entire minority class and a majority-class sample of the same size. The method achieved an average true positive rate of 80% by using three sensor signals: abdominal respiration, thoracic respiration, and ECG-derived respiration (EDR). Thorey et al. [27] proposed a fully convolutional and highly parallelizable method based on a one-dimensional convolutional neural network (CNN1D) that can process signals of any size efficiently. Their method reached an average accuracy of 81% for sleep apnea severity diagnosis by using more physiological signals. However, existing research suffers from three limitations: (1) PSG involves multiple signals, but most existing methods are based on no more than three signals, so the other signals are not fully utilized. (2) The amount of labeled data is limited, especially for abnormal samples, which leads to poor generalization ability. (3) The accuracy of current algorithms still needs to be improved for practical usage.
To address the above limitations, this work proposes a method that integrates domain knowledge, in the form of medical rules, into an LSTM neural network that uses multichannel respiratory signals and a self-attention mechanism. In this work, we obtain the attention weights through a token-level attention mechanism, then extract the key medical rules used by doctors and apply them to the input to obtain auxiliary weights. Subsequently, the proposed method combines the two sets of weights through a real-valued hyperparameter to guide the attention values. Finally, the hyperparameter is optimized by Bayesian optimization (BO) to obtain a model with better generalization capability.
Toward the development of automatic SA detection, the contributions of this work can be summarized as follows:
• The proposed method can detect SA by using all signals (including ECG, EEG, thoracic respiratory, etc.) in PSG as multichannel inputs to model data (Section 3.2). The results demonstrate that multichannel input is superior to the conventional three-channel input and to any single-channel input.
• The proposed method integrates medical rules into the model to assist the attention weights, which can improve model generalization and effectively alleviate the dependence on the amount of data when data volume is reduced (Section 3.3).
• The proposed method is tested on the publicly available Sleep Heart Health Study dataset, and it is shown that our model outperforms existing methods and can help physicians make decisions in practice (Section 4.4.2).

Automatic sleep apnea detection
Previous works have tried to automatically detect sleep apnea using deep neural network (DNN) models, such as LSTM neural networks and convolutional neural networks (CNN). Steenkiste et al. [26] used an LSTM neural network to capture temporal information and accurately model the data. A fourth-order low-pass zero-phase shift Butterworth filter was first used to reduce noise in the respiratory signal, and OSA events were then predicted automatically from the expansion and contraction patterns of abdominal respiration, thoracic respiration, and EDR. Haidar et al. [28] performed a binary classification (apnea or normal) based on nasal airflow analysis using a CNN1D classifier and a balanced dataset. The network consists of three convolutional layers, each with 30 filters, a kernel size of [5,1], and a step size of 5; each convolutional layer is followed by a max pooling layer with a size of [2,1], and the network ends with a fully connected layer with a softmax activation function. After evaluating other activation functions, the authors chose ReLU because it gave the best accuracy and the fastest training time. Haidar et al. [29] also tested CNN1D with three input signals (nasal airflow, abdominal respiration, and thoracic respiration) using a hold-out method, with 75% of the data used for training and 25% for testing. Two back-to-back convolution layers and a subsampling layer (conv-conv-maxpooling) are cascaded three times. However, the physiological signals used in their methods, such as nasal pressure and airflow, are inconvenient to measure, which limits the application scenarios. Our method can exceed their performance using only a single thoracic respiratory signal.

Logic rules in deep learning
Logic rules embody high-level cognition and structured knowledge in the process of human communication. Incorporating rules into neural networks can be of great help to the learning process. The integration of common sense knowledge has also received a lot of attention in many tasks. Hu et al. [30] proposed a general framework that can use declarative first-order logic rules to improve a variety of neural networks. In particular, they developed an iterative knowledge distillation method that transfers the structured information of logic rules into the weights of the neural network. The framework is implemented on a CNN for sentiment analysis and an RNN for named entity recognition. Tandon et al. [31] proposed to use common sense knowledge as hard or soft constraints to bias the prediction of neural models for procedural text comprehension tasks. Xu et al. [32] augmented the training objective with an additional logic loss as a means of applying soft constraints. This semantic loss quantifies the probability of generating a satisfying assignment by randomly sampling from the predicted distribution. Li et al. [3] proposed a framework that uses first-order logic to express knowledge without changing the end-to-end training method and integrates this structured knowledge into the neural network architecture. Our method extracts the key rules of the doctor's interpretation, introduces rule constraints into the neural network, and then uses the rules that control attention to augment the network.

Our approach
The architecture of our proposed method is shown in Fig. 1. The following description is divided into three parts: problem definition, multichannel model, and integration of rules. The whole process is shown in Fig. 2.

Problem definition
Figure 1. In our architecture, the initial input $D$ is applied to the rule-assisted layer to obtain the auxiliary weight $\alpha_r$, and then $\alpha_r$ is combined with the self-attention weight $\alpha_s$ to obtain the final weight.

Figure 2. Integrating medical rules into models for apnea detection using PSG signals.

PSG contains a variety of physiological signals of patients, but current research is limited to only a few of them. In addition to the commonly used signals (thoracic respiration, abdominal respiration, and nasal airflow), other signals are also related to the patient's sleep. Because these signals have different sampling rates, they have different dimensions. We therefore group the PSG signals by sampling rate $f_s$ to form $D = \{D_1, D_2, \cdots, D_s\}$, with each sample carrying a label $y \in \{0, 1\}$, where 0 is normal and 1 is abnormal. Then, we use encoding models $E(\cdot; \theta)$ with different parameters to embed these segments of different dimensions into representations of the same dimension, $z_{ch}$ for $ch = 1, 2, \cdots, k$, where $k = |D_1| + |D_2| + \cdots + |D_s|$. Given a particular PSG signal segment $d_i \in \mathbb{R}^{l}$, we obtain a feature $z_i \in \mathbb{R}^{m}$ computed as $E(d_i; \theta_i)$, where $m$ is the dimension of the input after embedding. We can then use the same-dimensional data $Z = \{z_1, z_2, \cdots, z_k\}$ to train a classification model $M(\cdot; \theta)$ for diagnosing sleep apnea. However, the predictive capability of such a model is limited because real labeled medical data are scarce, especially abnormal samples, which leads to weak generalization ability. We propose a method for integrating medical interpretation rules into an LSTM neural network with multichannel respiratory signals as input, based on a self-attention mechanism. First, we process the features $Z \in \mathbb{R}^{m \times k}$ with an attention layer Att() to get the attention weights $\alpha_s$. Then, we build an auxiliary layer Rule() from medical rules to get the auxiliary weights $\alpha_r$. Finally, the two sets of weights are combined through a real-valued hyperparameter.
The following sections will describe how the above models can be computed in detail.

Multi-channel model
For data $D = \{D_1, D_2, \cdots, D_s\}$ of different frequencies, we encode each signal separately to the same dimension using LSTMs with different parameters. Formally,

$$z_i = E(d_i; \theta_i), \quad i = 1, 2, \cdots, k,$$

where $z_i \in \mathbb{R}^{m \times 1}$ denotes the features of the same dimension after encoding, $m$ is the dimension after encoding, $E(\cdot; \theta_i)$ represents the embedding model for the $i$-th signal, $d_i$ denotes the $i$-th signal, $k$ denotes the number of signals, and $\theta_i$ denotes the parameters of the corresponding model. We then pass $X = \{z_1, z_2, \cdots, z_k\}$, $X \in \mathbb{R}^{m \times k}$, as input to the subsequent classifier.
Here we have a feature $X \in \mathbb{R}^{m \times k}$ as the input to the classifier, where $m$ is the dimension after encoding and $k$ is the number of channels. We choose LSTM as the base model because LSTM neural networks are suitable for modeling sequence data. LSTM is an improved recurrent neural network (RNN) that addresses the inability of a plain RNN to handle long-distance dependencies. The hidden layer of the original RNN has only one state $h$, which is very sensitive to short-term inputs. The LSTM adds a second state $c$, called the cell state, which saves the long-term state. Here, $h_t$ represents the hidden state at time $t$. At time $t$, there are three inputs to the LSTM: the input value $x_t$ of the network at the current moment, the output value $h_{t-1}$ of the LSTM at the previous moment, and the cell state $c_{t-1}$ at the previous moment. There are two outputs of the LSTM: the output value $h_t$ at the current moment and the cell state $c_t$ at the current moment. Formally, in the standard LSTM formulation,

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),$$
$$h_t = o_t \odot \tanh(c_t),$$

where $\sigma$ is the logistic sigmoid function, $\tanh$ is the hyperbolic tangent activation function, $W$ and $b$ (with the corresponding gate subscripts) represent the weight matrices and bias terms, $[h_{t-1}, x_t]$ represents the concatenation of $h_{t-1}$ and $x_t$, and $\odot$ denotes element-wise multiplication. The forget gate $f_t$ determines how much of the cell state $c_{t-1}$ from the previous moment is retained in the current state $c_t$. The input gate $i_t$ determines how much of the input $x_t$ of the neural network at the current moment is saved to the cell state $c_t$. $\tilde{c}_t$ is a new candidate vector created by the tanh layer and is added to the next cell state. The output gate $o_t$ controls how much of the cell state $c_t$ is output to the current output value $h_t$ of the LSTM. Finally, we collect all the hidden state vectors into a matrix $H \in \mathbb{R}^{u \times k}$, where $u$ is the length of the hidden state.
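To make the gate description concrete, the following is a minimal sketch of a single LSTM time step in plain Python. It follows the standard LSTM formulation rather than the authors' implementation (which uses Keras); the function name `lstm_step` and the `params` layout are hypothetical choices made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (illustrative sketch, not the paper's code).

    params maps a gate name ("f", "i", "c", "o") to (W, b), where W is a list
    of u rows, each of length u + len(x_t), and b is a list of length u.
    """
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(sum(w * v for w, v in zip(row, concat)) + bias)
                for row, bias in zip(W, b)]

    f_t = gate("f", sigmoid)        # forget gate: how much of c_{t-1} to keep
    i_t = gate("i", sigmoid)        # input gate: how much of the candidate to store
    c_tilde = gate("c", math.tanh)  # candidate cell state
    o_t = gate("o", sigmoid)        # output gate: how much of c_t to expose

    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step, which is a quick sanity check of the update rule.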

Integration of rules
This section describes how medical rules are integrated into the multichannel model described above. It covers token-level self-attention in the LSTM, the rule-assisted layer, and the combination of weights.

Token-level self-attention
Next, we take $H$ as the input and use the dot-product attention mechanism to get the attention weights. For easier integration with the subsequent output of the rule-assisted layer, we need token-level attention $\alpha_s$. To get the token-level attention weights, the dot-product attention weights are multiplied by a parameter vector. The computation uses the following quantities: $W_1$ and $W_2$ are weight matrices of shape $k \times u$, $w_3$ is a parameter vector of size $k$, $V$ is the intermediate result (a matrix of similarity weights), and $\alpha_s$ is the attention weight for each token, with a size of $m$.

Rule-assisted layer
The American Academy of Sleep Medicine (AASM) has developed a manual [33] for the scoring of sleep and related events. The manual provides instructions for scoring sleep stages, respiratory events, and other sleep-related parameters to improve the accuracy and reproducibility of PSG measurements. The key medical rules for detecting sleep apnea events can be described as follows: (1) There is a drop in the peak signal excursion by at least 90% of the pre-event baseline using an oronasal thermal sensor (diagnostic study), positive airway pressure device flow (titration study), or an alternative apnea sensor. (2) The duration of the 90% drop in sensor signal is at least 10 s.
We will borrow the predicate symbols defined in the natural language processing task. We define two rules to assist and constrain attention: denotes the relatedness, R i denotes the weight after applying the rule to the original input, A i denotes the attention weight obtained based on the internal relatedness, and A i denotes the weight after auxiliary and restriction.
The abnormal respiratory events that will be considered in the diagnosis of SA include apnea and hypopnea. The above rules are for detecting apnea. The difference between hypopnea and apnea lies in the degree of decline. The recommended hypopnea definition requires a 30% or greater drop in flow for 10 s or longer associated with 4% oxygen desaturation. This value of the drop is set as a hyperparameter β, and then BO is used to find the best value.
We extract key medical rules as additional knowledge to assist the attention weights. Formally,

$$\alpha_r = \mathrm{Rule}(d_i),$$

where $d_i \in D$. The detailed process of Rule() is shown in Algorithm 1. We first use the annotations of the dataset to label each segment with its baseline value, taking the baseline value closest to the corresponding time period. $p_n$ represents the normal amplitude of breathing, which is the baseline value. $p_c$ represents the signal amplitude of the current period. $cnt$ represents the number of consecutive slices whose amplitude is below the baseline value.
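Since Algorithm 1 is not reproduced here, the following is only a hedged sketch of how such a rule check might look. It assumes the signal has already been cut into fixed-duration slices with one amplitude $p_c$ per slice, and that the auxiliary weight is 1 for slices inside a qualifying run and 0 elsewhere; the function name `rule_weights`, its parameters, and the binary weighting are all hypothetical.

```python
import math

def rule_weights(amplitudes, baseline, beta=0.9, slice_seconds=1.0, min_seconds=10.0):
    """Hypothetical Rule(): flag slices in runs where the amplitude has dropped
    by at least `beta` of the baseline p_n for at least `min_seconds`.

    amplitudes : per-slice signal amplitude p_c
    baseline   : normal breathing amplitude p_n
    beta       : required fractional drop (0.9 for the apnea rule; tunable)
    """
    threshold = (1.0 - beta) * baseline
    min_slices = math.ceil(min_seconds / slice_seconds)
    weights = [0.0] * len(amplitudes)
    cnt = 0  # number of consecutive slices below the threshold
    for idx, p_c in enumerate(amplitudes):
        cnt = cnt + 1 if p_c <= threshold else 0
        if cnt == min_slices:
            # the run just became long enough: mark it from its first slice
            for j in range(idx - cnt + 1, idx + 1):
                weights[j] = 1.0
        elif cnt > min_slices:
            weights[idx] = 1.0
    return weights
```

Short dips that recover before the minimum duration leave the weights at zero, so only sustained drops steer the attention.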

Weight combination
Our purpose is to assist in modifying the attention weights through the restriction of the rule-assisted layer, combining the two sets of weights through a hyperparameter $\lambda$. Here, $\lambda$ is a non-negative hyperparameter that determines the degree of restriction of the rule-assisted layer. softmax() ensures that the sum of all calculated weights is 1. The new matrix $H_r$ is obtained by multiplying the weight vector $\alpha$ and the hidden states $h_i$. $H_r$ replaces $H$ as the input of the subsequent fully connected layer. The loss function is the binary cross-entropy, defined as

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$$

where $N$ represents the number of samples for an epoch, $y_i$ represents the true binary label of sample $i$, and $\hat{y}_i$ represents the predicted probability of sample $i$.
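The exact combination formula is not recoverable from the text, so the sketch below assumes one plausible form, $\alpha = \mathrm{softmax}(\alpha_s + \lambda \alpha_r)$, which is consistent with the description (a non-negative $\lambda$ and a softmax normalization) but is not confirmed as the authors' equation. The helper names are hypothetical. The binary cross-entropy follows its standard definition.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_weights(alpha_s, alpha_r, lam):
    """Assumed form: softmax(alpha_s + lam * alpha_r), with lam >= 0."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return softmax([s + lam * r for s, r in zip(alpha_s, alpha_r)])

def reweight_hidden(alpha, hidden_states):
    """Scale each hidden state h_i by its weight alpha_i to form H_r."""
    return [[a * v for v in h] for a, h in zip(alpha, hidden_states)]

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Under this assumed form, setting `lam = 0` recovers plain self-attention, while larger values shift more weight toward the tokens flagged by the rule layer.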

Data description
The Sleep Heart Health Study (SHHS) [34,35] is a multicenter cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing. The SHHS Visit 1 (SHHS-1) dataset represents data from the baseline and first follow-up visits, collected on 6441 individuals between 1995 and 1998. A sample of participants who met the inclusion criteria (age 40 years or older; no history of treatment of sleep apnea; no tracheostomy; no current home oxygen therapy) was invited to participate in the baseline examination of the SHHS, which included an initial polysomnogram. Polysomnograms were obtained in an unattended setting by trained and certified technicians. The recording consisted of electroencephalogram (EEG), electrocardiogram (ECG), electrooculograms (EOG), electromyogram (EMG), thoracic respiration (TR), abdominal respiration (AR), nasal airflow (NA), pulse oxygen saturation (SpO2), heart rate (HR), body position, and ambient light, as shown in Fig. 3. Each recording has a signal file, event scoring, and epoch staging annotations.

Data processing
The raw physiological signal contains a wide range of noise due to subject motion, electrical interference, measurement noise, and other disturbances. Noise reduction is therefore essential and frequently used in sleep apnea detection methods. To extract relevant respiratory information and reduce noise, the physiological respiratory signal is passed through a fourth-order low-pass Butterworth filter with a cutoff frequency of 0.7 Hz [36]. This cutoff frequency is chosen to preserve the main respiratory components while eliminating as much noise as possible [37]. Taking into account the duration of apnea events in the data set and the doctor's recommendation, the signal is divided into 100 s epochs with a step of 1 s between them, keeping its original sampling frequency. Each sample is labeled according to the annotation file provided in the SHHS dataset. Then, we reduce the number of normal samples to approximately the same as the number of abnormal samples.
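The epoching and class-balancing steps can be sketched as follows. This is an illustration only: it assumes a sampling rate of at least 1 Hz, takes the epoch labels as already derived from the annotation file, and uses random undersampling of the normal class, which is one common way to equalize class sizes but is not stated as the authors' exact procedure. The function names are hypothetical.

```python
import random

def sliding_epochs(signal, fs, window_s=100, step_s=1):
    """Cut a 1-D signal sampled at `fs` Hz into overlapping epochs
    of `window_s` seconds, advancing by `step_s` seconds each time."""
    win = int(window_s * fs)
    step = max(1, int(step_s * fs))
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, step)]

def undersample_normal(epochs, labels, seed=0):
    """Randomly drop normal (0) epochs until they match the abnormal (1) count.
    Random undersampling is an assumption made for this sketch."""
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    abnormal = [i for i, y in enumerate(labels) if y == 1]
    keep = sorted(abnormal + rng.sample(normal, min(len(normal), len(abnormal))))
    return [epochs[i] for i in keep], [labels[i] for i in keep]
```

For example, a 105 s recording sampled at 10 Hz yields six 100 s epochs, starting at 0 s, 1 s, and so on up to 5 s.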

Experiment setup
We use LSTM as the basic model for classification, define the step size in LSTM as 4 s, and train an LSTM with a length of 25 given an observation window of 100 s. The LSTM network architecture is as follows: an LSTM layer is followed by a dropout layer, whose function is to improve the generalization ability of the network to unseen data. Then, a dense layer with the ReLU activation function is added, followed by another dropout layer. Finally, a dense layer with a softmax activation function is added. The output produced by this activation function can be interpreted as the probability that the input epoch contains apnea. During training, the time step of the sample is set to 40 based on the body's breathing cycle, so the input is reshaped to b × t × m, where b is the batch size and t is the time step. The ratio of the training set, validation set, and test set is 5 : 2 : 3. The test set remains the same for all methods. We use a stochastic gradient descent optimization algorithm as the optimizer, with a batch size of 128, 100 epochs, and a learning rate of 0.001. All proposed models are implemented with the TensorFlow and Keras libraries and run on a PowerLeader PR4908P server configured with 8 × 32GB RAM, an Intel(R) Xeon(R) Gold 6154 CPU, and a TITAN XP GPU.
We evaluate the performance of the proposed methods and compare them with their counterparts. We use a vanilla LSTM as the basic model, denoted as vLSTM. The vLSTM model with the token-level self-attention mechanism is denoted as sLSTM. The sLSTM model with the rule-assisted layer is denoted as rLSTM.
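As a small illustration of the 5 : 2 : 3 partition, the sketch below splits sample indices into training, validation, and test sets. Shuffling with a fixed seed is an assumption made here for reproducibility; the text does not say how samples were assigned to each set, and the function name is hypothetical.

```python
import random

def split_indices(n_samples, ratios=(5, 2, 3), seed=0):
    """Return (train, val, test) index lists in the given ratio.
    The fixed-seed shuffle is an assumption of this sketch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Keeping the seed fixed means the same test indices are reused across runs, in line with the requirement that the test set stays the same for all methods.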
The performance of the models is evaluated according to the following test criteria: accuracy, precision, recall, and f1-score.
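The metrics reported in the experiments (accuracy, precision, recall, and f1-score) can be computed from the confusion counts using their standard definitions, as in the short sketch below. Treating the abnormal class (label 1) as the positive class is an assumption, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Standard binary metrics with label 1 (abnormal) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because precision and recall pull in opposite directions, a model can lose a little recall while gaining precision and still improve its f1-score, which is the pattern reported for rLSTM against the best existing method.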

Performance of multichannel model
In this section, we compare the experimental results obtained with different signals as inputs. The inputs are divided into single signals and multichannel signals. The single signals are EEG, ECG, EOG, EMG, SpO2, HR, TR, AR, and NA. The multichannel signals are the three physician-recommended signals (TR, AR, and NA) and the PSG signals (all of the above single signals). Given the same respiratory signal segments and the corresponding test set labels, we measure their prediction performance (i.e., accuracy, precision, recall, and f1-score).
The method introduces two hyperparameters, λ and β. λ is a non-negative hyperparameter that determines the degree of restriction of the rule-assisted layer, and β represents the amplitude of the signal drop. We first use Bayesian optimization to automatically select these hyperparameters and then use the optimal values to build the subsequent models. The result of Bayesian optimization is shown in Fig. 4. The higher the performance evaluation of the best candidate, the better that group of hyperparameters performs. After Bayesian optimization, the hyperparameters we choose are λ = 0.5 and β = 0.8.
We ran experiments with multiple signals as multichannel inputs to verify the effect of multidimensional data on detection. As shown in Table I, the physician-recommended signals outperform the other signals, and nasal airflow performs best among the single signals. The multichannel models are better overall than the single-signal models, and the PSG input, which contains the most signals, performs best. Overall, the multichannel model brings an improvement.

Performance of rule-assisted layer
In this section, we compare the proposed methods with two popular sleep apnea detection algorithms and a rule-based method. Given the same respiratory signal segmentations and the corresponding test set labels, we measure their prediction performance (i.e., accuracy, precision, recall, and f1-score).
Next, we compare our proposed method with the existing methods. The comparison algorithms use the same data for training. As shown in Table II, the performance of our basic model vLSTM is slightly better than that of CNN1D, which is the best existing method in terms of accuracy, f1-score, and precision. The sLSTM model, which introduces the token-level self-attention mechanism, improves on vLSTM, which shows that the self-attention mechanism can help improve performance. After adding the rule-assisted layer to assist the attention weights, the rLSTM model shows a slight decrease in precision compared to sLSTM, but it improves on the other three evaluation metrics, especially accuracy. With the additional domain knowledge, the performance of the proposed rLSTM method is comparable in all evaluation metrics to the best prediction performance of existing methods. The average degradation on recall is only 0.0322, but the average improvement is 0.0326, 0.0703, and 0.0178 on accuracy, precision, and f1-score, respectively.

Impact of data volume
To verify whether the rule layer helps to alleviate the need for data, we choose the sLSTM and rLSTM models for comparison experiments. We train the models using 100%, 80%, 50%, 30%, and 10% of the training data, respectively, and then evaluate them on the same test set. As shown in Fig. 5, the overall performance of rLSTM is better than that of sLSTM. As the amount of data decreases, the performance of both models declines, but the decline of the rLSTM model is more gradual than that of the sLSTM model. This shows that additional domain knowledge can help alleviate the need for data.

Conclusion
In this paper, we propose a new method that extracts key rules in sleep apnea detection as additional domain knowledge to assist and constrain attention weights, improving the generalization ability of the model and alleviating the need for data. Compared with the current state-of-the-art method, the results of evaluating the model on the same public data set show a considerable improvement. With the additional domain knowledge, the performance of the proposed method is comparable in all evaluation metrics to the best prediction performance of existing methods. The average degradation on recall is only 0.0322, but the average improvement is 0.0326, 0.0703, and 0.0178 on accuracy, precision, and f1-score, respectively. Our models can benefit from additional external domain knowledge during training and inference, especially in the case of limited training data.

Conflicts of interest. The authors declare no conflicts of interest.
Ethical standards. Not applicable.