Predictive maintenance in aircraft engine maintenance using the C-MAPSS dataset: performance comparison and evaluation of machine learning classification algorithms

Published online by Cambridge University Press:  20 February 2026

Hikmetcan Özcan*
Affiliation: Computer Engineering, Kocaeli University, Türkiye
*Corresponding author: Hikmetcan Özcan; Email: hikmetcan.ozcan@kocaeli.edu.tr

Abstract

This study assesses classification-based predictive maintenance (PdM) for aircraft engines on the NASA Commercial Modular Aero-Propulsion System Simulation dataset and addresses the lack of wide-scope, unified benchmarks. PdM is cast as a short-term binary task – predicting whether an engine will fail within the next 30 cycles – and a comparison is conducted across 10 machine-learning models (Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, Extreme Gradient Boosting, LightGBM, CatBoost, and Gradient Boosting) and 3 deep-learning models (Multilayer Perceptron, Gated Recurrent Unit, and Long Short-Term Memory). A leakage-aware pipeline applies Min–Max scaling; class imbalance is handled with Synthetic Minority Over-sampling Technique where appropriate; hyperparameters are tuned via GridSearchCV/BayesSearchCV; and performance is reported with accuracy, precision, recall, F1-score, and receiver operating characteristic–area under the curve (ROC–AUC), complemented by Shapley Additive Explanations (SHAP) explainability and nonparametric significance tests. Sequence models delivered the strongest performance: LSTM achieved Accuracy = 0.981 (Macro-F1 = 0.92; ROC–AUC = 0.96), and GRU achieved ROC–AUC = 0.97 with Accuracy = 0.975. Among classical learners, LightGBM reached Accuracy = 0.972 (Macro-F1 = 0.86; ROC–AUC = 0.93). These gains over weaker baselines were statistically significant across folds. Framing PdM as near-term failure classification yields operationally interpretable alerts. Models that explicitly capture temporal dependencies (GRU/LSTM) best track short-horizon failure dynamics, while gradient-boosted trees offer competitive, lightweight alternatives. The benchmark and analysis (including SHAP) provide a reproducible reference for model selection in aviation PdM.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Table 1. Comparative summary of related studies on PdM using the C-MAPSS dataset

Table 2. C-MAPSS dataset features

Figure 1. Representative raw time-series from the NASA C-MAPSS turbofan dataset showing run-to-failure behavior across engine cycles and selected sensor channels (s1–s21); these raw outputs underpin subsequent preprocessing and label/RUL derivation.

Table 3. RUL calculation in the C-MAPSS dataset preprocessing step

Table 4. Dataset after w1 label assignment
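
The label assignment summarized in Tables 3 and 4 can be sketched as follows. This is a minimal illustration, not the author's code: it assumes RUL is the engine's last observed cycle minus the current cycle (run-to-failure training data) and that Class 1 means RUL ≤ w1 = 30, as stated in the abstract and in Table 11. The function name `add_rul_and_label` and the row layout are hypothetical.

```python
from collections import defaultdict

W1 = 30  # failure window in cycles (w1 = 30 in the source)


def add_rul_and_label(rows, w1=W1):
    """Attach RUL and a binary failure label to (engine_id, cycle) rows.

    Assumption: each engine runs to failure, so its last observed cycle
    is the failure cycle and RUL = last_cycle - cycle.
    """
    rows = list(rows)

    # First pass: find the last observed cycle of every engine.
    last_cycle = defaultdict(int)
    for engine_id, cycle in rows:
        last_cycle[engine_id] = max(last_cycle[engine_id], cycle)

    # Second pass: derive RUL and the label "failure within w1 cycles".
    labelled = []
    for engine_id, cycle in rows:
        rul = last_cycle[engine_id] - cycle
        labelled.append(
            {
                "engine_id": engine_id,
                "cycle": cycle,
                "rul": rul,
                "label": int(rul <= w1),
            }
        )
    return labelled
```

With this convention, an engine that fails at cycle 50 is labelled 0 up to cycle 19 (RUL = 31) and 1 from cycle 20 onward (RUL = 30).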

Figure 2. Post-Min–Max normalization view of scaled C-MAPSS sensor time-series (e.g., Engine ID = 1), providing visual context for the binary label “failure within w1 = 30 cycles.”
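
A leakage-aware Min–Max step, as the abstract describes it, can be sketched in plain Python. The key assumption here is that "leakage-aware" means the per-sensor minimum and maximum are estimated on training rows only and then reused unchanged on validation or test rows. The helper names `fit_min_max` and `apply_min_max` are hypothetical and are not taken from the paper.

```python
def fit_min_max(train_rows):
    """Learn (min, max) per column from training rows only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously fitted (min, max) pairs.

    Constant columns (max == min) are mapped to 0.0 to avoid division
    by zero. Values outside the training range are not clipped, so
    held-out rows may fall outside [0, 1].
    """
    scaled = []
    for row in rows:
        scaled.append(
            [
                0.0 if hi == lo else (x - lo) / (hi - lo)
                for x, (lo, hi) in zip(row, params)
            ]
        )
    return scaled
```

Fitting on the training split and calling `apply_min_max` on the test split keeps test-set statistics out of the preprocessing step. In a cross-validated setting the same fit/apply pair would be repeated inside each fold.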

Table 5. Classical and deep learning models for time-series benchmarking. The comparison covers classical ML baselines and deep learning architectures, including sequence models (GRU and LSTM) tailored to the temporal dependencies in the C-MAPSS dataset

Table 6. Comparison of Grid Search and Bayesian optimization results in terms of accuracy and computational time

Table 7. Optimized hyperparameters of machine learning and deep learning models determined via GridSearchCV

Figure 3. SHAP summary plots for (a) Logistic Regression, (b) Decision Tree, (c) Random Forest, (d) SVM, (e) KNN, (f) Naïve Bayes, (g) XGBoost, (h) LightGBM, (i) CatBoost, (j) Gradient Boosting, (k) MLP, (l) GRU, and (m) LSTM.

Table 8. Friedman omnibus tests across $ k=6 $ models ($ n=4 $ blocks = folds/datasets; two-sided $ \alpha =0.05 $). Post-hoc pairwise comparisons (Wilcoxon signed-rank with Holm adjustment) were conducted only when the omnibus test was significant
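
The testing procedure named in Table 8 can be illustrated with a small standard-library sketch. It computes the Friedman chi-square statistic from within-block ranks (without the tie correction) and applies the Holm step-down adjustment to a list of pairwise p-values. The function names are hypothetical. Obtaining the omnibus p-value (chi-square with k − 1 degrees of freedom) and the Wilcoxon signed-rank p-values is left to a statistics library such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`).

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[b][m] is the metric of model m on block (fold/dataset) b.

    Returns the Friedman chi-square statistic, no tie correction.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for block in scores:
        for m, r in enumerate(average_ranks(block)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

In the setting of Table 8, `scores` would have n = 4 rows and k = 6 columns, and `holm_adjust` would only be applied to the pairwise Wilcoxon p-values of metrics whose omnibus test was significant at α = 0.05.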

Table 9. Key significant pairwise differences (top 5 per metric by $ |\Delta| $). $ \Delta $ denotes mean difference (Model A $ - $ Model B)

Table 10. Handling class imbalance: SMOTE versus No-SMOTE on C-MAPSS (failure window $ {w}_1=30 $). Metrics are computed on the fixed test set; SMOTE is applied only on training folds
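
The constraint that oversampling touches only training folds can be sketched as below. This is a simplified, SMOTE-style illustration written with the standard library, not the reference SMOTE implementation and not the author's code. Each synthetic point is a random interpolation between a minority sample and one of its nearest minority neighbours. The names `smote_like` and `oversample_training_fold`, the neighbour count `k`, and the choice to balance the classes exactly are all assumptions.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    if n_new == 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def oversample_training_fold(X_train, y_train, seed=0):
    """Balance classes in a training fold; never call this on test data."""
    minority = [x for x, y in zip(X_train, y_train) if y == 1]
    n_majority = sum(1 for y in y_train if y == 0)
    n_new = max(0, n_majority - len(minority))
    new_points = smote_like(minority, n_new, seed=seed)
    return X_train + new_points, y_train + [1] * len(new_points)
```

Inside cross-validation, `oversample_training_fold` would be called once per fold on that fold's training portion, while the validation portion and the fixed test set stay untouched, matching the protocol stated in the caption.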

Table 11. Per-class metrics and overall summaries on the untouched test set. Class 0: normal; Class 1: failure $ \le 30 $ cycles. Weighted-F1, balanced accuracy, and ROC–AUC are left blank until supports/scores are computed
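
The per-class and summary metrics listed in Table 11 can be computed from true and predicted labels once the supports are known. The sketch below uses the usual definitions (macro-F1 as the unweighted mean of per-class F1, weighted-F1 as the support-weighted mean, balanced accuracy as the mean per-class recall). The function names are hypothetical, and ROC–AUC is omitted because it requires predicted scores rather than hard labels.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and support, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


def summarize(y_true, y_pred, classes=(0, 1)):
    """Per-class metrics plus accuracy, macro-F1, weighted-F1, balanced accuracy."""
    per_class = {c: per_class_metrics(y_true, y_pred, c) for c in classes}
    n = len(y_true)
    return {
        "per_class": per_class,
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / n,
        "macro_f1": sum(m["f1"] for m in per_class.values()) / len(classes),
        "weighted_f1": sum(m["f1"] * m["support"] for m in per_class.values()) / n,
        "balanced_accuracy": sum(m["recall"] for m in per_class.values()) / len(classes),
    }
```

On an imbalanced task such as this one (Class 1 covers only the last 30 cycles of each engine), accuracy and weighted-F1 are dominated by Class 0, which is why macro-F1 and balanced accuracy are worth reporting alongside them.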

Figure 4. Actual versus predicted RUL for each model: (a) Logistic Regression, (b) Decision Tree, (c) Random Forest, (d) SVM, (e) KNN, (f) Naïve Bayes, (g) XGBoost, (h) LightGBM, (i) CatBoost, (j) Gradient Boosting, (k) MLP, (l) GRU, and (m) LSTM.

Table 12. Comparison of the accuracy of ML/DL models from this study with results reported in prior work (Soni et al., 2021; Al Hasib et al., 2023; Sharma et al., 2023; Melkumian, 2024)