Analysis and forecasting of syphilis trends in mainland China based on hybrid time series models

Published online by Cambridge University Press:  27 May 2024

Zhen D. Wang
Affiliation:
School of Public Health, Shandong Second University, Weifang, China
Chun X. Yang
Affiliation:
School of Public Health, Shandong Second University, Weifang, China
Sheng K. Zhang
Affiliation:
School of Basic Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
Yong B. Wang
Affiliation:
School of Public Health, Xinxiang Medical University, Xinxiang, China
Zhen Xu*
Affiliation:
National Key Laboratory of Intelligent Tracking and Forecasting For Infectious Diseases, Chinese Center for Disease Control and Prevention, Beijing, China
Zi J. Feng*
Affiliation:
Chinese Preventive Medicine Association, Beijing, China
*
Corresponding authors: Zhen Xu and Zi J. Feng; Emails: xuzhen@chinacdc.cn; fengzj@chinacdc.cn

Abstract

Syphilis remains a serious public health problem in mainland China that requires attention. Modelling to describe and predict its prevalence patterns can help the government develop more scientific interventions. The seasonal autoregressive integrated moving average (SARIMA) model, the long short-term memory network (LSTM) model, the hybrid SARIMA-LSTM model, and the hybrid SARIMA-nonlinear autoregressive model with exogenous inputs (SARIMA-NARX) were each used to simulate the time series of syphilis incidence from January 2004 to November 2023. Compared with the SARIMA, LSTM, and SARIMA-LSTM models, the median absolute deviation (MAD) of the SARIMA-NARX model decreases by 352.69%, 4.98%, and 3.73%, respectively. The mean absolute percentage error (MAPE) decreases by 73.7%, 23.46%, and 13.06%, respectively. The root mean square error (RMSE) decreases by 68.02%, 26.68%, and 23.78%, respectively. The mean absolute error (MAE) decreases by 70.90%, 23.00%, and 21.80%, respectively. The hybrid SARIMA-NARX and SARIMA-LSTM methods predict syphilis cases more accurately than the basic SARIMA and LSTM methods, so they can be used by governments to develop long-term syphilis prevention and control programs. In addition, the predicted number of cases remains at a fairly high level, so there is an urgent need to develop more comprehensive prevention strategies.
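
The four error measures named in the abstract, and the relative decrease used to compare models, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and MAD is assumed here to be the median absolute deviation of the forecast errors about their median, which may differ from the definition used in the paper.

```python
import math
from statistics import median


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mad(actual, predicted):
    """Median absolute deviation of the errors (assumed definition)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    centre = median(errors)
    return median(abs(e - centre) for e in errors)


def percent_decrease(baseline, improved):
    """Relative decrease of an error measure from a baseline model, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative numbers only, not data from the study.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 380]
print(mae(actual, predicted), rmse(actual, predicted))
print(mape(actual, predicted), mad(actual, predicted))
```

With measures like these, a statement such as "RMSE decreases by 68.02%" would correspond to `percent_decrease(rmse_sarima, rmse_sarima_narx)`, where both arguments are hypothetical variables holding the two models' RMSE values on the same data.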

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figure 1. The cell structure of the LSTM network. Note: The arrows indicate the data flow, where x, s, c, f, i, g, and o denote the input, output, cell state, forget gate, input gate, cell candidate, and output gate at time step t, respectively. σ and tanh denote the sigmoid activation function and the hyperbolic tangent function, which map the data to (0,1) and (−1,1), respectively. $ \otimes $ and $ \oplus $ are vector operators that represent element-wise multiplication and element-wise addition, respectively.
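
The data flow described in this caption corresponds to the standard LSTM update, which can be sketched for a single-unit (scalar) cell as below. This is an illustrative sketch under stated assumptions, not the network used in the study: the function name `lstm_cell_step` and the weight dictionaries `w`, `r`, and `b` are hypothetical, and a real network would use weight matrices and vectors rather than scalars.

```python
import math


def sigmoid(z):
    """Sigmoid activation, mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t, s_prev, c_prev, w, r, b):
    """One time step of a scalar LSTM cell.

    x_t    : input at time step t
    s_prev : output (hidden state) from time step t-1
    c_prev : cell state from time step t-1
    w, r, b: dicts keyed by 'f', 'i', 'g', 'o' holding the input weight,
             recurrent weight, and bias of each gate (hypothetical layout)
    """
    pre = {k: w[k] * x_t + r[k] * s_prev + b[k] for k in "figo"}
    f_t = sigmoid(pre["f"])    # forget gate
    i_t = sigmoid(pre["i"])    # input gate
    g_t = math.tanh(pre["g"])  # cell candidate, in (-1, 1)
    o_t = sigmoid(pre["o"])    # output gate
    # Element-wise multiplication and addition reduce to * and + for scalars.
    c_t = f_t * c_prev + i_t * g_t
    s_t = o_t * math.tanh(c_t)
    return s_t, c_t


# Illustrative call with arbitrary parameters.
params = {k: 0.5 for k in "figo"}
s_t, c_t = lstm_cell_step(1.0, 0.0, 0.0, params, params, params)
print(s_t, c_t)
```

The forget gate scales the previous cell state, the input gate scales the new candidate, and the output gate controls how much of the squashed cell state is emitted as the output s.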

Figure 2. Monthly reported cases of syphilis from January 2004 to November 2023 and the decomposition of the time series (TS) data. Note: In (a), the blue curve depicts the monthly reported cases of syphilis, while the red curve illustrates the long-term trend. In (b), the blue curve represents the stable seasonal component with a periodicity of 12 months, and the yellow dashed curve shows the time series after removal of the seasonal component.

Figure 3. Normality and autocorrelation diagnostics of the SARIMA model residuals. Note: (a) Histogram of the frequency distribution of the standardized residuals. (b) QQ plot of the SARIMA model residuals; the red dashed line represents the standard normal distribution. (c,d) The ACF and PACF of the residuals, respectively. The stem plots show the ACF and PACF values at different lags, and the blue lines indicate the ±2 standard deviation interval.

Table 1. Evaluation of goodness-of-fit of SARIMA, LSTM, SARIMA-LSTM, and SARIMA-NARX models

Figure 4. Fitting and prediction performance of the LSTM (a), SARIMA-LSTM (b), and SARIMA-NARX (c) models with different structures. Note: The dark blue and grey bars represent the RMSE values of the training and test sets, respectively.

Figure 5. The simulation process for the training set by the SARIMA-NARX model. Note: (a) The variation of MSE for the training, validation, and test sets during the iteration process. (b) The error between the output values and the target values for each component of the data; (c) shows the error magnitude in detail. The blue, yellow, and red dots indicate the target values of the training, validation, and test sets after simulation using the SARIMA-NARX model; the blue, yellow, and red crosses denote the corresponding outputs; and the yellow stems denote the fitting error.

Figure 6. Fitting and forecasting performance of the SARIMA, LSTM, SARIMA-LSTM, and SARIMA-NARX models. Note: Panels (a), (c), (e), and (g) show the fitting and prediction results of the SARIMA, LSTM, SARIMA-LSTM, and SARIMA-NARX models, respectively; the red and yellow curves represent the simulated values for the training and test sets of the TS, respectively. Panels (b), (d), (f), and (h) show the residuals of the SARIMA, LSTM, SARIMA-LSTM, and SARIMA-NARX models, respectively; the blue and yellow stems represent the residuals for the training and test sets, respectively.

Figure 7. Prediction results from December 2023 to November 2025 of the SARIMA-LSTM and SARIMA-NARX models. Note: The light gray and blue areas represent the forecast intervals of SARIMA-NARX and SARIMA-LSTM, respectively. The blue, red, and yellow curves represent the original data and the fitted values of the SARIMA-LSTM and SARIMA-NARX models, respectively, and the red and yellow dashed lines represent the predicted values of the two models.