Modeling of COVID-19 Outbreak Indicators in China Between January and June

Objectives: The objective of this study is to compare the various nonlinear and time series models in describing the course of the coronavirus disease 2019 (COVID-19) outbreak in China. To this aim, we focus on 2 indicators: the number of total cases diagnosed with the disease, and the death toll. Methods: The data used for this study are based on the reports of China between January 22 and June 18, 2020. We used nonlinear growth curves and some time series models for prediction of the number of total cases and total deaths. The determination coefficient (R2), mean square error (MSE), and Bayesian Information Criterion (BIC) were used to select the best model. Results: Our results show that while the Sloboda and ARIMA (0,2,1) models are the most convenient models that elucidate the cumulative number of cases; the Lundqvist-Korf model and Holt linear trend exponential smoothing model are the most suitable models for analyzing the cumulative number of deaths. Our time series models forecast that on 19 July, the number of total cases and total deaths will be 85,589 and 4639, respectively. Conclusion: The results of this study will be of great importance when it comes to modeling outbreak indicators for other countries. This information will enable governments to implement suitable measures for subsequent similar situations.

T he novel coronavirus disease 2019 (2019-nCoV, or COVID-19) epidemic first broke out in Wuhan, China. 1,2 The virus was identified in the second half of December 2019. 3 The epidemiological features of the disease are still unknown, and the number of total cases and deaths varies daily. In the wake of its rapid spread and reports revealing the crucial consequences of this spread, countries adopted strict measures to tackle the disease. However, confirmed positive cases were recorded after the second half of January 2020. Mathematical models used to identify the quantitative description of the outbreak of COVID-19 in this study may provide significant insight into the cessation of the spread of the novel coronavirus. 4 Various indicators are used in the models to describe the course of the outbreak. Among these indicators, the total number of confirmed cases and the total number of deaths are the most commonly used ones.
The objective of this study is to compare the various nonlinear and time series models in describing the course of the COVID-19 outbreak in China. To present this, we focused on 2 indicators: the number of total cases diagnosed with the disease and the number of deaths.

METHODS Data Management
We obtained daily updates of the cumulative number of reported confirmed cases and deaths for the 2019-nCoV pandemic of China between January 22 and June 18, 2020, from Worldometer and WHO websites. 5, 6 In this study, we focus on China. Because it is not only the country where the novel coronavirus emerged and has spread throughout the world but is also the country that has been fighting against the coronavirus for the longest time. Also, due to unpreparedness for the outbreak, the studied period can be observed as a sample of natural course, especially in the first month.

Models for Describing the Course of the Outbreak
The models that we apply for the abovementioned indicators can be categorized into 2 categories: (1) nonlinear growth curves including the Weibull, negative exponential, Von Bertalanffy, Janoscheck, Lundqvist-Korf, and Sloboda models (Table 1) [7][8][9][10][11][12][13] ; and (2) time series models including Box-Jenkins and exponential smoothing methods (Tables 2 and 3). Y t is the observed dependent variable named the number of total cases and the number of total deaths, and t is the independent variable named as time (Table 1). In our models, t is the day. The A term is the asymptotic limit of the number of total cases and the number of total deaths as time goes to infinity; B is the proportion of the number of total cases to the number of total deaths. The k term is the proportion of the maximum increase rate to the highest number of cases or deaths. γ, c, and d are the changing points that occur when the change in the estimated increase rate goes from increase to decrease. AR (p) is the p th degree of autoregressive series. 14 MA (q) refers to the moving average model of order q. In this series, ε_t~WN(0,σ 2 ) is the white noise series. 15 The ARMA(p, q) model is expressed by both AR (p) and MA (q) processes. 16 In the Holt method, L t is the new smoothed value, α is the smoothing coefficient, (0 < α < 1), Y t is the actual value at t th period, β is the smoothing coefficient for trend estimation, (0 < β < 1), T t is the trend predicted value, p is the number of forecasting periods and y tþp is the forecasting value after p period. 17 In the damped trend method, if 0 < φ < 1, the trend is damped, if φ = 1, the equations become identical to Holt's linear trend method. Tashman and Kruk (1996) suggested that there may be value in allocating φ > 1, if applied in a series with a strong tendency, with exponential trend. 18 The Brown's single parameter linear exponential smoothing model is more suitable if there is an increasing or decreasing trend in the time series. In this model, the initial equations y 1 t and y 2 t are obtained by single exponential smoothing and double exponential smoothing, respectively. 19 For the estimation of post m process, the equation is given below. 20 In exponential smoothing methods, the estimations are constantly updated, taking into account recent changes in data. 21 In these methods, the weighted average of past period values is calculated and taken as the estimated value of future periods.
Estimation accuracy of the applied methods were evaluated with BIC, R 2 and MSE. BIC was developed by Gideon E. Schwarz (1978), who gave a Bayesian argument for adopting it. 22 BIC ¼ lnσ 2 e À Á þ kln n ð Þ/n Whereσ 2 e is the error variance

Results of Nonlinear Growth Models for the Number of Total Cases
The parameters estimated and goodness of fit measures of the nonlinear models between January 22 and June 18, 2020, in China were presented in Table 4. R 2 and MSE statistics were used to compare models. The R 2 and MSE values of the Weibull and Janoscheck models were equal. The MSE of the Sloboda model was slightly smaller than the Weibull and Janoscheck models, but R 2 was equal. The Sloboda model can be considered the most suitable model, as it has a smaller MSE value and a larger pseudo R 2 value. The Weibull and Janoscheck models could also be chosen as alternative models.
Autoregressive moving averages model (ARMA(p,q)) Brown's single parameter linear exponential smoothing model

Modeling of COVID-19 Outbreak Indicators in China
The curve for prediction of nonlinear growth models are given in Figure 1.

Results of Time Series Models for the Number of Total Cases
Box-Jenkins and exponential smoothing methods were chosen from the various time series models available for the total number of cases. Autocorrelation (ACF) and partial autocorrelation (PACF) graphs of the series were examined. When the ACF and PACF graphs in Figure 2 were examined, the first degree difference was taken because the series was not stationary at that level. But the stationary assumption had not been provided yet. The difference from the second degree was taken and the series became stationary. According to the ACF and PACF charts, the series quickly approached zero after the first delay in the ACF graph. In this case, because p = 0, d = 2, and q = 1, it was modeled by the integrated first degree moving averages method. In other words, the most suitable time series method was the ARIMA(0,2,1) model. In addition, exponential smoothing methods were used and the model performances were given in Table 5.
The performance of the ARIMA(0,2,1) model is given in Table 6, and it is observed that this model's fits are successful as nonlinear models.
The parameters estimated of the ARIMA(0,2,1) model are given in Table 7.
The ARIMA(0,2,1) model was found to be the most appropriate among different time series models. This model can be written as follows: Forecasting data for future 30 d are given in Table 8.
The number of total cases continues increasingly, albeit at a low speed. The number of total cases is predicted to be 85,589 on July 18, 2020 (Table 8). Observed and predicted values of the total cases are given in Figure 3.

Results of Nonlinear Growth Models for the Number of Total Deaths
The parameters estimated and goodness of fit measures of the nonlinear models for the number of total deaths are presented in Table 9. The most suitable models for predicting the number of total deaths are the Lundqvist-Korf and Sloboda models, respectively ( Table 9). The R 2 values of these models were found to be the highest at 0.963 and also the MSE values of these were lower than the others. The Lundqvist-Korf model

Modeling of COVID-19 Outbreak Indicators in China
can be considered the most suitable one, because mean square error (MSE) is smaller than other models. The curve for prediction of nonlinear models are given in Figure 4.

Results of Time Series Models for the Number of Total Deaths
The most suitable time series model was found to be the Brown linear trend exponential smoothing model among time series models for the number of deaths. The goodness of fit of the various models are given in Table 10, and it was observed that the predictions are as successful as nonlinear models.
The parameters estimated of the Brown linear trend exponential smoothing model are presented in Table 11. The observed and predicted values are given in Figure 5.
The forecasts of the number of total deaths using Holt's linear trend exponential smoothing model for 30 d are given in Table 12. The rate of increase in the number of deaths in China was decreasing, and it was predicted that the number will be between 3343 and 3355 in the period between June 19 and July 18, with a slight increase (Table 12). The Holt linear trend exponential smoothing curve for the exponential smoothing model is given in Figure 5.

DISCUSSION
In this study, we found that the Sloboda model for the number of total cases and the Lundqvist-Korf model for the number of total deaths were the best explanatory models among the nonlinear models used in the study. Also, the ARIMA(0,2,1) model for the number of cases and the Brown linear trend exponential smoothing model for the number of deaths were the most suitable models among the time series models used in the study.
In a different study, the ARIMA model was used on the daily prevalence data of COVID-2019 from January 20, 2020, to February 10, 2020, and the ARIMA(1,2,0) and ARIMA(1,0,4) models were obtained. 23 The logistics, Bertalanffy, and Gompertz models were previously used to estimate the number of cases and deaths from COVID-19 in different regions in China by Jia et al. (2020). 24 In their study, the     Observed and Predicted Values for the Number of Total Cases by ARIMA(0,2,1).
Logistics model was reported to be better than the others. They conducted an extensive research with quasi-experimental analysis methods in various provinces in China and investigated the relationship between population and the number of outbreak cases. Accordingly, they found that the correlation coefficients of the relationship between the population and the number of cases differed by regions. They observed that the number of cases was higher in regions with high populations and that there was a high correlation between them. They concluded factors such as immigration, tourism, and mobility play an important role in this situation. The authors also determined the number of cases using the epidemic growth model. 24,25 On the other hand, Roosa et al. (2020) analyzed the number of cases in some regions of China using the generalized logistic growth model (GLM), the Richards Model and the sub-epidemic model for a short time period (10 d). They found that the number of confirmed cases will continue to increase. They estimated that the predicted case increase (GLM) in the Guangdong and Zhejiang regions would be lower by using the Richards models and that it would be higher using the sub-epidemic model. 26 In a study on the risk of infection when COVID-19 was detected in a cruise ship in China in February 2020, it was noted that the risk of infection among people who have close contact was higher than those who maintained a social distance from others. The estimated number of cases was obtained by the back-calculation method. 27  , and R2. They found that the best method for modeling and estimating the number of total cases was the FPASSA-ANFIS method. 28 Kuniya (2020) estimated the outbreak peak of coronavirus disease in Japan using the susceptible-exposed-infected-removed (SEIR) compartmental model. 29 In another study, the reproduction number of the Wuhan novel coronavirus 2019-nCoV was estimated using the SEIR compartment model. 30 There are many studies on coronavirus disease by various researchers using different statistical methods. Among these studies, the following are highlighted: Yuan

CONCLUSIONS
While some models are simple and give general results, some are complex and provide detailed information, but their results cannot be generalized. 4 Models that were used in the initial phase of the outbreak can be misleading because of a lack of sufficient data. Therefore, short-term predictions should be made for the early stages of the epidemic, and the effects of any measures taken in this process must be taken into consideration by virtue of their results. As for the further stages of pandemics, different models can be used to understand biological systems and to develop models which can be used for the simulation for future similar situations. However, although model assumptions are mostly incompatible with real-world problems, they can capture general behavior and predict the rate of the spread of the outbreak. If large-scale behaviors of a system are correctly identified, certain details can be understood in terms of their impact on these behaviors. Statistical or data-based models that fit curves of the past temporal prevalence of a disease, do not make any assumptions about the internal mechanisms that a mathematical model provides and, hence, have become more popular in infectious diseases. Because the major use of these models is to fit past data and estimate the future, it can also be used for different patterns of the epidemic.
As a result of the literature review, it was observed that the Sloboda model and Lundqvist-Korf model, which gave the best results among the nonlinear models used in this study, have never been used for modeling COVID-19 outbreak indicators before. Our most recent forecasts remained relatively stable. This reflects the impact of the measures implemented by the China government, which likely helped to stabilize the pandemic. The forecasts presented here are based on the assumption that current mitigation efforts will continue. In addition, comparing with other modeling studies on COVID-19, results were obtained for longer periods. Therefore, the results in this study are more favorable in terms of comprehending the biological structure of the outbreak and producing preliminary information for possible similar conditions in the future.