Estimating lengths-of-stay of hospitalised COVID-19 patients using a non-parametric model: a case study in Galicia (Spain)

Estimating the lengths-of-stay (LoS) of hospitalised COVID-19 patients is key for predicting the hospital beds’ demand and planning mitigation strategies, as overwhelming the healthcare systems has critical consequences for disease mortality. However, accurately mapping the time-to-event of hospital outcomes, such as the LoS in the intensive care unit (ICU), requires understanding patient trajectories while adjusting for covariates and observation bias, such as incomplete data. Standard methods, such as the Kaplan-Meier estimator, require prior assumptions that are untenable given current knowledge. Using real-time surveillance data from the first weeks of the COVID-19 epidemic in Galicia (Spain), we aimed to model the time-to-event and event probabilities of patients’ hospitalised, without parametric priors and adjusting for individual covariates. We applied a non-parametric mixture cure model and compared its performance in estimating hospital ward (HW)/ICU LoS to the performances of commonly used methods to estimate survival. We showed that the proposed model outperformed standard approaches, providing more accurate ICU and HW LoS estimates. Finally, we applied our model estimates to simulate COVID-19 hospital demand using a Monte Carlo algorithm. We provided evidence that adjusting for sex, generally overlooked in prediction models, together with age is key for accurately forecasting HW and ICU occupancy, as well as discharge or death outcomes.


Introduction
As of January 2021, SARS-CoV-2 transmission continues to increase in most countries worldwide [1], and in those countries where control has been achieved, resurgences are expected [2] before effective vaccines are widely available.Within the main challenges of the pandemic, overwhelming the healthcare systems has critical consequences on disease mortality [3].Thus, understanding and predicting inpatient lengths-of-stay (LoS) and critical-care demand remain some of the major components of outbreak monitoring for decision-making and contingency planning.
Predicting hospital demand entails estimating a patient's LoS and the probability of hospital outcomes such as requiring admission to the intensive care unit (ICU).Estimation of these variables is challenging as it requires investigating the patients' trajectories, and it must account for complexities in the data.For example, the LoS of some inpatients may be censored because the study ends before the patient leaves the hospital facility.The LoS of COVID-19 patients has been studied using parametric models [4], semi-parametric methods [5] and non-parametric estimators [3,6].
Parametric and semi-parametric approaches are often preferred due to their simplicity and ease of interpretation, but they require the LoS to conform to a predefined fixed model.Estimations based on non-validated assumptions can be significantly biased.Thus, nonparametric approaches, which do not require model assumptions, should be used when estimating COVID-19 LoS in the absence of solid knowledge.
The Kaplan-Meier (KM) estimator [7] is the simplest and most frequent non-parametric estimator in medical time-to-event data.It assumes that all patients with missing outcomes would experience the event in the end.This assumption applies when analysing the duration of hospitalisation, that is, the total time in the institution of the hospital (which includes time in hospital ward (HW) and time in ICU), as all patients leave the hospital eventually.However, this assumption does not apply to a patient's LoS in the HW until some of the potential outcomes, such as admission to the ICU or until death, as not all patients need admission to the ICU or die.Thus, the KM estimator should not be used to estimate those LoS, as it is wrongly specified.Alternatively, mixture cure models (MCMs) [8] account for the situations when a proportion of individuals will not experience the event being analysed.
Here, we propose a non-parametric MCM (NP-MCM) for estimating the LoS until specific events that are not experienced by all the patients.Specifically, we computed the following five LoS: LoS in HW until admission to ICU, LoS in HW until discharge from HW, LoS in HW until death in HW, LoS in ICU until discharge from ICU and LoS in ICU until death in ICU.Note that the KM estimator would not be biased to model the LoS until discharge if all status on discharge were gathered as a composite outcome.Although only the first LoS is necessary to model ICU demand, the other LoS are also of interest, as they are useful to estimate the conditional probability that a patient experiences each of those events according to the corresponding observed LoS.Finally, we also estimated the probability of each event.
First, to illustrate how our model improves data fitting, we compared the NP-MCM to the KM estimator (which assumes that all the individuals will experience the event) and to the empirical (E) estimator (which discards all observations which event is not observed) for a dataset of COVID-19 patients from the first weeks of the epidemic in Spain.We further simulated inpatient and critical care incidence during an outbreak, along with the final outcome (discharge or death), using the estimated values and adjusting for age and sex.Our model shows the importance of these individual variables for predicting hospital demand during transmission.

Data source
The dataset contains 10 454 confirmed COVID-19 cases reported in Galicia (North-West Spain), from March 6th to May 7th 2020.Since not all of them required hospitalisation, our study only included the 2453 patients who were admitted in hospital/ICU during that period.For the patients with several HW and/or ICU admissions, the considered LoS was the first recorded one, that is, the number of days from their first entrance in HW/ ICU until they experienced the outcome of interest.Data were provided by the regional public health authority, Dirección Xeral de Saúde Pública [9].The data included information on age and sex; the dates of COVID-19 diagnosis, admission to the hospital and/or ICU and the patient's last known clinical status.A summary of the dataset can be found in Supplementary material Section S1, see also [10] for other results on this dataset.

Model formulation
MCMs [8], a special case of cure models [11], explicitly model survival as a mixture of two types of patients: those who will experience the final outcome and those who will not (known as 'cured').Note that here a 'cured' individual is defined as being free of experiencing the event of interest, not necessarily cured in medical terms.The goal of MCM is to estimate the probability of experiencing the event and the distribution of the time to the event.The model is formulated as follows.
Let us denote Y as the time to the event of interest (admission to ICU, death or discharge), with survival function S(t) = P(Y > t).Let p = P(Y < ∞) be the probability that the event will happen, and S 0 (t) = P(Y > t|Y < ∞) be the survival function of the individuals experiencing the event.MCM write the survival function as S(t) = (1 − p) + pS 0 (t).Then the probability of the event, p, and the survival function of the time-to-event, S 0 (t), can be estimated using a proper estimator of the survival function, S(t), and the relations: When there is a group of patients known not to experience the event, the survival function S(t) can be estimated nonparametrically as follows [12]: where t i is the observed time-to-event of all the patients, d i indicates if the event was observed and x i is the indicator of whether the event was not observed because it is known it will never happen.To note, this estimator reduces to the well-known KM estimator in a classical time-to-event analysis when the event happens for all patients or when there is no way to identify the patients who will not experience the event ever.
The estimator of S(t) in ( 2) is computed with R software [13] and used to estimate the probability, p, of the event and the time-to-event survival function S 0 (t) using the relationships in (1) for the five LoS aforementioned in the Introduction.Details on each LoS, along with an R script for the computation of the different estimators, can be found in the Supplementary material.
The NP-MCM survival estimator of S 0 (t) is compared to the KM estimator computed with two different datasets: (i) the complete set of observations, considering all the patients who are known not experience the event as simply right censored regardless if they might experience it in the future or not (complete KM), and (ii) a reduced dataset, dismissing the patients who will not ever experience the event (reduced KM).The empirical (E) estimator, which considers only patients whose final event is observed and disregards the right censored observations, has also been considered.The NP-MCM estimator of the probability, p, of the event was computed using the estimator of S(t) in ( 2) and the relationships in (1).The E estimator of p, given by the ratio between the number of observed events and the total number of patients, was computed to motivate the proposed NP-MCM estimator of p (see Supplementary material Section S2 for details).
The NP-MCM estimator of S(t) in ( 2), the E estimator and the KM estimator do not incorporate possible covariate effects, such as sex and age.When the final outcome is experienced by all the patients, the extension of the KM estimator to handle covariates is the generalised product-limit estimator [14] of the conditional survival function, S(t|x).When the final outcome is not experienced by all the patients ('cured' individuals, all unidentified) the incorporation of covariates in the estimation of the probability of the final outcome p(x) and the distribution of the times until the event S 0 (t|x) has been studied recently [15][16][17] and implemented in the R package npcure [18], which also performs significance tests for the probability of the event p(x).When the final outcome is not experienced by all the patients ('cured' individuals, some of them identified as it happens for our COVID-19 data), the extension of these methods for the NP-MCM model to estimate p(x) and S 0 (t|x) has been recently addressed [12], where evidence of the superiority of the NP-MCM over the traditional methods is shown.These conditional estimators of S(t|x) [14], p(x) and S 0 (t|x) [12,[15][16][17] can handle continuous covariates such as age, using the information from all the individuals to provide estimates for one single value, e.g.40 years.Ignoring the effect of age and sex on these estimates can produce important bias in the statistical analysis.

COVID-19 outbreak simulation model
We further simulated a COVID-19 outbreak based on the NP-MCM estimates of the five LoS considered, with two different models: (1) the simplest possible where the distributions of the LoS and probabilities of moving from one state (HW, ICU) to another (HW, ICU, death, discharge) do not depend on individual covariates; and (2) a more realistic one with the LoS and transition probabilities depending on the age and sex.For the ease of computation, the simulated LoS were not generated directly from the NP-MCM estimates but from the parametric distribution that best fitted the NP-MCM estimates, specifically, Weibull distributions (see Supplementary Fig. S4 for the NP-MCM estimates and their corresponding Weibull counterparts).
The simulated outbreak consisted of N = 1000 infected individuals.For the i-th infected individual i = 1, …, N we simulated the sex G i (0 = male, 1 = female) and the age A i (years) using the real distributions of the reported COVID-19 cases in Galicia on May 7th 2020 (Supplementary Table S1).As not all the infected individuals required hospitalisation, let H ⊂ {1, …, N} be the infected subjects admitted to the hospital.The trajectory of every hospitalised patient i ∈ H was obtained by simulating the transitions between states (HW, ICU, discharge, death) using the NP-MCM estimated probabilities.The times in each state were simulated from the Weibull distributions that best fitted the NP-MCM estimates, both conditional and unconditional on the age and sex of the patient (see Supplementary material Section S4 for further details; Supplementary Figs S4; S5 and S2, Table S3).From the evolution of all the hospitalised patients, it is straightforward to compute the number of patients in every state.We simulated 1000 outbreaks of N = 1000 infected people, so the mean number of patients in a HW, in the ICU, dead and discharged can be approximated by a Monte Carlo simulation for each day as a function of time.For supporting the goodness of fit of the conditional model that considers the age and sex of the patient, the real number of inpatients in the COVID-19 dataset has been taken as reference, rescaled to N = 1000 infected people.

Results
We first compared the estimates of the LoS using the NP-MCM estimator with the E estimator, and the KM estimator with the complete and reduced dataset.
When an event happens for all patients (a.s.'leave the hospital', when all status on discharge gathered as a composite outcome), KM is not biased and coincides with the NP-MCM, both of them represented with the one single line in Figure 1 (top left) and Supplementary Figure S2 (top left).The NP-MCM and KM estimators consider the n = 2453 hospitalised patients of which 2142 experienced the event (they left the hospital within the study's timeframe).The E estimator considers only the 2142 patients who left the hospital, disregarding the information from the 311 patients still in hospital.This biases the E estimate towards shorter LoS, as hospitalised patients with longer LoS are not been included in the estimation.
Furthermore, when the final outcome is experienced by only a proportion of patients ('admission to ICU', 'death', 'discharge'), the KM (with both the complete and reduced samples) overestimates the time-to-event showing longer LoS than the NP-MCM.The E estimator underestimates the time-to-event due to right censoring, showing shorter values of LoS as it only takes into account patients who experienced the event.The NP-MCM estimates do not suffer from a similar bias [12].In fact, this is one of the advantages of using these methods.Interestingly, we found small differences between the NP-MCM estimates and the E estimates (Fig. 1 and Supplementary Fig. S2).The reason might be that the distribution of the LoS of the dismissed patients in the E estimation (never admitted to ICU) is similar to that of the patients who required ICU.
Importantly, for the probability of the medical event (admission from HW to ICU, death, discharge from HW or ICU) we showed that not correcting for right censoring (i.e. using the E estimator with only individuals with the observed outcome) underestimates the true probability, as the event of the rightcensored individuals could be recorded later in time.The NP-MCM can adjust to right censoring, providing more accurate estimates.This can be seen when comparing individual probabilities using NP-MCM and E estimators (Table 1).Note that in Table 1, the probabilities of mutually exclusive outcomes are not equal to 1.This is because the final outcome (death or discharge) of 42 inpatients still in ICU at the end of the study remains unknown, as detailed in Supplementary material Sections S2.4 and S2.5.This inconsistency of the E estimates is partially corrected by the NP-MCM estimate.
Then, we used the NP-MCM estimator to assess if age and sex could play a role in the estimates of the time of hospitalisation (both HW and ICU) and the time in ICU. Figure 2 shows that the LoS differ significantly between male and female patients, and between middle-aged (40 years) and older (70 years) patients.Particularly, we found that middle-aged female patients showed shorter LoS in both the institution of hospital and the ICU, whereas older females showed longer LoS in the ICU (but not in the hospital) compared to their male counterparts.
Finally, we implemented a COVID-19 outbreak simulation of N = 1000 infected individuals, using the NP-MCM estimates for the COVID-19 patients in Galicia (Spain) and accounting for age and sex heterogeneity in the LoS. Figure 3 shows the difference in the simulated number of inpatients between considering age and sex (conditional model) or not considering age and sex of the patient (unconditional model).The higher the curve the more inpatients the model predicts.The real number of inpatients in the COVID-19 dataset, rescaled to N = 1000 infected people, has been added as the reference.We found no large differences in the expected number of patients dead or discharged, regardless of age and sex are considered or not.Similarly, no large differences were found in the number of patients in ICU during the first month day 30 approximately) of the epidemic.However, after 1 month, the unconditional model tends to underestimate the number of patients in ICU, whereas the conditional estimation is closer to the real number of ICU inpatients in the COVID-19 dataset.The findings for the estimated number of patients in the HW are similar for a period of 2 months.As a consequence, if the prediction of HW and ICU beds' demand is estimated disregarding the sex and age of the patients, those predictions will be clearly underestimated, mainly in the case of ICU capacity.The consequences of this wrong forecast of HW and ICU occupancy are shown in Figure 4.For a range of possible capacities (15-90 beds in HW, 5-15 beds in ICU), Figure 4 shows the number of days in which the predicted number of patients exceeds the capacity.As expected, there is a decreasing trend, since the lower capacity the more days with excess demand.
If age and sex are disregarded for making predictions (unconditional model), Figure 4 (right) suggests that 11 ICU beds are enough to avoid overload in the ICU.However, that ICU capacity will be exceeded for 18 days, and the available ICU beds should be set to 12 instead to prevent overwhelming of the ICU.Similar conclusions can be drawn when predicting HW beds demand in Figure 4 (left).These discrepancies, simulated for N = 1000 infected people, would worsen as the incidence increases.In Epidemiology and Infection 5 summary, although no large differences are observed in predicted number of deaths and discharges, the conditional model gives more accurate estimates of the HW and ICU beds' demand.This leads to a reduction in the number of days when the number of inpatients exceeds the HW and ICU capacity.

Discussion
We applied a NP-MCM to estimate the time-to-event and event probabilities, including LoS in HW and ICU and time to death or discharge.In this study, we demonstrate how the LoS of hospitalised COVID-19 patients evolve over time, given age and sex distributions matching those from our database.The proposed model outperformed the KM and the empirical estimators when the outcome is not experienced by all patients.Importantly, the model can be adjusted for the use of covariates, which is significant when conditioning for known heterogeneity in estimating LoS.Particularly, our analysis demonstrates that adjusting for age and sex is crucial in accurately understanding ICU LoS and, in turn, forecasting bed demand.
Often studies with incomplete follow-up data on patients choose to exclude these patients from the study altogether, which yields biased estimates [19].Moreover, when forecasting hospital demand in (near) real time, information related to the most recent cases is not available, which again leads to right censored data.We showed that the empirical estimator introduced significant bias towards longer LoS from HW admission to ICU admission because it ignores patients in the HW without ICU admission.Alternately, the KM estimator yields biased estimates towards longer stays as well when the event is not experienced by all patients.The reason is that the KM estimator assumes that if the follow-up time was long enough, much longer stays would be observed.Therefore, by comparing the NP-MCM against the KM estimate, we show how biased the results are when using the KM estimate.This comparison would support the use of the non-parametric cure model approach, which diminishes the problem that occurs when not all subjects experience the event.
Our findings are consistent with previous studies: a recent systematic review has shown that median overall hospital stays ranged from 4 to 21 days outside of China [20], whereas our model estimated a median overall hospital stay of 11 days (IQR: 7-19); the LoS for patients who died in the HW was generally shorter than those discharged alive (median 7 and 10 days, respectively).In contrast, our estimates show a different trend with regards to ICU LoS, with similar median estimates for both death and discharged (15 vs. 14 days), again consistent with that reviewed by Rees et al. [20].Of note, to our knowledge only two studies have adjusted LoS by age, all showing increased LoS for increased age, which is consistent with our findings [4,21].Furthermore, as far as we know this is the first study showing the influence of sex in the LoS, which has important implications for predicting hospital demand (Fig. 2).With regards to prediction models, some approaches adjust estimates based on age [22,23], whereas sex has generally been overlooked in hospital demand forecasting [22,24,25].
Noteworthily, multi-state models [26][27][28][29] could seem an alternative method.Yet, multi-state cure models, in which transitions into one or more of the states cannot occur for a fraction of the population, are quite recent and the scarce literature is limited to semi and parametric models [30,31].Applying a nonparametric multi-state cure model is not straightforward, since there is no available literature related to this model.As a consequence, multi-state cure models were not used in this paper, but remain as a potential alternative approach.
Finally, we highlight key limitations of our model: the lack of a parametric function limits interpretability to a great extent and complicates handling several covariates simultaneously [32].Regarding the application of MCM, there must be good evidence that some individuals in the population will never experience the event of interest and the follow-up time must be long enough [33].To date, there has not been developed a reliable method for computing uncertainty (confidence bands) using MCM estimators, which remains a limitation of our approach.Finally, data on patient comorbidities, which likely represent an important source of heterogeneity in the LoS, were not available for the analysis.Thus, more accurate estimates of the different LoS can be obtained if more complete datasets are available.
In summary, we implemented an NP-MCM that improved the standard survival methodology when estimating LoS until final outcomes that will not happen for all patients.We also found that the LoS in the ICU is sensitive to age and sex, which in turn is relevant when forecasting hospital demand in real time for public health response.We believe our proposed approach can be easily implemented in other settings and can provide more accurate estimates of COVID-19 health demand compared to previous methods.
Supplementary material.The supplementary material for this article can be found at https://doi.org/10.1017/S0950268821000959.

Fig. 1 .
Fig. 1.Estimates of the survival function of LoS using NP-MCM (thick black line), KM with the complete dataset (thin light grey line), KM with the reduced dataset (thin dark grey line) and the E estimator (red line) for all the COVID-19 hospitalised cases (n = 2453) in Galicia (Spain), when the LoS is the time of hospitalisation both in HW and ICU (top left), time in HW until admission to ICU (top right), time in HW until death in HW (middle left), time in HW until discharge (middle right), time in ICU until death in ICU (bottom left) and time in ICU until discharge from ICU (bottom right).The NP-MCM and KM estimates give the same result for the time of hospitalisation, and are represented with a single thick black line (top left).

Fig. 2 .
Fig. 2. Generalised product-limit estimator [13] of the conditional survival function S(t|x) for the time of hospitalisation, both in HW and ICU (top) and the time in ICU (bottom), incorporating the effect of the sex (male = black line, female = red line) and the ages 40 years (left) and 70 years (right) for all the COVID-19 hospitalised cases (n = 2453) in Galicia (Spain).

Fig. 3 .
Fig. 3. Number of patients in HW (top left), ICU (top right), deaths (bottom left) and discharges (bottom right) computed from 1000 simulated COVID-19 outbreaks for a period of 200 days, when the LoS are simulated depending on age and sex (blue), and unconditionally ignoring age and sex dependence (red).The real case counts of inpatients in the COVID-19 dataset from March 6th (day 1) to May 7th 2020 (day 62), rescaled to N = 1000 infected people, are also included for reference (solid black line).

Table 1 .
Estimated probabilities of the different medical events for the COVID-19 patients in Galicia (Spain) using NP-MCM and empirical estimators