Complete Subset Averaging for Quantile Regressions

We propose a novel conditional quantile prediction method based on complete subset averaging (CSA) for quantile regressions. All models under consideration are potentially misspecified, and the dimension of the regressors may go to infinity as the sample size increases. Since we average over the complete subsets, the number of models is much larger than in the usual model averaging methods, which adopt sophisticated weighting schemes over a pre-selected set of models. We propose to use equal weights but to select the proper size of the complete subset by the leave-one-out cross-validation method. Building upon the theory of Lu and Su (2015), we investigate the large sample properties of CSA and show its asymptotic optimality in the sense of Li (1987). We examine the finite sample performance via Monte Carlo simulations and empirical applications.


Introduction
The contribution of this paper is threefold. First, building upon the theory of Lu and Su (2015), we show that the complete subset quantile regression (CSQR) estimator converges to the pseudo-true value and satisfies asymptotic normality under mild regularity conditions. The uniform convergence property of CSQR is also provided. Based on these pointwise and uniform limit theories, we prove the asymptotic optimality of the selected subset size k̂ in the sense of Li (1987). Second, we implement the CSA method and show that it performs quite well both in simulations and in real data sets. In particular, we show that the performance remains satisfactory when we use a fixed number of subsets randomly drawn from the complete subsets, which is useful when the computational budget does not allow estimating the quantile regressions for all subsets. We also provide regularity conditions on the choice of this fixed number of subsets. Finally, we provide a theory that compares the performance of equal weighting and optimal weighting in quantile regression. This result formalizes the intuition that optimal weighting forecasts poorly when the number of models increases, and it extends the existing results for mean regression.
Finally, we summarize the related literature. Lu and Su (2015) and Elliott, Gargano, and Timmermann (2013) are the papers most closely related to ours. The former proposes the jackknife model averaging (JMA) method for the quantile prediction problem and derives the nonstandard asymptotic properties of the estimator. Our approach differs from theirs in that we use complete subsets as the models to be averaged and choose a scalar k by cross-validation instead of a weighting vector w. The latter proposes the CSA method in the mean prediction problem and shows by simulation studies that the CSA predictor outperforms alternative methods like bagging, ridge, lasso, and Bayesian model averaging; however, they do not show any optimality result for the estimator. Hansen (2007) and Hansen and Racine (2012) show the optimality of model averaging based on the Mallows criterion and that of the jackknife model averaging, respectively. Ando and Li (2014) propose a model averaging method in a high-dimensional setting and show an optimality result. Komunjer (2013) provides an excellent review of the quantile prediction problem for time-series data. Meinshausen (2006) proposes a quantile prediction method based on random forests. Lee (2016) studies the inference problem of predictive quantile regression when the regressors are persistent. In the empirical finance literature, Vrontos (2019, 2021) apply complete subset quantile regression to forecast realized volatility and the risk premium.
The rest of the paper is organized as follows. Section 2 introduces the model and the CSQR estimator. Section 3 presents the asymptotic properties of the CSQR estimator and the asymptotic optimality. The Monte Carlo simulation results are reported in Section 4. Section 5 investigates two empirical applications and illustrates the advantage of the proposed method. Section 6 concludes.
All the proofs are deferred to the appendix. We use the following notation. For a matrix A, ‖A‖ represents its Frobenius norm, ‖A‖ := [tr(AA′)]^{1/2}. Let λ_min(A) and λ_max(A) denote the smallest and largest eigenvalues of A. We use the notation x_n ≈ y_n to denote x_n = y_n + o_p(1), and a_n ≪ b_n to denote a_n = o(b_n).

Model and Estimator
In this section, we lay out the model under study and propose the complete subset averaging (CSA) quantile predictor. We also discuss the choice of the subset size based on the cross-validation method.

CSA Quantile Predictor
Consider a random sample {(y_i, x_i)} for i = 1, . . . , n, where the dimension of x_i can be countably infinite. Following Lu and Su (2015), we assume that {(y_i, x_i)}_{i=1}^n is generated from the following linear quantile regression model: for τ ∈ (0, 1),

y_i = µ_i + ε_i,    (1)

where µ_i = µ_i(τ) := Σ_{j=1}^∞ θ_j x_{ij}, θ_j = θ_j(τ), ε_i = ε_i(τ) := y_i − Q_y(τ|x_i), and Q_y(τ|x) is the τ-th conditional quantile function of y given x. Note that we drop τ from each expression for notational simplicity and that ε_i satisfies the quantile restriction P(ε_i(τ) ≤ 0 | x_i) = τ. Equivalently, we can express Q_y(τ|x_i) := Σ_{j=1}^∞ θ_j(τ) x_{ij}, as is often done in the quantile regression literature. We consider a sequence of K_n covariates available to the researcher, which approximate the above quantile regression model:

y_i = Σ_{j=1}^{K_n} θ_j x_{ij} + b_i + ε_i,

where b_i := Σ_{j=K_n+1}^∞ θ_j x_{ij} is the approximation error and K_n is the total number of available regressors, which may increase as the sample size n increases. Thus, we presume that all models are misspecified in a finite sample, as in Hansen (2007). Given K_n regressors, we consider a model composed of k regressors, where k ∈ {1, 2, . . . , K_n}.

There are K_n!/(k!(K_n − k)!) different ways to select k regressors out of K_n. Therefore, the complete subset of size k is composed of

M_{(K_n,k)} := K_n! / (k!(K_n − k)!)

different elements, and a model is defined as a single element of it. We use the index m_{(K_n,k)} ∈ {1, 2, . . . , M_{(K_n,k)}} for each model. For example, suppose that we have K_n = 3 regressors {x_{i1}, x_{i2}, x_{i3}} and construct a subset of size k = 2. Then, we have M_{(3,2)} = 3 different ways to choose a model: (x_{i1}, x_{i2}), (x_{i1}, x_{i3}), and (x_{i2}, x_{i3}). Each model is indexed by m_{(3,2)} ∈ {1, 2, 3}. For succinct notation, we drop the subscripts from K_n, M_{(K_n,k)}, and m_{(K_n,k)} and denote them as K, M, and m unless there is any confusion.
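For concreteness, the enumeration of complete subsets can be sketched with Python's standard library (the variable names are ours):

```python
from itertools import combinations
from math import comb

# Enumerate the complete subsets of size k = 2 out of K = 3 regressor
# indices, reproducing the example above: M_(3,2) = 3 models.
K, k = 3, 2
models = list(combinations(range(1, K + 1), k))

print(models)       # [(1, 2), (1, 3), (2, 3)]
print(comb(K, k))   # 3
```

Each tuple corresponds to one model m_{(3,2)}, and `comb(K, k)` gives M_{(K,k)} directly.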
We now consider a quantile regression model with regressors in a complete subset. Let model m with size k be given. For observation i, let x_{i(m,k)} be the k-dimensional vector of regressors corresponding to model m, e.g. x_{i(2,2)} = (x_{i1}, x_{i3})′ in the above example. We can construct a linear quantile regression model with regressors x_{i(m,k)}:

y_i = x_{i(m,k)}′ Θ_{(m,k)} + b_{i(m,k)} + ε_i,    (2)

where b_{i(m,k)} := µ_i − x_{i(m,k)}′ Θ_{(m,k)} is again the approximation error when we use only the regressors x_{i(m,k)}. The model in (2) is estimated by the standard method in linear quantile regression:

Θ̂_{(m,k)} := arg min_{Θ ∈ 𝚯} n^{−1} Σ_{i=1}^n ρ_τ(y_i − x_{i(m,k)}′ Θ),    (3)

where 𝚯 is a parameter space and ρ_τ(u) := u(τ − 1{u ≤ 0}) is the check function. Note that the estimator Θ̂_{(m,k)} is defined for each subset size k and for each model m with k regressors. As noted above, there are M different models and corresponding estimators with k regressors.
We have a few remarks here. First, we use the subscript (m, k) to denote a generic model with k regressors. However, the index set {1, . . . , M_{(K_n,k)}} itself is defined in terms of k, which implies that m is also determined by k; recall the original notation m_{(K_n,k)} above. Therefore, every model m ∈ {1, . . . , M_{(K_n,k)}} has the same number of regressors k, and we cannot choose m and k in an arbitrary way. Second, we allow the subset size k to go to infinity as n increases; in other words, there exists a sequence of subset sizes {k(n)} that diverges. This setting is natural since the upper bound K_n goes to infinity as n increases. Note that the number of regressors in each model (k_m in their notation) is also allowed to diverge in Lu and Su (2015). Both approaches allow more complex models to be averaged as n grows, as measured by k and k_m, respectively. However, Lu and Su (2015) require controlling the growth rates of M and max_m k_m separately. The proposed method constructs submodels based on the complete subsets, so M is tightly linked to K and k.
As a result, the regularity condition on the complexity of the models is expressed only in terms of K n (see Assumption 3 in Section 3).
We finalize this subsection by defining the complete subset averaging (CSA) quantile predictor.
Let the size of the complete subset k be given. For each model, we estimate the parameter Θ_{(m,k)} by (3) and construct the linear index x_{(m,k)}′ Θ̂_{(m,k)}. The CSA quantile predictor of y given x is defined as the simple average of those indices over the M different models:

ŷ(k) := M^{−1} Σ_{m=1}^M x_{(m,k)}′ Θ̂_{(m,k)}.    (4)

The CSA quantile predictor differs from the JMA quantile predictor of Lu and Su (2015) in two respects. First, we do not select the set of models to be averaged, since we average over the complete subsets of size k. Second, CSA does not estimate the weights over different models. The idea of averaging over the complete subsets was first introduced by Elliott, Gargano, and Timmermann (2013) in the conditional mean prediction setup. Heuristically speaking, since the weights can be seen as additional parameters to be estimated in the model, the equal weight can perform better in a finite sample when the number of models (i.e. the dimension of the weight vector) is large.
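As an illustration, a minimal numpy sketch of the equal-weight CSA prediction described above might look as follows. Here `fit_qr` is a crude subgradient solver standing in for a proper quantile-regression routine, random draws replace full enumeration of the subsets, and all names are our own assumptions:

```python
import numpy as np

def fit_qr(X, y, tau, lr=0.1, iters=1500):
    """Crude check-loss minimization by subgradient descent; a stand-in
    for a proper linear quantile-regression solver (illustration only)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        u = y - X @ theta
        # Subgradient of rho_tau(u) w.r.t. theta is -x * (tau - 1{u <= 0}).
        theta += lr * X.T @ (tau - (u <= 0)) / len(y)
    return theta

def csa_predict(X, y, x_new, tau, k, n_draws=50, seed=0):
    """Equal-weight average of quantile fits over randomly drawn
    size-k subsets of regressors (our simplified CSA sketch)."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    preds = []
    for _ in range(n_draws):
        s = rng.choice(K, size=k, replace=False)   # one random model
        theta = fit_qr(X[:, s], y, tau)
        preds.append(x_new[s] @ theta)
    return float(np.mean(preds))
```

The step size and iteration count are ad hoc; in practice one would use a dedicated quantile-regression solver for each submodel.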

Choice of Subset Size k
We propose to choose the subset size k using the leave-one-out cross-validation method. We will show in the next section that the subset size k chosen by this method is optimal in the sense that it is asymptotically equivalent to the infeasible optimal choice.
For k = 1, . . . , K, we define the cross-validation objective function

CV_n(k) := n^{−1} Σ_{i=1}^n ρ_τ(y_i − ỹ_i(k)),  where  ỹ_i(k) := M^{−1} Σ_{m=1}^M x_{i(m,k)}′ Θ̃_{i(m,k)},    (6)

Θ̃_{i(m,k)} is the jackknife estimator of Θ_{(m,k)}, estimated by (3) without using the i-th observation (x_i, y_i), and ỹ_i(k) is the corresponding jackknife CSA quantile predictor of the i-th outcome y_i. The prediction error is measured by the check function ρ_τ(·). Then, we choose the complete subset size that minimizes the cross-validation objective function:

k̂ := arg min_{k ∈ {1,...,K}} CV_n(k).    (7)

After choosing the complete subset size, the CSA quantile predictor is finally defined as ŷ(k̂), where the plugged-in k̂ is chosen by (7).
We finalize this subsection by adding some remarks on computation. First, we propose to use a fixed number M_max of random draws of models when M is too large to implement the method. Since M_{(K_n,k)} = K_n!/(k!(K_n − k)!), it can be quite large when the model has many potential regressors. The simulation studies in Section 4 reveal that the CSA quantile predictor still performs well with a feasible number of submodels randomly drawn from the complete subsets. We also provide regularity conditions that assure the asymptotic equivalence between using M and M_max in Section 3. Second, the proposed jackknife method can be immediately extended to the b-fold cross-validation method, where b is the partition size of the sample. Algorithm 1 below summarizes the leave-one-out cross-validation method for choosing k̂.
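The leave-one-out procedure can be sketched as below. To keep the sketch fast and self-contained, each submodel is fit by OLS and shifted by the τ-quantile of its residuals, a cheap proxy of our own devising for the quantile regressions in (3); all function names are ours:

```python
import numpy as np
from itertools import combinations

def pinball(u, tau):
    """Check-function loss rho_tau(u)."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def cv_choose_k(X, y, tau):
    """Leave-one-out CV over subset sizes k = 1, ..., K, as in
    Algorithm 1. Fitting is a simplified stand-in, not the paper's
    quantile-regression estimator."""
    n, K = X.shape
    cv = np.zeros(K)
    for k in range(1, K + 1):
        models = list(combinations(range(K), k))
        for i in range(n):
            keep = np.arange(n) != i            # leave observation i out
            preds = []
            for s in models:
                s = list(s)
                Xs = X[keep][:, s]
                b, *_ = np.linalg.lstsq(Xs, y[keep], rcond=None)
                shift = np.quantile(y[keep] - Xs @ b, tau)
                preds.append(X[i, s] @ b + shift)
            cv[k - 1] += pinball(y[i] - np.mean(preds), tau) / n
    return int(np.argmin(cv)) + 1, cv
```

With a handful of regressors this enumerates all complete subsets; for large K one would draw at most M_max submodels per size, as discussed above.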

Asymptotic Theory
In this section, we investigate the asymptotic properties of the complete subset quantile regression (CSQR) estimator. We first provide the pointwise and uniform convergence results for Θ̂_{(m,k)} and Θ̃_{i(m,k)}, respectively. Then, we show the optimality of CSA in the sense of Li (1987), which implies that k̂ is asymptotically equivalent to the infeasible optimal choice of the subset size.
In addition to the model described in Section 2, we define some notation for later use. Let f_{y|x}(·|x) be the conditional probability density function of y given x. Since all models are potentially misspecified in the model averaging literature, we define the pseudo-true parameter value for any given (m, k):

Θ*_{(m,k)} := arg min_Θ E[ ρ_τ(y_i − x_{i(m,k)}′ Θ) ].

Let ψ_τ(c) := τ − 1{c ≤ 0}. For any (m, k) with m = 1, . . . , M and k = 1, . . . , K, we define

A_{(m,k)} := E[ f_{y|x}(x_{i(m,k)}′ Θ*_{(m,k)} | x_i) x_{i(m,k)} x_{i(m,k)}′ ]  and  B_{(m,k)} := E[ ψ_τ(y_i − x_{i(m,k)}′ Θ*_{(m,k)})² x_{i(m,k)} x_{i(m,k)}′ ].

We need the following regularity conditions.
Assumption 2. (i) The conditional density f_{y|x}(·|x) is bounded above by c_f < ∞ and continuous over its support a.s. (ii) There exist constants c̲_{A(m,k)} and c̄_{A(m,k)} such that 0 < c̲_{A(m,k)} ≤ λ_min(A_{(m,k)}) ≤ λ_max(A_{(m,k)}) ≤ c̄_{A(m,k)} < ∞, and analogous constants (c̲_{B(m,k)}, c̄_{B(m,k)}) for B_{(m,k)}.

Conditions (i)-(ii) in Assumption 1 are the standard i.i.d. and quantile restrictions. Assumption 1(iii) requires some finite moment restrictions to achieve the probability bounds of various sample-mean objects in the proofs. Assumption 2 allows conditional heteroskedasticity. Note that the eigenvalues of A_{(m,k)} and B_{(m,k)} are bounded and bounded away from zero for a given (m, k).
However, these bounds (c̲_{A(m,k)}, c̲_{B(m,k)}, c̄_{A(m,k)}, c̄_{B(m,k)}) can converge to zero or diverge to infinity as n increases. The speed of convergence is restricted by Assumption 2(iv). These bounded-eigenvalue restrictions are commonly imposed in the literature on an increasing number of parameters (see, e.g., Portnoy (1984, 1985)). Assumptions 1-2 are standard and similar to those in Lu and Su (2015); see the additional remarks therein. Assumption 3 imposes some regularity conditions on the number of potential regressors K_n and the sequence of uniform bounds (c̲_A, c̲_B, c̄_A, c̄_B). Different from the regularity conditions of JMA in Lu and Su (2015), we need not restrict the growth rate of the number of potential models M directly, since M_{(K_n,k)} is determined by K_n.

However, M_{(K_n,k)} increases very quickly, at a factorial rate in K_n, so we need a stronger restriction on K_n. As stated in Assumption 3(ii), K_n can increase at most at a logarithmic rate in n. In the case of JMA, the number of regressors can increase at a polynomial rate if we set k̄ = k_M = M in their notation. This is the trade-off in proving the uniform convergence results over a larger index set than that of JMA; we discuss this point in detail below Theorem 2. The second part of Assumption 3(ii) holds if c̲_A³/(c̄_A c̄_B) is bounded away from zero or converges to zero at a slower rate than log(log n)/log n when K increases at the rate log n.
First, we prove the convergence rate and the asymptotic normality of Θ̂_{(m,k)} when the dimension k of the parameter increases.
Theorem 1. Suppose that Assumptions 1, 2, and 3(i) hold. Let C_{(m,k)} denote an l_{(m,k)} × k matrix such that C_0 := lim_{n→∞} C_{(m,k)} C_{(m,k)}′ exists and is positive definite, where l_{(m,k)} ∈ [1, k] is a fixed integer. Then,

(i) ‖Θ̂_{(m,k)} − Θ*_{(m,k)}‖ = O_p(√(k/n));

(ii) √n V_{(m,k)}^{−1/2} C_{(m,k)} (Θ̂_{(m,k)} − Θ*_{(m,k)}) →_d N(0, I_{l_{(m,k)}}), where V_{(m,k)} := C_{(m,k)} A_{(m,k)}^{−1} B_{(m,k)} A_{(m,k)}^{−1} C_{(m,k)}′.

This theorem provides an asymptotic theory for the quantile regression estimator when the model is misspecified and the number of parameters diverges to infinity, similar to Lu and Su (2015). The convergence rate in (i) is standard when k diverges as n increases. To show the asymptotic normality with a diverging number of parameters, we consider an arbitrary linear combination of Θ̂_{(m,k)}, represented by C_{(m,k)}. The difference between the two estimators, CSA and JMA, originates from the fact that CSA chooses the total number of regressors K_n first, and the number of complete subset models M_{(K_n,k)} follows automatically for each k = 1, . . . , K_n, whereas JMA selects the set of models M_n (in their notation) in advance. The number of regressors k_m in the case of JMA is then determined by the sequence of models m = 1, . . . , M_n chosen by the researcher.
Although there are slight differences in the definitions of c_{A(m,k)} and c_{B(m,k)} and their bounds from those in Lu and Su (2015), the proof of Theorem 1 is identical to theirs and is therefore omitted.
We next turn our attention to the uniform convergence results for Θ̃_{i(m,k)} and Θ̂_{(m,k)}. Beyond their own interest, the uniform convergence rates in the next theorem are required to prove the asymptotic optimality of k̂.
Theorem 2. Suppose that Assumptions 1, 2, and 3(ii) hold. Then, Θ̂_{(m,k)} and Θ̃_{i(m,k)} converge to the pseudo-true values Θ*_{(m,k)} uniformly over m ∈ {1, . . . , M} and k ∈ {1, . . . , K} (and over i ∈ {1, . . . , n} for the jackknife estimator).

Since CSA is defined on the index sets of m and k, the uniform convergence rates are defined over those sets, m ∈ {1, . . . , M} and k ∈ {1, . . . , K}. In the case of Θ̃_{i(m,k)}, we need additional uniformity over i ∈ {1, . . . , n}. As a result, the regularity conditions in Assumption 3(ii) that control the growth rates of K_n and M_{(K_n,k)} differ from those of JMA. As discussed before, since the number of complete subsets increases at a factorial rate in K_n, we need a restriction on K_n slightly stronger than that of JMA. We follow the proof strategy in Lu and Su (2015), which extends the results of Rice (1984) by using the inequality in Shibata (1981, 1982). To handle the different growth rates, we provide new technical lemmas. The proof of Theorem 2 as well as these lemmas are provided in the appendix. Finally, the uniform convergence rates are expressed in terms of the sample size n and the total number of regressors K, which goes to infinity as n increases.

We next prove the prediction equivalence when we replace M with M_max. Let 𝓜_max be a subset of {1, . . . , M} whose M_max elements are randomly drawn. Define ȳ(k) to be the CSA quantile predictor using only the M_max models:

ȳ(k) := M_max^{−1} Σ_{m ∈ 𝓜_max} x_{(m,k)}′ Θ̂_{(m,k)}.

We show the validity of M_max in the following theorem.

Theorem 3. Suppose that Assumptions 1-3 hold. Let M_max → ∞ and K/M_max → 0 as n → ∞.
Furthermore, we assume a uniform law of large numbers for the model average over the randomly drawn subsets, as discussed in the remarks below. The rate requirement for M_max is mild, and M_max = O(n^{1/2}) would work given K = O(log n).
The uniform boundedness assumption on x (m,k) is weak and holds easily in most applications.
We have some remarks on the uniform convergence assumption for the model average of z_{(m,k)} := x_{(m,k)}′ Θ̂_{(m,k)}. Note that k is discrete and the size of the functional class over k is small. Thus, whether the uniform law of large numbers holds depends on the dependence structure of z_{(m,k)} across m. For example, consider the following maximal inequality:

P( max_{1≤k≤K} | M_max^{−1} Σ_{m ∈ 𝓜_max} (z_{(m,k)} − E[z_{(m,k)}]) | > δ ) ≤ Σ_{k=1}^K P( | M_max^{−1} Σ_{m ∈ 𝓜_max} (z_{(m,k)} − E[z_{(m,k)}]) | > δ ) ≤ δ^{−2} Σ_{k=1}^K Var( M_max^{−1} Σ_{m ∈ 𝓜_max} z_{(m,k)} ),

where the second inequality holds by the Markov inequality. Since K/M_max = o(1), a sufficient condition is that Var( Σ_{m ∈ 𝓜_max} z_{(m,k)} ) = O(M_max) uniformly in k. If z_{(m,k)} is covariance stationary over m for all k, this sufficient condition becomes the absolute summability of the autocovariances of z_{(m,k)} over m. See Fazekas and Klesov (2001) for more general conditions on partial sums under different dependence structures.
We now prove the asymptotic optimality of k̂ in the sense of Li (1987). Following Lu and Su (2015), we use the final prediction error (FPE, or the out-of-sample quantile prediction error) as a criterion to evaluate the prediction performance:

FPE(k) := E[ ρ_τ(y − ŷ(k)) | {(y_i, x_i) : i = 1, . . . , n} ],

where (y, x) is an out-of-sample observation and {(y_i, x_i) : i = 1, . . . , n} is the estimation sample. The next theorem shows that k̂ is asymptotically equivalent to the infeasible best subset-size choice, defined as the minimizer of FPE(k).
Theorem 4. Suppose that Assumptions 1-3 hold. Then,

FPE(k̂) / min_{k ∈ {1,...,K}} FPE(k) →_p 1.

A similar optimality concept has been adopted in the context of the weighted average estimator (e.g. Hansen (2007), Hansen and Racine (2012), and Lu and Su (2015)) and in the context of the IV estimator (e.g. Donald and Newey (2001), Kuersteiner and Okui (2010), and Lee and Shin (2018)).
Different from JMA, CSA considers the complete subsets given (K n , k) and does not require the pre-selection of models to be considered nor the order of models. Thus, the optimality result is also independent of the initial model selection/ordering issue once the total number of regressors is given.
The index set {1, . . . , K} of CSA is discrete, while the weight space of JMA or of the jackknife model averaging in Hansen and Racine (2012) is compact. All require a finite moment condition similar to Assumption (A.1) in Li (1987), which is assured by Assumption 1(iii) above. The idea of complete subset averaging has been adopted in the forecasting literature (e.g. Elliott, Gargano, and Timmermann (2013, 2015), Rapach, Strauss, and Zhou (2010)). To the best of our knowledge, this is the first formal result showing the optimality of the subset-size selection.
Finally, we compare the performance of the nonstochastic equal weight with that of the optimal weight. In the mean prediction context, it has been observed that a simple arithmetic mean, i.e. the equal weight, often outperforms the estimated optimal weight. This empirical phenomenon is known as the 'forecast combination puzzle', and formal explanations under the mean squared error are provided by Smith and Wallis (2009), Elliott (2011), and Claeskens, Magnus, Vasnev, and Wang (2016), to name a few. Heuristically speaking, it happens when the estimation error of the optimal weight is large enough to dominate the efficiency loss caused by the equal weight. We extend this result to the class of smooth expected loss functions. This extension is crucial in our analysis since the check function ρ_τ(·) does not admit a closed-form solution, in contrast to the mean squared error used in the existing literature.
We consider the following simplified framework to focus on the main idea. Let ŷ_1, . . . , ŷ_M be predictors for y based on M different models. For example, we can think of ŷ_m = x_{(m,k)}′ Θ̂_{(m,k)} for any given k. Let w be an M-dimensional weight vector combining the M predictors. We consider only positive weights with 1_M′ w = 1, where 1_M is an M-dimensional vector of ones. Let ŷ := (ŷ_1, . . . , ŷ_M)′, let e_m := y − ŷ_m be the prediction error of ŷ_m, and let e := (e_1, . . . , e_M)′ be the vector of these prediction errors. We define the prediction error of the combined predictor as e_c(w) := y − w′ŷ = w′(1_M · y − ŷ) = w′e. Then, we can define the optimal weight w* as

w* := arg min_{w ∈ Δ^{M−1}} F(w),

where Δ^{M−1} is the standard (M − 1)-simplex and F(w) := E[L(w; e_c)] is an expected loss function.
For example, the mean squared error in Elliott (2011) can be written in terms of the quadratic loss function: L(w; e_c) = (e_c(w))². The quantile prediction error adopted in this paper can be written in terms of the check function: L(w; e_c) = ρ_τ(e_c(w)). Let w̄ := M^{−1} 1_M be the equal-weight vector and ŵ be an estimator for w*, with η̂ := ŵ − w*.
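A toy simulation contrasting the equal weight w̄ with a crudely estimated weight ŵ under the check loss may help fix ideas. The softmax-style weight estimate below is our own stand-in for illustration, not the paper's optimal-weight estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, M, n = 0.5, 20, 200

# Hypothetical prediction errors e_{im} of M forecasts: each forecast
# misses the target by independent noise, so averaging helps.
e = rng.normal(size=(n, M))

def avg_check_loss(u, tau=tau):
    """Average check loss rho_tau over observations."""
    return float(np.mean(np.where(u >= 0, tau * u, (tau - 1) * u)))

w_bar = np.full(M, 1.0 / M)                  # equal weight

# Crude weight "estimate" from in-sample losses on the first half;
# the softmax form and its exponent are arbitrary illustration choices.
e_tr, e_te = e[: n // 2], e[n // 2:]
in_losses = np.array([avg_check_loss(e_tr[:, m]) for m in range(M)])
w_hat = np.exp(-10.0 * in_losses)
w_hat /= w_hat.sum()

print(avg_check_loss(e_te @ w_bar), avg_check_loss(e_te @ w_hat))
```

Because the estimated weights chase in-sample noise while the equal weight has no estimation error, the equal weight's out-of-sample check loss is typically competitive or better, mirroring the M n σ̄²_η trade-off discussed in the text.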
We have some remarks. First, the result shows that the equal weight w̄ may work better than the estimated optimal weight ŵ when we average many models, i.e. when M is large. Compared to the optimal prediction error F(w*), the efficiency loss of w̄ is bounded by 2^{−1} λ̄_max (1 + 3M^{−1}), which converges to 2^{−1} λ̄_max for large enough M. In contrast, the upper bound on the mean efficiency loss of ŵ diverges as M increases. We admit that these upper bounds only reflect the worst-case scenario. However, they formally confirm the intuition that the equal weight can outperform the estimated optimal weight under the class of smooth expected loss functions. Second, the prediction error of w̄ under a quadratic loss function converges to the optimal prediction error as M increases.
The same result is also proved in Proposition 1 of Elliott (2011); different from his result, ours does not require decomposing the prediction error into a common component and an idiosyncratic component. This result is summarized in Corollary 6 below. Third, to achieve the optimality, the estimation errors of the weight ŵ, {η̂_m}, should vanish fast enough. Let σ̄²_η = O(c_n). A sufficient condition for the optimality is c_n = o(M_n^{−1}). For example, if c_n is the parametric rate n^{−1/2}, then M_n should be of order o(n^{1/2}); when M_n = O(log n), for example, this condition is satisfied. However, if M_n increases too fast, then M_n σ̄²_η diverges and ŵ may work worse than the equal weight w̄. Next, if we restrict our attention to the expected check function adopted in this paper, F(w) is twice differentiable if the conditional density f(y|ŷ) is smooth for all ŷ. From Theorem 1 in Angrist, Chernozhukov, and Fernández-Val (2006), we have

F(w) = 2^{−1} E[ ω̄_τ(ŷ, w) · (w′ŷ − Q_τ(y|ŷ))² ] + E[ ρ_τ(y − Q_τ(y|ŷ)) ],    (10)

where Q_τ(y|ŷ) is the conditional quantile function of y given ŷ and ω̄_τ(ŷ, w) := 2 ∫_0^1 (1 − u) f( u·w′ŷ + (1 − u)·Q_τ(y|ŷ) | ŷ ) du is a density-based weighting function. Thus, the smoothness of F(w) is implied by the twice differentiability of f(y|ŷ). Finally, equation (10) shows that CSA would not work well if we include many irrelevant models. Similar to the quantile regression specification error in Angrist et al. (2006), we call (w′ŷ − Q_τ(y|ŷ)) the quantile prediction specification error. If there are many irrelevant models, the optimal weight w* would be sparse, i.e. many elements of w* would be zero. In such a case, CSA with w̄ = M^{−1} 1_M results in a larger quantile prediction specification error for given M and n. For example, if there is only one relevant regressor and all the other coefficients θ_j(τ) equal zero, the complete subsets will be composed of many irrelevant models. As we will see in the simulation studies in the next section, CSA does not perform well in this situation. Therefore, a pre-screening step is desirable to achieve a satisfactory result for CSA.
Corollary 6. Suppose that we have a quadratic loss function, L(w; e_c) = (e_c(w))². Then the prediction error of the equal weight, F(w̄), converges to the optimal prediction error F(w*) as M → ∞.

Monte Carlo Simulations
In this section, we investigate the finite sample performance of the proposed estimator in simple Monte Carlo experiments. We consider two categories of the simulation designs: (i) all candidate models are misspecified, and (ii) candidate models include the true model.
First, we adopt a data generating process (DGP) in which x_{i1} = 1 and (x_{i2}, . . . , x_{i1000}) follows a multivariate normal distribution N(0, Σ) with Σ_{jk} = ρ_x for j ≠ k and Σ_{jj} = 1. Therefore, the regressors are possibly dependent on each other, which is a more general feature than in the designs of the existing literature; see, e.g., Hansen (2007) and Lu and Su (2015). The error term ε_i follows N(0, 1), independently of x_{ij}, and the sample is i.i.d. over i. We compare the performance of the proposed complete subset averaging estimator (CSA) with the jackknife model averaging estimator (JMA) of Lu and Su (2015), the ℓ₁-penalized quantile regression (L1QR) of Belloni and Chernozhukov (2011), the bootstrap aggregating method (BAG) of Breiman (1996), and the ℓ₂-penalized quantile regression (L2QR). L1QR and L2QR are also called the lasso and the ridge regression in the mean regression setup. The set of models used for JMA is constructed in a nested (encompassing) way, e.g. the m-th model contains the first m regressors.
For CSA, we set the maximum number of submodels to M_max = 100. Thus, we draw 100 models randomly from the complete subsets of size k whenever M_{(K_n,k)} is bigger than 100. Furthermore, we reduce the computational burden by applying 10-fold cross-validation when n = 150. The tuning parameter of L1QR is chosen by Equation (2.7) in Belloni and Chernozhukov (2011). The bootstrap size of BAG is set to 1000. The tuning parameter of L2QR is chosen by 10-fold cross-validation over the set {0.01, 0.05, 0.1, 0.5, 1.0}, which was constructed after some pre-simulation studies.
To compare the performance, we first compute FPE(r) for each replication r = 1, . . . , R as follows. After estimating the model with n in-sample observations, we generate 100 additional out-of-sample observations. Then, FPE(r) is calculated as

FPE(r) := (1/100) Σ_{s=1}^{100} ρ_τ(y_s − ŷ_s),

where ŷ_s is the predicted value of each method for the out-of-sample observation s. We then construct three comparison measures: the mean FPE over the R replications, the winning ratio (the fraction of replications in which a method achieves the smallest FPE), and the loss-to-CSA ratio (the fraction of replications in which a method has a larger FPE than CSA). Note that the loss-to-CSA ratio provides a more direct binary comparison of each method to CSA. We set the total number of replications to R = 1000.

Figures 1-3 and Tables 1-4 summarize the simulation results over all designs. Overall, the performance of CSA compared to the alternatives is quite satisfactory. We first direct our attention to Figure 1 and Tables 1-2. In these simulation designs, we vary R² over {0.1, 0.2, . . . , 0.9} while setting ρ_x = 0.9, and we consider two quantiles, τ = 0.1 and 0.5. From the four graphs in Figure 1, we confirm that CSA outperforms the alternatives uniformly over the R² values in terms of FPE at both quantiles. The prediction performance of CSA is better when the sample size is small, n = 50, and the gap decreases as the sample size increases to n = 150. At τ = 0.5, L1QR performs second best when n = 50 but L2QR does when n = 150; thus, the performance ordering next to CSA is not stable. At τ = 0.1, BAG performs second best overall, but it deteriorates when R² is very high, e.g. R² = 0.9. We also note that the performance of CSA is relatively stable over R², while the prediction errors of the alternatives increase steeply for larger R² when n = 50. The same results are confirmed in Tables 1-2. CSA shows the highest winning ratios in all designs except τ = 0.5 and R² = 0.1, where that of L2QR is slightly higher. In the binary comparison (loss to CSA), all methods lose more than 50% of the time to CSA in all designs, and more than 80% in some designs.
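The three comparison measures can be sketched as follows on toy FPE draws (the numbers are synthetic and ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
R = 1000
# Synthetic FPE(r) draws for three methods (illustration only).
fpe = {"CSA": rng.uniform(0.90, 1.10, R),
       "JMA": rng.uniform(0.95, 1.20, R),
       "L2QR": rng.uniform(0.95, 1.25, R)}

# Mean FPE over replications.
mean_fpe = {name: float(v.mean()) for name, v in fpe.items()}

# Winning ratio: fraction of replications where a method has the
# smallest FPE among all methods.
stacked = np.column_stack([fpe[name] for name in fpe])
best = np.argmin(stacked, axis=1)
win_ratio = {name: float(np.mean(best == j)) for j, name in enumerate(fpe)}

# Loss-to-CSA: fraction of replications where a method's FPE exceeds CSA's.
loss_to_csa = {name: float(np.mean(v > fpe["CSA"]))
               for name, v in fpe.items() if name != "CSA"}
```

By construction the winning ratios sum to one across methods, while the loss-to-CSA ratio gives a pairwise comparison against CSA only.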
Therefore, we conclude that both the winning ratio and the loss-to-CSA measures favor CSA in this set of simulation designs.
In the next simulation, we study the performance over a wider range of quantiles. We vary the quantile τ ∈ {0.1, 0.2, . . . , 0.9} while setting R² = 0.5 and ρ_x = 0.9. The results are summarized in Figure 2 and Table 3. In Figure 2, CSA outperforms the alternatives uniformly over all quantiles for both sample sizes, followed by BAG and L2QR. Again, the gap decreases as the sample size increases. It is also interesting that all estimators predict better at the tails of the distribution and show the largest prediction errors at the median. The winning ratio and the loss-to-CSA measures in Table 3 are also satisfactory.
Third, we check the performance over different levels of dependence among the predictors. We vary ρ_x ∈ {0, 0.1, 0.2, . . . , 0.9} while setting R² = 0.5 and τ = 0.5. Since (x_{i2}, . . . , x_{i1000}) are generated from the multivariate normal distribution, they are independent when ρ_x = 0. Figure 3 reveals an interesting point: CSA performs better than the alternatives whenever there exists any correlation between the predictors, i.e. ρ_x > 0. Recall that most simulation studies in the literature consider independent predictors. As we can see from the empirical applications in the next section, however, predictors are usually correlated with each other. Therefore, it is promising that CSA performs better when there is any correlation among predictors. Elliott, Gargano, and Timmermann (2013) also report, in the conditional mean prediction setting, that the CSA approach performs better when predictors are correlated with each other. In Table 4, both the winning ratio and the loss-to-CSA measures improve dramatically once ρ_x moves away from zero; at ρ_x = 0, JMA performs the best.
We next consider the second category of simulation designs, where the candidate models include the true DGP. The new simulations are based on the following model:

y_i = θ Σ_{j=1}^K β_j x_{ij} + ε_i,

where we observe all K predictors in the sample. We consider K = 5, 15 when n = 50 and K = 10, 20 when n = 150. Similar to the previous simulations, the population R² is controlled by θ. We set R² = 0.5, τ = 0.5, and ρ_x = 0.9. Instead of varying R², τ, and ρ_x, we consider three signal structures in this simulation:

Sparse signal: β_1 = 1 and β_j = 0 for j ≥ 2;
Decreasing signal: β_j = j^{−1};
Constant signal: β_j = 1 for all j.

Therefore, we consider 12 new DGPs in total.
Tables 5-7 summarize the simulation results. First, consider the loss-to-CSA ratio in the second column (JMA) of these tables. Note that the loss ratio increases as K increases across all designs, which is expected from the theoretical results in Theorem 5. Second, CSA performs worse in the sparse signal models than in the other two designs. As discussed below equation (10), this is expected from the theory in Section 3, since the sparse design generates many subsets with totally irrelevant predictors. Third, it is interesting that JMA does not particularly stand out in this setup, where the candidate models include the true one. Also note that L1QR does not particularly stand out in the sparse signal model. In fact, L2QR performs well over all three signal designs. Given that L2QR is understudied in the literature, this would be an interesting topic for future research.
In sum, we confirm that CSA shows satisfactory finite sample properties via Monte Carlo simulation studies. Related to the forecast combination puzzle, we observe a similar phenomenon in quantile forecasting and confirm some theoretical predictions developed in Section 3.

Empirical Illustration
In this section, we investigate the performance of the proposed method with real data sets.
Specifically, we revisit two empirical applications in Lu and Su (2015): (i) quantile forecast of excess stock returns; and (ii) quantile forecast of wages. Following the simulation studies in Section 4, we compare the performance of the complete subset averaging (CSA) method to the Jackknife Model Averaging (JMA), the 1 -penalized quantile regression (L1QR), the bootstrap aggregating method (BAG), and the 2 -penalized quantile regression (L2QR).

Stock Return
The data set is composed of monthly observations of the US stock market from January 1950 to December 2005 (T = 672). The dependent variable is the excess stock return. We use the following twelve regressors: default yield spread, treasury bill rate, net equity expansion, term spread, dividend price ratio, earnings price ratio, long term yield, book-to-market ratio, inflation, return on equity, lagged dependent variable, and smoothed earnings price ratio. See Lu and Su (2015) and Campbell and Thompson (2007) for the details of the data set. Note that JMA needs to select the order of important regressors, but no such selection is needed for CSA, BAG, or L2QR. L1QR selects important regressors automatically via the ℓ1-penalty.
We forecast the one-period-ahead excess stock returns at the 0.5 and 0.05 quantiles using various fixed in-sample sizes, T_1 = 48, 60, 72, 96, 120, 144, and 180. The forecast performance is measured by the out-of-sample R² defined as
$$R^2 = 1 - \frac{\sum_{t} \rho_\tau\left(y_{t+1} - \hat{y}_{t+1|t}\right)}{\sum_{t} \rho_\tau\left(y_{t+1} - \bar{y}_{t+1|t}\right)},$$
where $\hat{y}_{t+1|t}$ is the one-period-ahead τ-quantile prediction at time t using the data from the past T_1 periods, and $\bar{y}_{t+1|t}$ is the unconditional τ-quantile for the same T_1 periods. The out-of-sample R² measures the relative performance of a forecast method compared to the unconditional historical quantile; higher values of R² imply better forecasting performance. Table 8 summarizes the forecasting results. In addition to R², we report the ranking of each forecasting method, and the mean and median of the selected k. The upper panel of Table 8 reports the results when τ = 0.05. The R² of CSA is higher than that of JMA uniformly over the different sample sizes (T_1), and the gap between the two R²'s is substantial except when T_1 = 180. BAG performs well when T_1 is small. The performance of L2QR is not satisfactory over all in-sample sizes. We next turn to the lower panel, where τ = 0.5. Again, CSA performs best or second best except when T_1 = 144. CSA performs better when T_1 is small, while JMA does better when T_1 is larger.
Overall, the gap between the R²'s is small when τ = 0.5. As observed in the simulation studies, the performance of the two estimators becomes similar as the sample size increases in both panels. It is also noticeable that the selected k of CSA increases as the sample size increases and that CSA selects a relatively large k across all T_1 and τ. In contrast to the τ = 0.05 case, BAG performs poorly when τ = 0.5. L2QR also shows poor performance.
In sum, the performance of CSA is satisfactory in this forecasting exercise. It is quite stable over different in-sample sizes (T_1) and different quantiles in terms of the performance ranking. Among the alternatives, BAG and JMA perform well at certain quantiles (0.05 and 0.5, respectively), but they do poorly when applied at other quantiles.
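The out-of-sample R² used in this exercise can be computed from any sequence of conditional and benchmark forecasts as below. The check-loss form of the R² is reconstructed from the text's description, so treat it as an assumption.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss rho_tau(u)."""
    return u * (tau - (u < 0))

def oos_r2(y, cond_fc, uncond_fc, tau):
    """Out-of-sample R^2: one minus the ratio of the summed check losses of
    the conditional forecasts to those of the unconditional quantile."""
    num = np.sum(check_loss(y - cond_fc, tau))
    den = np.sum(check_loss(y - uncond_fc, tau))
    return 1.0 - num / den
```

A forecast no better than the rolling unconditional quantile yields R² ≤ 0; in particular, `oos_r2(y, uncond, uncond, tau)` is exactly 0.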

Wage
In this subsection, we conduct quantile forecast exercises using the Current Population Survey (CPS) data from 1975. The same data set is also used by Lu and Su (2015) and Hansen and Racine (2012) for quantile and mean forecast exercises, respectively. The sample size is n = 526 and we use the logarithm of the average hourly wage as the dependent variable. We use the following ten regressors: professional occupation, years of education, years with current employer, female, service occupation, married, trade, SMSA, services, and clerk occupation.
We split the sample into an estimation sample of n_1 randomly drawn observations and an evaluation sample of the remaining n − n_1 observations. The estimation sample size varies over n_1 = 50, 100, 150, and 200, and the random splitting is repeated 200 times for each n_1. The out-of-sample R² is defined as
$$R^2 = 1 - \frac{\sum_{s} \rho_\tau\left(y_s - \hat{y}_s\right)}{\sum_{s} \rho_\tau\left(y_s - \bar{y}_s\right)},$$
where $\hat{y}_s$ is the τ-th conditional quantile predictor and $\bar{y}_s$ is the unconditional τ-quantile estimate from the estimation sample. Again, R² measures the prediction performance relative to the unconditional quantile estimate. Table 9 summarizes the exercise results. We confirm that CSA shows good and stable quantile prediction performance. In this application, BAG shows quite similar performance to CSA.
Similar to the stock return application, CSA performs better than BAG when τ = 0.5, and BAG performs better when τ = 0.05. The prediction results of JMA, L1QR, and L2QR are worse than those of CSA and BAG. The performance gaps are larger when the estimation sample size (n_1) is small, and they narrow as n_1 increases. As predicted by the theory and also confirmed in the stock return application, the selected k increases as n_1 increases.
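The repeated random-splitting scheme described above can be sketched as follows. The predictor `predict_fn` is a hypothetical user-supplied forecaster standing in for any of the compared methods; the use of numpy's permutation and the check-loss R² form are implementation assumptions.

```python
import numpy as np

def _rho(u, tau):
    # Check (pinball) loss rho_tau(u).
    return u * (tau - (u < 0))

def split_eval(X, y, n1, tau, predict_fn, n_splits=200, seed=0):
    """Average out-of-sample R^2 over repeated random splits into an
    estimation sample of size n1 and an evaluation sample of size n - n1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    r2 = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        est, ev = idx[:n1], idx[n1:]
        yhat = predict_fn(X[est], y[est], X[ev], tau)  # conditional forecasts
        ybar = np.quantile(y[est], tau)                # unconditional benchmark
        num = np.sum(_rho(y[ev] - yhat, tau))
        den = np.sum(_rho(y[ev] - ybar, tau))
        r2.append(1.0 - num / den)
    return float(np.mean(r2))
```

As a sanity check, a predictor that simply returns the estimation-sample unconditional quantile attains R² = 0 on every split.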

Conclusion
In this paper, we propose a novel conditional quantile prediction method based on complete subset averaging of quantile regressions. We show the asymptotic properties of the estimator when the dimension of regressors diverges to infinity as the sample size increases. The size of the complete subset is chosen by the leave-one-out cross-validation method. We prove that the subset size chosen by this method is optimal in the sense that it is asymptotically equivalent to the infeasible optimal size minimizing the final prediction error. The prediction performance in the simulation studies and empirical applications is satisfactory.
We conclude with two potential extensions of the proposed method. First, we can think of a different approach to choosing the complete subset size. Recently, Hirano and Wright (forthcoming) propose a Laplace cross-validation method, where the tuning parameter of interest is chosen by the pseudo-Bayesian posterior mean, and show that it works better than the standard cross-validation method when the risk function is asymmetric. It would be interesting to check how it performs in CSA quantile prediction. Second, it would be useful to extend the results to time-series data, possibly including persistent regressors (e.g., Fan and Lee (2019)). We leave these for future research.

Appendix
In this appendix, we provide all necessary lemmas and technical proofs of the main text.
Lemma 1. Let $e_n := (nMK^2)^{1/4}$ and suppose that $K/\log(n) = O(1)$. Then the following rate conditions hold: (i) $K = o(M)$; (ii) $(K \log M)/n = o(1)$; (iii) $e_n \log M / n = o(1)$.
Proof of Lemma 1. (i) Recall the dependence of M on K and k. By construction, $M_{K,k} = \binom{K}{k}$, and the result follows. (iii) It is enough to show that $2^{\log n}/n = o(1)$. Let $c_{1n} = 2^{\log n}/n$. Then $\log c_{1n} = (\log 2 - 1)\log n \to -\infty$.
Therefore, $c_{1n} = o(1)$ and the desired result is established.
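The rates in Lemma 1 are easy to verify numerically for small cases; the sketch below checks that $M = \binom{K}{k}$ dwarfs K and that $c_{1n} = 2^{\log n}/n$ vanishes. The specific (K, k) and n values are our choices for illustration.

```python
import math

# (i) K = o(M): the number of size-k subsets overwhelms K itself.
for K, k in [(10, 5), (15, 7), (20, 10)]:
    M = math.comb(K, k)
    print(K, M, M / K)

# (iii) c_1n = 2^{log n}/n -> 0, since log c_1n = (log 2 - 1) log n -> -inf.
for n in [10**2, 10**4, 10**6]:
    print(n, 2 ** math.log(n) / n)
```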
Proof of Lemma 2. The triangle inequality implies a bound of the form $A_1 + A_2$. We first investigate $A_1$. We next turn our attention to $A_2$. Let $v_{i(m,k)} := x_{i(m,k)} - E[x_{i(m,k)}]$ and note that $Var(v_{i(m,k)}) \le CK$ for some generic constant $C > 0$. Let $e_n := (nMK^2)^{1/4}$. Boole's and Bernstein's inequalities then bound the corresponding tail probability. The convergence result follows since $K = o(M)$ by Lemma 1(i), $(K \log M)/n = o(1)$ by Lemma 1(ii), and $e_n \log M / n = o(1)$ by Lemma 1(iii).

Proof of Theorem 2
The proof is similar to Lu and Su (2015) except for the last part, which shows the convergence of the maximal inequality bound. Let $\delta_n := L n^{-1} K \log n$ for some large constant $L < \infty$, and let $S_{(m,k)}(\delta_n)$ denote the corresponding neighborhood of $\Theta^*_{(m,k)}$. The same arguments as in Lu and Su (2015) imply that the stated bound holds for any $\Theta_{(m,k)} \in S_{(m,k)}(\delta_n)$. The claim in (i) is established by showing that the following maximal inequality converges to zero. We first derive its upper bound, where $W_{i(m,k)} := n(\Theta_{(m,k)} - \Theta^*_{(m,k)})' A_{(m,k)} (\Theta_{(m,k)} - \Theta^*_{(m,k)})$. We apply arguments similar to those in the proof of Theorem 3.2 of Lu and Su (2015) to bound $W_{i(m,k)}$ by $l_{(m,k)}$. Let $c_{AB} := c_A c_B / c_A^2$ and $\bar{l} := \max_{1 \le k \le K} \max_{1 \le m \le M} l_{(m,k)}$. Then the above inequality for $W_{i(m,k)}$ and the corrected version of Lemma 2.1 of Shibata (1981, 1982) imply the stated bound. For the last equality, note first that $\log(n\delta_n^2 c_A/(\bar{l} c_{AB})) / (n\delta_n^2 c_A/(\bar{l} c_{AB}) - 1) = o(1)$ by Assumption 3(ii). For the leading term, the first inequality holds by the triangle inequality and the second by the Cauchy-Schwarz inequality; the final inequality holds by the uniform convergence and boundedness assumptions and by Theorem 2(ii) above. Similarly, we can show $EQ_2 = o_p(1)$.

Proof of Theorem 4
It suffices to show that, with $\mathcal{K} := \{1, ..., K_n\}$, the convergence in (12) holds. We first expand the numerator by applying Knight's identity repeatedly.
It is straightforward to derive all terms except $CV_{4n}$. We need the following two results to obtain $CV_{4n}$. Let $E_x$ denote the expectation with respect to a random variable $x$.
Applying the identity (13), the second equality holds by the independence of the sample $\{x_i, \varepsilon_i\}$ from the generic random variable $(x, \varepsilon)$.
We are now ready to prove (12). We first show that the denominator of (12) is uniformly bounded away from zero and then show the uniform convergence of $CV_{1n}, \ldots, CV_{5n}$.
The last convergence result follows from the order conditions in Assumption 3. Claim 6: $CV_{5n} = o_p(1)$. Since $CV_{5n}$ does not depend on $k$, this result follows from the weak law of large numbers.

Proof of Theorem 5
(i) There exists $\bar{w}$ between $\hat{w}$ and $w^*$ such that the second-order expansion holds, where $\hat{\lambda}$ is a Lagrange multiplier from the constrained optimization problem. Note that the second equality above comes from the first-order condition for $w^*$ and that the third equality holds by the normalization $1_M' w = 1$ for any weight $w$. We investigate the upper bound of the quadratic term:
$$\frac{1}{2}(\hat{w} - w^*)' \nabla^2 F(\bar{w})(\hat{w} - w^*) = 2^{-1}\left(\hat{w}' \nabla^2 F(\bar{w})\hat{w} - 2\hat{w}' \nabla^2 F(\bar{w})w^* + w^{*\prime} \nabla^2 F(\bar{w})w^*\right) \equiv 2^{-1}(I + II + III),$$
where $\|\cdot\|_1$ denotes the $\ell_1$-norm. Therefore, we obtain the stated bound, which establishes the desired result.

Proof of Corollary 6
Following arguments similar to those in Theorem 5, we only need to show that $w^{*\prime} \Sigma w^* \to 0$ as $M \to \infty$.