ESTIMATION OF A HIGH-DIMENSIONAL COUNTING PROCESS WITHOUT PENALTY FOR HIGH-FREQUENCY EVENTS

This paper introduces a counting process for event arrivals in high-frequency trading, based on high-dimensional covariates. The novelty is that, under sparsity conditions on the true model, we do not need to impose any model penalty or parameter shrinkage, unlike Lasso. The procedure allows us to derive a central limit theorem to test restrictions in a two-stage estimator. We achieve this by the use of a sign constraint on the intensity, which must be positive. In particular, we introduce an additive model to extract the nonlinear impact of order book variables on buy and sell trade arrivals. In the empirical application, we show that the shape and dynamics of the order book are fundamental in determining the arrival of buy and sell trades in the crude oil futures market. We establish our empirical results by mapping the covariates into a higher-dimensional space. Consistent with the theoretical results, the estimated models are sparse in the number of parameters. Using this approach, we are also able to compare competing model hypotheses on the basis of an out-of-sample likelihood-ratio type of test.


INTRODUCTION
Counting processes are continuous-time stochastic processes with nondecreasing trajectories, taking values in the set of nonnegative integers. This paper is concerned with the estimation of the intensity of a counting process that depends on high-dimensional covariates. In particular, we are interested in modeling the intensity of high-frequency trading events using a possibly large number of covariates, where the impact of each of them can be nonlinear. The dimension increases further when the unknown nonlinearity is modeled by a series expansion or a similar procedure. The motivation of this study is to use information from the order book to model the intensity of buy and sell arrivals. The importance of order book variables has already been shown by various authors, with varying degrees of complexity (Hall and Hautsch, 2007; Cont, Kukanov, and Stoikov, 2014; Kercheval and Zhang, 2015; Sancetta, 2018). The intensity of the counting process can be used to predict buy or sell trade arrivals at infinitesimal timescales, using relevant microstructure variables such as order book volume imbalances, order flow, and spread as covariates.
The use of counting processes in high-frequency financial modeling was pioneered by Engle and Russell (1998). Since then, they have become increasingly popular in the literature (see Bauwens and Hautsch, 2009, for a survey). A possible way to characterize a counting process is via its intensity. Intuitively speaking, the intensity is the instantaneous rate of occurrence of events conditional on the past history.
For definiteness, let $N := (N(t))_{t\ge 0}$ be a counting process starting at zero, and $\lambda^* := (\lambda^*(t))_{t\ge 0}$ a predictable process, both adapted to a filtration $\mathcal{F} = (\mathcal{F}_t)_{t\ge 0}$ and such that $M := (M(t))_{t\ge 0}$, where $M(t) := N(t) - \int_0^t \lambda^*(r)\,dr$, is an $\mathcal{F}_t$-martingale. The process $\lambda^*$ is an $\mathcal{F}_t$-intensity of $N$. We assume that there is a predictable process $X = (X(t))_{t\ge 0}$ that takes values in $[0,1]^K$, where $K$ is large relative to the sample period, such that

$$\lambda^*(t) = X(t)'b^*, \tag{1}$$

where $b^* \in [0,\infty)^K$ is an unknown sparse parameter. All vector-valued quantities are column vectors, and the prime symbol stands for transpose. By sparse, we mean that $b^*$ has a small number of nonzero coefficients relative to $K$. We assume that the covariates are positive stochastic processes. Together with the assumption that the true coefficient vector $b^*$ is nonnegative, this ensures that the intensity is a positive stochastic process. We shall show that the nonnegativity restriction arises naturally in some parameterizations. This positivity constraint then has a regularization property similar to Lasso. The goal of the paper is to find an estimator for $b^*$ when we observe a single trajectory/sample from $(N(t),X(t))_{t\ge 0}$ over a time interval $[0,T]$ with $T\to\infty$.
The empirical application in Section 3 further motivates the model, and Section 3.2 provides further discussion on the scope and limitations of the modeling strategy.

Remarks on the Model Restrictions
If the covariates are bounded, the restriction to $[0,1]^K$ is a mere linear transformation. Most covariates obtained from the order book satisfy this condition by construction. Examples include order book volume imbalances (see (10) in Section 3.1). On the other hand, variables such as durations are not bounded. However, we can always find transformations of the data to map variables into a bounded range. The type of transformation is a function of the modeling objectives. We provide more concrete remarks in Section 3.2.
The assumptions that we shall use essentially imply that the covariates process $X$ is ergodic. We do not assume any stationarity. This is the weakest possible assumption for econometric inference. Although we do not do so in the empirical application, one example of the flexibility of the framework is to consider nonlinear Hawkes processes as one of the covariates, as long as they satisfy some stability conditions (Brémaud and Massoulié, 1996). In this case, one covariate could be set equal to $\phi\left(\int_{(0,t)} h(t-s)\,dN(s)\right)$, where $\phi(\cdot)$ is a Lipschitz function with range in $[0,1]$, and $h(\cdot)$ is positive and integrable. If $h$ is unknown, we could suppose a finite set of such functions $\{h^{(l)} : l = 1,2,\ldots,L\}$ and generate as many covariates as $L$ to capture any self-exciting nature of the intensity. Alternatively, the kernel function $h$ would have to be estimated using a sample antecedent to the one used in the estimation of the model.
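The Hawkes-type covariate just described can be sketched as follows. This is only an illustration: the exponential kernel $h(u) = \text{scale}\cdot e^{-\text{decay}\cdot u}$, the link $\phi(x) = 1 - e^{-x}$, and the function name are assumptions chosen for the example, not choices made in the paper.

```python
import math

def hawkes_covariate(event_times, t, decay=1.0, scale=1.0):
    """Return phi(sum over events s < t of h(t - s)), a value in [0, 1).

    Assumptions (hypothetical, for illustration only):
      h(u)   = scale * exp(-decay * u)   positive and integrable kernel
      phi(x) = 1 - exp(-x)               1-Lipschitz on [0, inf), range [0, 1)
    """
    # Only events strictly before t enter the sum, so the covariate
    # depends on the past alone (integration over the open interval (0, t)).
    excitation = sum(
        scale * math.exp(-decay * (t - s)) for s in event_times if s < t
    )
    return 1.0 - math.exp(-excitation)

# Hypothetical trade times (in seconds) and a few evaluation times.
events = [0.5, 1.0, 1.2]
for t in (0.4, 1.0, 1.3, 5.0):
    print(t, hawkes_covariate(events, t))
```

Using several decay values in place of a single one would generate one covariate per candidate kernel, in the spirit of the finite set of kernels mentioned above.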

Contributions and Relation to Other Work
This paper contributes to the characterization of the impact of order book variables on the intensity of buy and sell trade arrivals. In order to do so in the most robust and simple to interpret way, it relies on a methodology that is elementary from an econometric point of view, but powerful.
The literature on modeling the arrival of high-frequency events using order book variables is growing (inter alia Hall and Hautsch, 2007; Cont et al., 2014; Sancetta, 2018; Morariu-Patrichi and Pakkanen, 2022). This is an important practical problem. Algorithmic traders do track the order book (MacKenzie, 2017). The order book not only contains information about liquidity, but also helps to identify informed traders. Nowadays, unlike in traditional models of informed trading (Glosten and Milgrom, 1985; Kyle, 1985), sophisticated informed investors prefer to rely on passive execution. This means that they would avoid buying at the ask and paying the spread. Instead, they would break their orders into small ones and patiently fill most of them by joining the bid price. There are two reasons for this. First, informed traders want to reduce their cost. Second, they want to reduce the amount of signaling. It turns out that uninformed market participants are smart and do look at quotes to infer information beyond liquidity. In fact, our empirical results show that the intensity of buy arrivals is a nonlinear increasing function of quoted buy orders over quoted sell sizes. The empirical application focuses on crude oil futures as the main instrument, together with information from other auxiliary instruments.
The model in (1) is similar to the multiplicative intensity model of Aalen (1980). There, we observe i.i.d. copies in the presence of censoring. For each copy, the covariates are determined at time zero, and so they are usually referred to as marks.
Here, we observe a single trajectory of the data. The covariates are continuous-time stochastic processes and change as new information becomes available during trading. The absence of censoring leads to a simpler estimation procedure that does not require the use of U-statistics. U-statistics would be impractical for the sample sizes considered in this paper. The high-dimensional version of the Aalen multiplicative intensity model has been considered by several authors (e.g., Gaïffas and Guilloux, 2012, and the references therein).
From a technical point of view, our results are derived by minimizing a least-squares criterion for count processes. In our problem, we impose a nonnegativity constraint without a penalty term, unlike what is usually done in the literature (Gaïffas and Guilloux, 2012; Alaya, Gaïffas, and Guilloux, 2015, and the references therein). Then, we can directly estimate the model using standard quadratic programming with no need to use a link function that ensures positivity. Hence, this work is related to results for nonnegative least squares. Under a sparsity assumption on the vector of regression coefficients, the sign constraint imposes restrictions that lead to a regularization as effective as Lasso with no need to tune a penalty parameter. We obtain results equivalent to nonnegative least squares with i.i.d. Gaussian errors (Meinshausen, 2013; Slawsky and Hein, 2013). However, in our problem, we need to control a dependent continuous-time process.
The use of quadratic programming is important in our context. In practice, models can be calibrated on months of data, and the high-frequency data can be so large that they cannot be held in random-access memory. However, the problem is written in terms of sufficient statistics whose dimension does not grow with the length of the sample; hence, they can be held in memory, making estimation very simple.
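The memory argument above can be illustrated with a minimal sketch that accumulates the sufficient statistics in one pass over the data. It assumes that the statistics are $\frac{1}{T}\int_0^T X(t)X(t)'\,dt$ and $\frac{1}{T}\int_0^T X(t)\,dN(t)$, and that covariates are piecewise constant between sampling times; the function name and the toy stream are hypothetical.

```python
def update_stats(gram, xn, x, dt, n_events):
    """Add one sampling interval to the running sufficient statistics.

    gram     : K x K list of lists, running integral of X X' dt
    xn       : length-K list, running integral of X dN
    x        : covariate vector held constant on the interval
    dt       : interval length
    n_events : number of jumps of N at the end of the interval
    """
    for i, xi in enumerate(x):
        xn[i] += xi * n_events
        row = gram[i]
        for j, xj in enumerate(x):
            row[j] += xi * xj * dt

K = 2
gram = [[0.0] * K for _ in range(K)]
xn = [0.0] * K
# Hypothetical stream: (covariates, interval length, events at interval end).
stream = [([1.0, 0.2], 0.5, 1), ([1.0, 0.8], 0.25, 0), ([1.0, 0.5], 0.25, 1)]
T = 0.0
for x, dt, n_events in stream:
    update_stats(gram, xn, x, dt, n_events)
    T += dt

# Normalised statistics: K x K and K x 1, whatever the length of the stream.
sigma_hat = [[g / T for g in row] for row in gram]
v_hat = [c / T for c in xn]
print(sigma_hat, v_hat)
```

Only `gram` and `xn` are kept in memory, so the storage cost is of order $K^2$ regardless of how many updates are streamed from disk.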
The plan for the paper is as follows. The next section describes the estimation procedure and the assumptions for validity of the statistical procedure. Then, we show that we can obtain consistency for a high-dimensional counting process with no need to use a penalty or additional tuning parameters. We conclude the section with a central limit theorem for a two-stage estimator for the parameters. Section 3 applies the results to the estimation of a point process for buy and sell arrivals where the intensity depends on order book variables in a nonlinear but additive way. In Section 3.2, we discuss the scope and limitations of the specific model we choose. Section 4 contains some final remarks. Proofs are in Section A.1 of the Supplementary Material. Additional details, which we may refer to, can also be found in the Supplementary Material. This includes a finite-sample study using simulations.

The Estimation Problem
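Given (1), our estimator $\hat b$ for $b^*$ is the solution of the constrained problem

$$\hat b := \arg\min_{b\ge 0}\left\{\frac{1}{T}\int_0^T \left(X(t)'b\right)^2 dt - \frac{2}{T}\int_0^T X(t)'b\,dN(t)\right\}. \tag{2}$$

This is a standard quadratic programming problem. Throughout, all vector inequalities are meant elementwise. Replacing the constraint with a penalty, this objective function has been used for high-dimensional problems by several authors (e.g., Gaïffas and Guilloux, 2012; Alaya et al., 2015). Note that the population version of the objective function is proportional to

$$E\int_0^T \left(X(t)'b\right)^2 dt - 2E\int_0^T X(t)'b\,dN(t),$$

and using the definition of the intensity, this is equal to

$$E\int_0^T \left(X(t)'b\right)^2 dt - 2E\int_0^T X(t)'b\,\lambda^*(t)\,dt. \tag{3}$$

When $\lambda^*(t) = X(t)'b^*$, it is easy to see that the constrained minimizer of the above display is $b = b^*$, when the constraint $b\ge 0$ holds for $b^*$ as well. Indeed, the display then equals $E\int_0^T \left(X(t)'(b-b^*)\right)^2 dt - E\int_0^T \left(X(t)'b^*\right)^2 dt$. In some situations, a positivity constraint on $b$ is meaningful. Then, we do not need to use a penalty to find consistent estimators in high dimensions. We shall give examples in Section 2.3.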
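As a minimal sketch of the sign-constrained estimation step, the code below minimises $b'\hat\Sigma b - 2b'\hat v$ subject to $b\ge 0$, assuming the criterion can be written in this form with $\hat\Sigma = \frac{1}{T}\int_0^T X(t)X(t)'\,dt$ and $\hat v = \frac{1}{T}\int_0^T X(t)\,dN(t)$. Cyclic coordinate descent is used here only as a simple stand-in for a general quadratic programming solver; the function name and the toy inputs are hypothetical.

```python
def nnls_coordinate_descent(sigma, v, n_sweeps=500, tol=1e-12):
    """Minimise b' sigma b - 2 v' b subject to b >= 0 (elementwise)."""
    K = len(v)
    b = [0.0] * K
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(K):
            if sigma[i][i] <= 0.0:
                continue  # covariate identically zero: keep coefficient at 0
            # Unconstrained minimiser in coordinate i given the others,
            # then projection onto [0, inf).
            r = v[i] - sum(sigma[i][j] * b[j] for j in range(K) if j != i)
            new_bi = max(0.0, r / sigma[i][i])
            max_change = max(max_change, abs(new_bi - b[i]))
            b[i] = new_bi
        if max_change < tol:
            break
    return b

sigma = [[1.0, 0.5], [0.5, 1.0]]
# v = sigma @ [2, 0]: the constraint is inactive for the first coefficient.
print(nnls_coordinate_descent(sigma, [2.0, 1.0]))
# Here the unconstrained solution has a negative second entry,
# so the constraint binds and that coefficient is set to zero.
print(nnls_coordinate_descent(sigma, [1.0, -0.5]))
```

No penalty or tuning parameter appears: the only regularization comes from the projection onto the nonnegative orthant.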

Assumptions
We introduce some additional notation and terminology. Fix an arbitrary positive $T$. We denote the first $n = N(T)$ jump times of $N$ by $T_0 < T_1 < \cdots < T_n \le T$, where $T_0 = 0$. The integral of (1) is the compensator of $N(t)$ and $M(t) := N(t) - \int_0^t \lambda^*(r)\,dr$ is an $\mathcal{F}_t$-martingale (see the discussion around (1)). Throughout, we use index subscripts to denote the relevant entry in either vectors or matrices. Let $S = \{i\le K : b^*_i > 0\}$ be the set of nonzero entries of $b^*$, $S^c = \{i\le K : b^*_i = 0\}$ be its complement, and $s = |S|$ be the cardinality of $S$. For an arbitrary vector $a\in\mathbb{R}^K$ and $U\subset\{1,2,\ldots,K\}$, we denote by $a_U\in\mathbb{R}^{|U|}$ the $|U|$-dimensional subvector of $a$ obtained by removing all the entries with index not in the set $U$. Define $\hat\Sigma := \frac{1}{T}\int_0^T X(t)X(t)'\,dt$ and $\hat\Sigma_S := \frac{1}{T}\int_0^T X_S(t)X_S(t)'\,dt$. We use $\|\cdot\|_p$ to denote the $\ell_p$ norm, $p\in(0,\infty]$. Finally, we write w.p.1 to mean with probability going to 1.

Assumption 1 (Model assumption).
The point process admits the intensity (1), which is supposed to be uniformly bounded by a constant $\bar\lambda$ (possibly going to infinity), $X := (X(t))_{t\ge 0}$ is a predictable process with values in $[0,1]^K$ for each $t\ge 0$, and $b^*\in[0,\infty)^K$.
One main restriction is that the intensity is bounded by a constant $\bar\lambda$. It could be relaxed to a moment condition, but at the cost of additional technical complications. In this case, it would not be possible to obtain an error bound that is logarithmic in $K$.

Assumption 2 (Eigenvalues condition).
There is a constant $\varphi_{\min} > 0$ such that the eigenvalues of $\hat\Sigma_S$ are all greater than $\varphi_{\min}$, w.p.1.
Let $L > 0$ and $\mathcal{R}(L,S) := \{b : \|b_{S^c}\|_1 \le L\|b_S\|_1\}$, where $S$ is the index set of active variables. The $(L,S)$-restricted $\ell_1$-eigenvalue of a matrix $A$, denoted by $\varphi^2_{\mathrm{comp}}(A,L,S)$, is defined as
A lower bound on (4) is the weakest assumption used to derive oracle inequalities for Lasso (van de Geer and Bühlmann, 2009). We shall use the following.

Assumption 3 (Compatibility condition).
There is a strictly positive constant $\varphi$ such that $\varphi^2_{\mathrm{comp}}(\hat\Sigma, 3/\sqrt{\nu}, S) \ge \varphi$ w.p.1, where $\nu$ is as in Assumption 4.
The positively constrained minimal $\ell_1$-eigenvalue of a matrix $A$ is defined as
A lower bound on (5) has been used by Meinshausen (2013) in the context of nonnegative least squares.
We shall refer to Assumptions 1-4 simply as the Assumptions.
As we shall discuss next, there are reasons to make $\bar\lambda$, $\varphi_{\min}$, $\varphi$, and $\nu$ depend on $T$. Hence, we shall allow $\bar\lambda = \bar\lambda(T)\to\infty$, and $\varphi_{\min} = \varphi_{\min}(T)$, $\varphi = \varphi(T)$, $\nu = \nu(T)$ to go to zero as $T\to\infty$, if needed. For ease of notation, we drop the dependence on $T$ in what follows.

Assumption 1.
We view the sign constraint as a hypothesis, and estimation is carried out under this hypothesis. Specifically, for the problem of high-frequency trading, we have a priori knowledge of whether the marginal impact of a high-frequency order book covariate is increasing or decreasing. Such information can be obtained from other studies (e.g., Kercheval and Zhang, 2015; Sancetta, 2018). Hence, given such information on the direction of the impact, we show how to improve estimation in high-dimensional problems.
For the sake of clarity, we now consider two examples. Consider the intensity
The parameters' restriction ensures that this intensity is always nonnegative. Then, $\lambda$ can be written as in (1) with $b^*\ge 0$. Hence, in our framework, we are able to control the direction of the impact by the linear transformation $x\mapsto 1-x$. From a computational point of view, this is equivalent to changing the sign of the covariate and imposing an additional inequality constraint. Furthermore, possible nonlinearity of the impact of a covariate can lead to high-dimensional problems. For example, suppose that $Z(t)$ takes values in $[0,1]$ and
where
The coefficients of Bernstein polynomials have a clear physical interpretation for $K\to\infty$:
(Lorentz, 1986, Thm. 1.1.1). Hence, the above display suggests that if $g\ge 0$, the assumption that $a_i\ge 0$ is almost necessary when $K$ is large. The argument can be extended to a dimension greater than 1. However, we note that the assumptions on $\hat\Sigma$ will fail as $K\to\infty$, unless we allow $\varphi$ and $\nu$ to go to zero slowly enough. This is allowed in our results (Theorems 1 and 2).
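The Bernstein polynomial example can be made concrete with a short sketch. The indexing below (degree $n$ with $n+1$ basis functions, and coefficients set to $g(i/n)$) is the textbook convention and is an assumption here; the paper's own indexing in terms of $K$ may differ. The function $g(x) = x^2$ is a hypothetical choice.

```python
from math import comb

def bernstein_basis(z, degree):
    """Bernstein basis polynomials of the given degree evaluated at z in [0, 1].

    Each entry lies in [0, 1] and the entries sum to one, so they can serve
    as bounded, nonnegative covariates built from a single variable Z(t).
    """
    return [
        comb(degree, i) * z**i * (1.0 - z) ** (degree - i)
        for i in range(degree + 1)
    ]

degree = 50
z = 0.3
basis = bernstein_basis(z, degree)

# Hypothetical nonnegative impact function g(x) = x^2; taking a_i = g(i/degree)
# gives nonnegative coefficients whose weighted sum approximates g(z).
coeffs = [(i / degree) ** 2 for i in range(degree + 1)]
intensity = sum(a * w for a, w in zip(coeffs, basis))
print(intensity, z**2)
```

Because the basis functions are nonnegative and sum to one, the resulting intensity is a weighted average of the coefficients and is therefore bounded by the largest coefficient.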
In the empirical section of this paper, we view the sign constraint as a hypothesis on the direction of the impact of the covariates. Relying on several hypotheses, we estimate models and compare their performance out of sample (see Section 3 for details). This remark would suggest that impact/sign misspecification would lead to a zero coefficient. Consider the first example in the above discussion. By impact misspecification, we mean using $X_3(t) = 1 - Z_2(t)$ when in fact the impact of $Z_2(t)$ on the intensity is positive. Because of the influence from other variables, the procedure may select a covariate with the wrong impact even when solving the population objective function (3). We carried out a number of numerical examples to find the solution to (3). We found that the constraint tends to be binding for a true negative coefficient when both $s$ and the number of wrongly signed variables are small. However, as either $s$ or the degree of misspecification increases, the constrained population estimator may have a positive sign even when the true one is negative. For the designs we considered in the Supplementary Material, selection of a misspecified variable when solving (3) was relatively infrequent. Details can be found in Section A.4 of the Supplementary Material.
We allow $\bar\lambda$, the constant upper bound on the intensity $\lambda^*$, to possibly grow to infinity. This is relevant in practice, as we may only have a crude upper bound for $\lambda^*$ that depends on $s$, the number of active variables. To see this, note that by assumption, $\lambda^* = X'b^* \le \|b^*_S\|_1 = O(s)$. Even if crude, with no further information on either the covariates or the coefficients, this is the best possible upper bound on $\lambda^*$. Hence, it is relevant to allow $\bar\lambda = O(s)$ and $s$ to diverge to infinity with $T$. To see how information can be used to find a tighter bound on $\bar\lambda$, consider the intensity in (6). Then, we know that the intensity is bounded by $\max_{i\le K} a_i$, which is the largest of the coefficients to be estimated.

Assumption 2.
Unlike Meinshausen (2013), to make the proofs simpler, we use Assumption 2. This is just slightly stronger than the Compatibility Condition $\varphi^2_{\mathrm{comp}}(\hat\Sigma,0,S)\ge\rho$ for some $\rho > 0$ w.p.1, which is needed in the proofs. To see this, note that, by the nonnegativity of the coefficients, the latter implies a condition on the entries $\hat\Sigma_{i,j}$ of $\hat\Sigma$. Suppose that $\rho$ is a lower bound on the smallest eigenvalue of $\hat\Sigma_S$ w.p.1. Then, $b_S'\hat\Sigma_S b_S \ge \rho\|b_S\|_2^2$ for every $b_S\in\mathbb{R}^s$, w.p.1. Using the fact that the $\ell_1$ norm of an $s$-dimensional vector is bounded by $\sqrt{s}$ times its $\ell_2$ norm, this means that $\varphi^2_{\mathrm{comp}}(\hat\Sigma,0,S)\ge\rho$ holds w.p.1. In practice, this means that the covariates with index in the active set $S$ need to be linearly independent w.p.1, when $T\to\infty$. This is plausible, as $s$ should be relatively small, i.e., $b^*$ is sparse.
It is instructive to consider conditions on $X$ that imply this assumption. Recall that $\hat\Sigma_{i,j} := T^{-1}\int_0^T X_i(t)X_j(t)\,dt$. Suppose that $(X_i(t)X_j(t))_{t\ge 0}$ is ergodic, for all $i,j\in S$, in the sense that $\hat\Sigma_{i,j}\to\Sigma_{i,j}$ a.s., where $\Sigma_{i,j}$ is a constant, $i,j\in S$. By boundedness of the covariates, the convergence must also hold in $L_2$. Therefore, if $S$ has bounded cardinality, $\hat\Sigma_S - \Sigma_S$ converges to zero in expected Frobenius norm. This implies convergence of the eigenvalues of $\hat\Sigma_S$ to the ones of $\Sigma_S$ (Bosq, 2000, Thm. 4.4). Incidentally, boundedness implies that all the moments of $\hat\Sigma_{i,j}$ converge to $\Sigma_{i,j}$, $i,j\in S$.
The above definition of ergodic is closely linked to the existence of an asymptotic mean stationary measure such that $\Sigma_S$ is the expectation of $\hat\Sigma_S$ with respect to that measure (see Gray and Kieffer, 1980, Thm. 1, and Gray, 2009, Chap. 6, for an extensive treatment). The argument does not require stationarity of $(X_i(t)X_j(t))_{t\ge 0}$. As a simple example of why this can be relevant, consider
It is common in high-frequency data to have time-varying (unconditional) volatility of certain quantities, due to intraday nonstationarity. However, by ergodicity and boundedness, there is a constant
In practice, this is plausible when $T$ spans multiple days.
Finally, we note that if $s$ increases, the dependence among the active variables may increase, making a lower bound $\varphi_{\min}$ on the smallest eigenvalue smaller. For this reason, we allow the bounds in our results to explicitly depend on $\varphi_{\min}$.

Assumption 3.
We use the same Compatibility Condition as in Meinshausen (2013). Meinshausen (2013) also uses $\varphi^2_{\mathrm{comp}}(\hat\Sigma,0,S)$. As discussed above, by Assumption 2, we do not need this additional condition. To verify the Compatibility Condition using population quantities, we can use approximations. Suppose that there is a sequence $\epsilon_T\to 0$ such that $\max_{i,j}|\hat\Sigma_{i,j}-\Sigma_{i,j}| = O_P(\epsilon_T)$, where $\Sigma_{i,j}$ is understood to be the limit (in probability) of $\hat\Sigma_{i,j}$, $i,j\in\{1,2,\ldots,K\}$. If the number of active variables satisfies $s = o_P(\nu\,\epsilon_T^{-1})$ with $\nu$ as in Assumption 3, then we can deduce that the Compatibility Condition is satisfied if $\nu > 0$ and the smallest eigenvalue of $\Sigma$ is strictly positive (van de Geer and Bühlmann, 2009, Cor. 10.1 and discussion). A bound on the minimal eigenvalue of $\Sigma$ requires that the covariates are not asymptotically dependent. To establish such a bound, we will need rates of convergence. Ergodicity alone, as discussed in the previous remark, is not sufficient even when $S$ has finite cardinality. To see this, note that when $s$ is bounded we still need $\max_{i,j}|\hat\Sigma_{i,j}-\Sigma_{i,j}| = o_P(1)$ for $i,j\in\{1,2,\ldots,K\}$ and not just for $i,j\in S$. Given that $K$ is not necessarily bounded, this requires convergence rates.
We also note that as $K$ increases, the smallest eigenvalue of $\Sigma$ may tend to zero. For this reason, our bounds are in terms of $\varphi$ to show how quickly the convergence rate may deteriorate if $\varphi\to 0$.

Assumption 4.
This assumption is the same as in Meinshausen (2013). For Assumption 4 to hold, it is sufficient that $\min_{i,j}\hat\Sigma_{i,j} > 0$ a.s., $i,j\in\{1,2,\ldots,K\}$. This is satisfied if the covariates $X_i(t)$ do not have a disjoint support for all $t\ge 0$. Then, the average of $X_i(t)X_j(t)$ is greater than zero because $X_i(t)\in[0,1]$, $t\ge 0$, unless some of the covariates are exactly zero. This condition is satisfied in our empirical study. However, there are parameterizations that lead to disjoint supports. Notable examples are splines and one-hot encoding (Alaya et al., 2019). One-hot encoding essentially builds mutually disjoint bins for each covariate and constructs dummy variables for each bin. This leads to mutually disjoint covariates. As discussed in Meinshausen (2013, Exam. III), the following setup covers splines and one-hot encoding, and satisfies Assumption 4. Fix a positive integer $L$, and suppose that $\{P_l : l = 1,\ldots,L\}$ is a partition of $\{1,2,\ldots,K\}$ such that $\hat\Sigma_{i,j} > \nu L$ a.s. if $i,j\in P_l$ for some $l$, and $\hat\Sigma_{i,j} = 0$ if $i\in P_l$, $j\in P_k$ with $k\ne l$. This means that covariates that have an index in different partitions have disjoint support. In this case, for $b\ge 0$,

$$b'\hat\Sigma b = \sum_{l=1}^{L}\sum_{i,j\in P_l} b_i b_j\hat\Sigma_{i,j} \ge \nu L\sum_{l=1}^{L}\Big(\sum_{i\in P_l} b_i\Big)^2.$$

Then, using the fact that, by Jensen's inequality,

$$\sum_{l=1}^{L} a_l^2 \ge \frac{1}{L}\Big(\sum_{l=1}^{L} a_l\Big)^2$$

for any constants $a_1,a_2,\ldots,a_L$, we know that

$$b'\hat\Sigma b \ge \nu\Big(\sum_{i=1}^{K} b_i\Big)^2 = \nu\|b\|_1^2$$

when $\min_i b_i\ge 0$. Then, Assumption 4 is satisfied.
As mentioned above, we require $\hat\Sigma_{i,j} > \nu L$. Given that $\hat\Sigma_{i,j}\in[0,1]$, we need $\nu\to 0$ if the number of partitions $L$ diverges to infinity. This is of interest for certain nonparametric estimators like one-hot encoding and splines with an increasing number of knots.
We conclude this section by giving an intuition on why Assumption 4 leads to results comparable to $\ell_1$ penalization. By Assumption 4, $b'\hat\Sigma b\ge\nu\|b\|_1^2$ w.p.1 as long as $b\ge 0$. Hence, bounds on $b'\hat\Sigma b$ will translate into bounds for $\|b\|_1^2$. However, as soon as $\nu\to 0$, the control of $\|b\|_1^2$ using $b'\hat\Sigma b$ becomes loose and the procedure will underperform standard estimation with an $\ell_1$ penalty. Similar remarks, but in the context of regression analysis, are made in Meinshausen (2013).

Asymptotic Results
2.4.1. Consistency. We shall keep track of all the constants to see how the bound is affected. This is useful, for example, if $\varphi_{\min}\to 0$ slowly enough. Similarly, we can allow $\bar\lambda\to\infty$. This is important, because $T/\bar\lambda$ is one of the main quantities affecting the convergence rate. Define $c(s) := \max\{s^2/\varphi^2, 1/\nu\}$ and $\mu_T := \frac{1}{T}\int_0^T E\lambda^*(t)\,dt$. Note that $\mu_T\le\bar\lambda$. We have consistency of the estimator $\hat b$ for $b^*$, under the $\ell_1$ norm. Throughout, to simplify the notation, we assume that $K\ge 2$ in all the results that follow.

THEOREM 1. Under the Assumptions, if log
The second result is an estimation of the prediction error.
THEOREM 2. Under the Assumptions, if $\log K = O(T/\bar\lambda)$, then
Theorems 1 and 2 make explicit the dependence on the parameters $\varphi$, $\varphi_{\min}$, $\nu$, and $\bar\lambda$ defined in the Assumptions. Hence, we can have consistency even when $\varphi_{\min},\varphi,\nu\to 0$ and $\bar\lambda\to\infty$ at suitable rates as $T\to\infty$. When $\nu,\varphi,\varphi_{\min}$ are fixed, $c(s) = O(s^2)$. Then, we have the following corollary.
COROLLARY 1. Under the Assumptions, if $\bar\lambda$ is fixed and bounded away from infinity, $\varphi_{\min},\varphi,\nu$ are fixed and bounded away from zero, and $\log K = O(T)$, then (7) and (8) hold true.
The convergence rate derived here is typical of high-dimensional estimation problems under some form of regularization. For example, the convergence rate in our error bounds is similar to the one derived for the nonnegative least-squares regression problem with i.i.d. Gaussian errors (Meinshausen, 2013, Thms. 1 and 2, and Slawsky and Hein, 2013, Thm. 2). We can relate to those results, assuming that the conditions of Corollary 1 hold. Then, we have $\ell_1$ consistency of the estimator if $T^{-1}(s^4 + s^2\ln K)\to 0$ (see (7)). On the other hand, Meinshausen (2013, first part of his Thm. 1) says that, in the regression case, we have $\ell_1$ consistency if $n^{-1}s^4\ln K\to 0$, where $n$ is the sample size.
Suppose that the regressors are bounded. Under a lower bound on the smallest nonzero elements in the regressor coefficients, results in Meinshausen (2013, Thm. 2) imply (empirical) $L_2$ consistency of the regression function when $n^{-1}s\ln K\to 0$. With no such condition on the regression coefficients, in the present context, we can obtain $L_2$ consistency of the intensity estimator when $T^{-1}(s^3 + s\ln K)\to 0$ (see (8)). We can also consider the $L_2$ consistency of the estimated intensity in Gaïffas and Guilloux (2012) for the Aalen intensity model, using a data-driven Lasso penalty. Using a restricted eigenvalue condition, Gaïffas and Guilloux (2012, Thm. 2) achieve consistency if $n^{-1}s\ln K\to 0$, where, again, $n$ is the sample size. Given that the model setup differs, and we cannot rely on a penalty, the method of proof in Gaïffas and Guilloux (2012) is different. It can be difficult to discuss convergence rates beyond the aforementioned remarks.
The next result is about set identification.
Let $\hat S := \{i\le K : \hat b_i > \epsilon\}$. Then, under the above conditions, for any $\epsilon$ such that
Corollary 2 says that $\hat S$, the estimated support of $b^*$, is a superset of $S$, the true support, w.p.1. Using a threshold on the coefficients, we can achieve set identification, w.p.1. Note that we can have $\kappa\to 0$ slowly enough and similarly for $\epsilon$.
2.4.2. Convergence in Distribution. We obtain a central limit theorem for the ordinary least squares estimator for $b^*$. At first, we estimate $\hat b$ in (2) and obtain $\hat S$ as in Corollary 2. Then, we compute $\hat b_{OLS}$, which is the $K$-dimensional vector such that its entries with index in $\hat S$ are equal to

$$\hat\Sigma_{\hat S}^{-1}\,\frac{1}{T}\int_0^T X_{\hat S}(t)\,dN(t),$$

while all the other entries are zero. Note that the cardinality of $\hat S$ can still grow with $T$. Given a fixed $K$-dimensional vector $\alpha$ satisfying $\alpha'\alpha = 1$ and $\alpha_{\hat S}'\alpha_{\hat S} > 0$, we are interested in the asymptotic distribution of
$S$ to be the submatrix that includes only the entries with row and column indices in $S$. We state an additional assumption.
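The two-stage procedure (threshold the constrained estimate, then refit by least squares on the selected support) can be sketched as below. The sketch assumes that the refit solves $\hat\Sigma_{\hat S}\,b_{\hat S} = \hat v_{\hat S}$ with $\hat v = \frac{1}{T}\int_0^T X(t)\,dN(t)$; the function names, the threshold value, and the toy inputs are hypothetical.

```python
def solve_linear(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def two_stage(b_hat, sigma, v, threshold):
    """Select the support by thresholding b_hat, then refit without constraint."""
    support = [i for i, bi in enumerate(b_hat) if bi > threshold]
    b_ols = [0.0] * len(b_hat)
    if support:
        sub_sigma = [[sigma[i][j] for j in support] for i in support]
        sub_v = [v[i] for i in support]
        for i, val in zip(support, solve_linear(sub_sigma, sub_v)):
            b_ols[i] = val
    return support, b_ols

sigma = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
v = [2.25, 1.5, 0.55]        # equals sigma times (2, 0.5, 0)
b_hat = [1.9, 0.3, 0.0]      # hypothetical first-stage constrained estimate
print(two_stage(b_hat, sigma, v, threshold=0.05))
```

Entries outside the selected support are set to zero, so the refit only inverts a matrix of the size of the estimated support.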

Assumption 5. We have that lim T i,j∈S Var
where $\varphi_{\min}$ is as in Assumption 2, and $E\hat\Sigma_S$ and $E\hat\Sigma^N_S$ both converge to full-rank constant matrices.
Note that $\mathrm{Var}(\hat\Sigma_{i,j}) = o(1)$ under ergodicity assumptions on $X_iX_j$. As already mentioned, ergodicity and boundedness imply that $\hat\Sigma_{i,j}$ converges in $L_2$ to a constant $\Sigma_{i,j}$. By boundedness, we also know that $E\hat\Sigma_{i,j}$ converges to a constant limit $\Sigma_{i,j}$. However, we need the rate of convergence to be fast enough to ensure convergence of all the entries in $\hat\Sigma_S$. For example, this is trivially satisfied if $s$ is bounded or if we assume a bound on $\max_{i,j\in S}\mathrm{Var}(\hat\Sigma_{i,j})$ in terms of $T$. The same argument applies to $\hat\Sigma^N_S$. In this case, write
using the definition of $M$ and intensity (see the discussion around (1)). Being a martingale, the second term on the right-hand side converges to zero in probability, $i,j\in S$. Under ergodicity assumptions on $(X_i(t)X_j(t)X_k(t))_{t\ge 0}$ for $i,j,k\in S$, we also have that the first term on the right-hand side converges to a constant and the same applies to its moments because of dominated convergence. We can now state a central limit theorem for our estimator.
THEOREM 3. Suppose that the assumptions of Corollary 2, Assumption 5,
where $N(0,1)$ is the standard normal distribution and
Note that the conditions are for $S$ rather than $\hat S$. By Corollary 2, the two sets are the same, w.p.1. By assumption, $\min_{i\in S} b^*_i > \kappa$ for $\kappa$ as in Corollary 2. When $\kappa$ is fixed, the post-selection asymptotics in Theorem 3 are valid as they are uniform in $b^*$ (Leeb and Pötscher, 2005). When $\kappa$ is allowed to go to zero as in Corollary 2, post-selection asymptotics are still valid, but the convergence to a normal can be arbitrarily slow (mutatis mutandis; see Leeb and Pötscher, 2005, p. 29ff). Intuitively, this follows from the fact that as $\kappa\to 0$, it becomes harder to distinguish very small coefficients from zero coefficients.
Next, we shall define a nonlinear additive model for buy and sell trades based on order book and trade variables.

EMPIRICAL APPLICATION: ORDER BOOK DETERMINANTS OF CRUDE OIL BUY AND SELL TRADE ARRIVALS
We estimate the intensity of buy and sell trade arrivals separately. We investigate the impact of features constructed from quote and trade data on these buy and sell events. The features include order book imbalance, trade imbalance, spread, and durations. Quote data appear to be highly informative (Hall and Hautsch, 2007; Cont et al., 2014; Kercheval and Zhang, 2015; Sancetta, 2018). For example, order book imbalances are a quantity that has been used by practitioners for more than two decades (MacKenzie, 2017). However, most of the attention in the econometric literature has been on trade data. We use data from the front month of crude oil futures traded on the Chicago Mercantile Exchange (CME). The CME ticker is CL, and the sample period is May 1, 2013 to June 5, 2013, each day from 13:30 to 18:00 GMT. As auxiliary instruments, we use information from heating oil (HG), natural gas (NG), and the S&P500 (ES) futures, where the CME ticker is in parentheses. The data were collected by a high-frequency proprietary trading firm co-located in the Aurora data center in Chicago. The data are at the highest level of granularity, comprising all quotes and trades, time stamped at nanosecond resolution.

The Model
We use different model specifications to capture the nonlinearities in the impact of the covariates on the intensity. We separately consider buy and sell trade arrivals.
Note: The column "Smoothing" reports the smoothing parameter used if an EWMA had been applied to the original variable.
We want to assess the extent to which these variables impact the intensity in a nonlinear way. We consider different hypotheses. To evaluate the hypotheses, we carry out a test for model performance. The results for the estimated models show that the impact is nonlinear as expected and allow us to characterize the shape of the impact. In order to formulate our hypotheses, we first define the covariates.
3.1.1. The covariates. The covariates are reported in Table 1. We apply exponentially weighted moving average (EWMA) filters to some of the covariates. The EWMA of a variable $X(t_i)$ with smoothing parameter $\alpha$ is

$$\mathrm{EWMA}(X(t_i)) = \alpha\,\mathrm{EWMA}(X(t_{i-1})) + (1-\alpha)X(t_i),$$

where $\mathrm{EWMA}(X(t_1)) = X(t_1)$. Here, $t_1$ is the time of the first update in the variable $X$ at the start of each day, whereas $t_i$ is the time of the $i$th update. EWMAs are computed for each day. Note that the covariates are updated at discrete times for each instrument and the update times are different from the trade update times $T_j$.
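A minimal sketch of the EWMA filter follows. It assumes the common convention in which $\alpha$ is the weight on the previous filtered value and $1-\alpha$ the weight on the new observation, with the filter restarted at the first update of each day; the function name and the sample values are hypothetical.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average over one day of updates.

    Assumed recursion: out[0] = values[0] and
    out[i] = alpha * out[i - 1] + (1 - alpha) * values[i].
    """
    out = []
    for x in values:
        if not out:
            out.append(x)  # initial condition: first update of the day
        else:
            out.append(alpha * out[-1] + (1.0 - alpha) * x)
    return out

# Hypothetical durations (in seconds) for two days, filtered day by day.
days = {"day 1": [0.20, 0.05, 0.90], "day 2": [0.10, 0.10, 0.40]}
for day, durations in days.items():
    print(day, ewma(durations, alpha=0.98))
```

With $\alpha$ close to one, as in the values used for the covariates, the filtered series moves slowly and new observations receive little weight.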
We then sample the data at times that are the union of each covariate update and the times $T_j$ for the traded instrument only, i.e., CL. This reduces the total number of updates within each day. Finally, to ensure that the covariates are predictable, we make them left continuous by lagging them after sampling. Variables are mapped, by linear transformation, into $[0,1]$. However, spread and durations are first capped and then scaled by the cap so that they take values in $[0,1]$. The book volume imbalance at level $j\in\{1,2,\ldots,5\}$, $\mathrm{VolImb}_j$, is defined as
where $\mathrm{BidSize}_j$ is the bid size (quantity) of the $j$th level bid, and similarly for $\mathrm{AskSize}_j$. This variable takes values in $[-1,1]$. We map it to $[0,1]$ by the standard linear transformation: add 1 and divide by 2. Hence, a value of 0.5 corresponds to a book volume imbalance equal to zero. The trade imbalance is computed from the EWMA of the signed traded volume every time there is a trade. We then divide it by the EWMA of the unsigned volumes. The EWMA parameter is $\alpha = 0.98$ for both denominator and numerator. Durations are in seconds with nanosecond decimals, capped to 1 second. They are then passed to EWMA filters with parameters $\alpha = 0.98$ and $0.90$. The spread is capped to four ticks and standardized by 4. Hence, the minimum value it can take is 0.25. Additional details regarding the data and the calculations are in Section A.2 of the Supplementary Material.
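The mappings into $[0,1]$ described above can be sketched as follows. The level-wise imbalance formula (bid size minus ask size, divided by their sum) is an assumption made for the example, as are the function names and sample values.

```python
def volume_imbalance(bid_size, ask_size):
    """Assumed level-wise imbalance in [-1, 1]: (bid - ask) / (bid + ask)."""
    return (bid_size - ask_size) / (bid_size + ask_size)

def imbalance_to_unit(imbalance):
    """Map [-1, 1] linearly onto [0, 1]; zero imbalance maps to 0.5."""
    return (imbalance + 1.0) / 2.0

def cap_and_scale(value, cap):
    """Cap a nonnegative variable and divide by the cap, giving a value in [0, 1]."""
    return min(value, cap) / cap

# Hypothetical top-of-book sizes, spread in ticks, and duration in seconds.
print(imbalance_to_unit(volume_imbalance(30, 10)))  # more bids than asks
print(cap_and_scale(1, cap=4))                      # one-tick spread
print(cap_and_scale(7, cap=4))                      # wide spread hits the cap
print(cap_and_scale(0.35, cap=1.0))                 # duration below the cap
```

After these mappings, a hypothesised negative impact is handled by the further transformation $x\mapsto 1-x$ rather than by a negative coefficient.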
All variables are assumed to be mapped to $[0,1]$ as described above. Then, if we hypothesize that the impact of a covariate is negative, we apply the transformation $x \mapsto 1 - x$ as discussed in Section 2.3. For example, we would apply this transformation to durations: it is natural to assume that longer past durations are associated with a lower current intensity. We then apply a set of transformations $x \mapsto g_l(x)$, $l = 1,2,3$, where $g_1(x) = x$, $g_2(x) = (x - 0.5)^2$, and $g_3(x) = x^3$. The functional forms can account for different types of impacts: $g_1(x)$ has constant marginal impact, as opposed to $g_3(x)$, which is close to zero for most values of $x$ not in the proximity of one. The latter transform seems more appropriate for volume imbalances and most covariates. The function $g_2(x)$ is just convex with a minimum at $x = 0.5$. Combining these functions, we can approximate different types of impacts and still be able to interpret the results. We now state the hypotheses for model restriction.
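The construction of the transformed covariates can be sketched as follows. The three functions are those given in the text; the helper `expand` and its arguments are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def g1(x: float) -> float:
    """Linear transform: constant marginal impact."""
    return x


def g2(x: float) -> float:
    """Convex transform with a minimum at x = 0.5."""
    return (x - 0.5) ** 2


def g3(x: float) -> float:
    """Cubic transform: close to zero unless x is near one."""
    return x ** 3


TRANSFORMS: Dict[int, Callable[[float], float]] = {1: g1, 2: g2, 3: g3}


def expand(x: float, negative_impact: bool, transform_ids: Sequence[int]) -> List[float]:
    """Flip the covariate if its hypothesized impact is negative, then
    apply each selected transform separately."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("covariate must already be mapped to [0, 1]")
    z = 1.0 - x if negative_impact else x
    return [TRANSFORMS[l](z) for l in transform_ids]


# A duration covariate (negative impact) expanded with the linear and
# cubic transforms.
duration_features = expand(0.25, negative_impact=True, transform_ids=(1, 3))
```

Each covariate contributes one column per selected transform, so choosing two transforms doubles the number of columns.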
3.1.2. Model Hypothesis 1. The impact of all volume imbalances and trade imbalance is positive for buy trade intensity and negative for sell intensity. The impact of the spread and durations is always negative. Finally, we apply, separately, the following two transformations: $g_1$ and $g_3$, i.e., linear and cubic. Hence, this doubles the final number of covariates to estimate.
3.1.3. Model Hypothesis 2. The impact of all covariates is as in model hypothesis 1. However, we finally apply, separately, the following two transformations: $g_2$ and $g_3$, i.e., quadratic and cubic. Again, this doubles the final number of covariates to estimate.
3.1.4. Model Hypothesis 3. The impact of all covariates is as in model hypothesis 1, except for spread, for which we now suppose a positive impact. Finally, we apply, separately, the same transformations as in model hypothesis 2.
3.1.5. Remarks on the Model Hypotheses. We briefly comment on the hypotheses. First, the direction of the impact is a maintained hypothesis when estimating the model. If this is wrong, it is likely that the estimated coefficient will be zero. Second, regarding the transformations, we allow for the possibility of combining two functions (out of three) in order to derive a more flexible model. At the same time, we also rely on the estimation procedure to select the best submodel in an automated fashion. For example, assume that the direction of the impact of the covariates is as in model hypothesis 1 and that the impact is linear. Then, model hypothesis 1 is true, whereas model hypothesis 2 is false. We have included model hypothesis 3 to illustrate the point that using the wrong direction of the impact for the spread can lead to a zero coefficient.
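The remark that a wrongly signed covariate tends to receive a zero coefficient can be illustrated with a toy nonnegativity-constrained quadratic program. This is a sketch under stated assumptions and not the estimator of the paper: the objective $\frac{1}{2} b'Ab - c'b$ is a generic stand-in, the numbers are made up, and the solver is plain cyclic coordinate descent.

```python
from typing import List, Sequence


def nonneg_quadratic_min(
    A: Sequence[Sequence[float]], c: Sequence[float], sweeps: int = 200
) -> List[float]:
    """Minimise 0.5 * b'Ab - c'b subject to b >= 0.

    Assumes A is symmetric positive definite, so each coordinate update
    has a closed form followed by projection onto [0, inf).
    """
    k = len(c)
    b = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            rest = sum(A[j][m] * b[m] for m in range(k) if m != j)
            b[j] = max(0.0, (c[j] - rest) / A[j][j])
    return b


# Hypothetical example: the second covariate enters with the wrong sign,
# so its unconstrained coefficient would be negative.
A = [[1.0, 0.3], [0.3, 1.0]]
c = [1.0, -0.5]
b_hat = nonneg_quadratic_min(A, c)
```

In this example the constraint binds for the second coefficient, which is set exactly to zero, while the first is estimated as if the second covariate were absent. No penalty or tuning parameter is involved; the sparsity comes from the sign constraint alone.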
For this problem, we have high confidence in the direction of the impact based on the existing literature (recall the discussion in Section 2.3).
After the application of the EWMA filters and the transformations, each hypothesis has 18 covariates for each of the four instruments, which means 72 parameters. Rather than making this a pure exercise in data mining, we prefer to keep the number of covariates relatively small to simplify our discussion and focus on the significance of the results.

Scope and Limitations
The goal is to present a model in the same vein as the ones used by high-frequency trading firms. The focus is on the order book as the main source of information. The EWMAs of trade durations attempt to capture the well-known autocorrelation of durations (Engle and Russell, 1998). The model is geared toward high-frequency predictions in live trading. It is then customary to restrict variables to finite ranges to avoid consuming corrupted/invalid messages from the exchange. This may sound problematic for some fat-tailed variables like durations. However, as durations increase, the intensity should shrink. Capping durations at a relatively high value is equivalent to saying that the decreasing marginal effect of durations beyond the cap value is zero. Given that for crude oil we can have hundreds of trade arrivals within a second, capping durations at 1 second seemed reasonable.
The model does not include an intraday seasonal component. However, part of this seasonality is implicitly captured by a slow-moving EWMA of durations, so we did not further increase the number of variables, as we want to showcase the importance of the order book.
A final comment pertains to the separate estimation of buy and sell intensities. Suppose the following feedback loop effect: where $\rho_{buy}$ and $\rho_{sell}$ are constants in $[0,1)$. This system has the reduced form In consequence, separate estimation of the buy and sell intensities is equivalent to estimation of the above reduced form model.
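As an illustration of how such a reduction works, consider a hypothetical linear feedback system. This is only one specification consistent with the description, not necessarily the one intended by the authors. Here $a^{buy}(t)$ and $a^{sell}(t)$ are assumed names for nonnegative components driven by the covariates.

```latex
\begin{align*}
\lambda^{buy}(t)  &= a^{buy}(t)  + \rho_{buy}\,\lambda^{sell}(t), \\
\lambda^{sell}(t) &= a^{sell}(t) + \rho_{sell}\,\lambda^{buy}(t).
\end{align*}
% Substituting the second equation into the first and solving:
\begin{align*}
\lambda^{buy}(t)  &= \frac{a^{buy}(t) + \rho_{buy}\,a^{sell}(t)}{1 - \rho_{buy}\rho_{sell}}, \\
\lambda^{sell}(t) &= \frac{a^{sell}(t) + \rho_{sell}\,a^{buy}(t)}{1 - \rho_{buy}\rho_{sell}}.
\end{align*}
```

Since $\rho_{buy}, \rho_{sell} \in [0,1)$, the denominator is strictly positive. If $a^{buy}$ and $a^{sell}$ are linear in the covariates with nonnegative coefficients, each reduced-form intensity is again linear in the same covariates with nonnegative coefficients, so each can be estimated separately.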

Results
To carry out a test of performance measurement, we split the data into two halves. We use the data until May 18, 2013 for estimation and the subsequent sample for testing. We shall only report results for buy trades. The ones for sell trades are essentially identical.
Notes to Table 2: For each of the three hypotheses, the active covariates and the estimated coefficients are reported. The prefix and suffix in each covariate name are separated by an underscore. The prefix represents the ticker and the suffix the index $l$ of the transform $g_l$, $l = 1,2,3$, defined at the end of the paragraph "The covariates" in Section 3.1. The total number of estimated parameters is denoted by $K$ and the estimated number of nonzero elements by $\hat{s}$.
The estimator for all models is sparse, as only 10%-15% of the coefficients are nonzero (Table 2).An interesting finding is that the estimation based on hypothesis 2 leads to an impact of order book volume imbalances that is not monotonic.Moreover, all hypotheses suggest that order book volume imbalances beyond the third level are not important.
We then test the performance of the models out of sample. As already mentioned, we split the sample into estimation and test samples of roughly the same size each. Let $\lambda^{(k)}$ be the intensity estimator from (2) using the estimation sample for model hypothesis $k \in \{1,2,3\}$. The log-likelihood on the test sample is Its predictable part is Under the null hypothesis that model hypothesis 1 and hypothesis 2 perform the same, we have that $H^{(1)}_T - H^{(2)}_T = 0$, against the one-sided alternative that model 2 is worse, i.e., $H^{(1)}_T - H^{(2)}_T > 0$.
Notes to Table 3: The standardized likelihood ratio test statistic between different model combinations is reported. The columns identify the null hypothesis. For example, H2-H1 is the null that the model implied by Hypothesis 1 performs as well as the one from Hypothesis 2. A large value rejects the null in favor of the alternative that the model from Hypothesis 2 performs better.
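The out-of-sample comparison can be sketched with the standard counting-process log-likelihood, $\int_0^T \log \lambda(t)\,dN(t) - \int_0^T \lambda(t)\,dt$. The sketch assumes that the intensity is piecewise constant between sampling times, which is natural when the covariates are updated at discrete times and held left continuous. It computes only the raw log-likelihood difference, not the standardized statistic. The function names and the data layout are hypothetical.

```python
import math
from typing import Sequence


def log_likelihood(
    intensity: Sequence[float],
    durations: Sequence[float],
    event_counts: Sequence[int],
) -> float:
    """Log-likelihood of a counting process with piecewise-constant intensity.

    intensity[i]    : intensity held over sampling interval i
    durations[i]    : length of interval i in seconds
    event_counts[i] : number of trades arriving at the end of interval i
    """
    if not (len(intensity) == len(durations) == len(event_counts)):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for lam, dt, n in zip(intensity, durations, event_counts):
        if lam <= 0.0:
            raise ValueError("intensity must be strictly positive")
        # n * log(lambda) is the jump part; lambda * dt is the compensator.
        total += n * math.log(lam) - lam * dt
    return total


def log_likelihood_difference(
    intensity_a: Sequence[float],
    intensity_b: Sequence[float],
    durations: Sequence[float],
    event_counts: Sequence[int],
) -> float:
    """Positive values favour model a over model b on the test sample."""
    return log_likelihood(intensity_a, durations, event_counts) - log_likelihood(
        intensity_b, durations, event_counts
    )
```

Both models must be evaluated on the same test intervals and trade counts, with intensities computed from parameters estimated on the estimation sample only.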
The test is performed for different pairs of model hypotheses. We find that hypothesis 2 is favored out of sample (Table 3). From the results in Tables 2 and 3, we infer that information from auxiliary instruments is relevant and instruments are interlinked. Moreover, we find that the impact of volume imbalances is not monotonic. It is also worth mentioning that the results for the sell intensity led to similar conclusions except for one case: we did not reject the hypothesis that model hypothesis 2 (H2) is not better than model hypothesis 3 (H3). Recall that the difference between these two model hypotheses is that the impact of the spread is in the wrong expected direction for H3. In this case, the spread is not selected when estimation is carried out under H3. The spread does not seem to be an important variable in this problem because all products are very liquid. Hence, their spread tends to be very tight most of the time. The only exception is heating oil (HO). In this case, inspection of the data did show some variation for the spread. However, overall, its importance is not as crucial as the other covariates. It is reassuring that, under H3, the spread for heating oil, having the wrong sign, is not selected.

CONCLUSION
This paper studies the estimation of a point process where the intensity is a function of high-dimensional variables. We rely on a nonnegativity constraint for the intensity to show that the estimation problem can be solved by standard quadratic programming with no need to include a penalty and tune additional parameters. The resulting estimator is consistent and the error only grows logarithmically in the number of estimated parameters, as long as the true parameter is sparse. Our motivation is the estimation of a possibly nonlinear additive model for the intensity of buy and sell trades of crude oil futures. The covariates that affect the intensity are order book and trade variables on crude oil together with information from auxiliary instruments such as natural gas, heating oil, and the S&P500 futures. The results show that the impact of variables constructed from the order book and trades is nonlinear and that the instruments are interconnected. A test for model performance is used, and it shows that we can compare competing hypotheses for the direction of the impact of the covariates and the functional form of the intensity.

Table 1. Covariates used for estimation

Table 2. Estimation results for buy trades

Table 3. Test of model performance for buy trades