Modeling the relationship between vertical temperature profiles and acute surface-level ozone events in the US Southwest by spatially smoothing a functional quantile regression estimator

Abstract With the proliferation of gridded data products, modeling the relationship between a scalar response and a functional covariate over the points on a regular lattice is becoming increasingly important. In this work, our overall aim is to better understand the relationship between high quantiles of surface-level ozone and the vertical temperature profile (VTP), a functional covariate, over the US Southwest in the summer. We develop our penalized functional quantile regression based approach within the framework provided by functional data analysis. As we assume that coefficient functions at points on the lattice exhibit spatial similarity, we obtain improved estimates by penalizing dissimilarity between nearby coefficient function estimates. In order to better account for the high degree of diversity of this region, we more strongly weight differences between coefficient function estimates at cells that exhibit a higher degree of geographical and climatological similarity. Our analysis suggests that the VTP is associated with acute surface-level ozone events during the summertime over this region, and the nature of this relationship differs spatially.


Introduction
Surface-level ozone (O 3 ) has negative health consequences that may be further magnified when it reaches its highest levels. The elderly and those with complicating health conditions, such as asthma, may be even more vulnerable to the impacts of surface-level O 3 . O 3 is formed via a chemical reaction between NO x and volatile organic compounds (VOCs) in the presence of sunlight. Therefore, emissions play a role in characterizing O 3 levels on the surface. However, it is not uncommon for 2 days with identical NO x and VOCs emissions to have drastically different surface-level O 3 amounts. It is well established that local and regional meteorological conditions account for a large portion of this discrepancy (Jacob and Winner, 2009;Tai et al., 2010;Liu and Cui, 2014;Porter et al., 2015). Broadly speaking, these works conclude that meteorological variables such as surface-level air temperature, circulation, and stagnation are related to surface-level air pollution; however, there are often several approaches for characterizing these meteorological conditions.
One way to capture these types of meteorological conditions is by considering atmospheric profile variables (APVs). An APV is considered to be a function mapping the altitude above a location on Earth's surface to a scalar measurement. The vertical temperature profile (VTP), which reports the temperature of the air at increasing altitude is an important APV in climate sciences. Modeling surface-level air pollution as a function of APVs has been considered previously (Du et al., 2013;Rendón et al., 2014Wolf et al., 2014Russell and Porter, 2021), but has received much less attention in the literature in comparison to approaches that consider surface-level (scalar) covariates exclusively. In this work, we investigate the association between the VTP and high levels of surface-level O 3 through the use of methods developed within the framework of functional data analysis (FDA). In FDA, some or all observations are taken to be functions, as opposed to scalars and/or vectors. To this end, in this work, we consider the VTP to be a functional covariate.
Modeling a scalar response variable as a function of a functional covariate is often done via the use of the functional linear model Horváth and Kokoszka (2012). As in a standard linear regression model, a functional linear model (with a scalar response) estimates the conditional mean response given the functional covariate. If one is interested in higher quantiles of the response distribution, a functional quantile regression model may be of use. Along these lines, Russell and Dyer (2017) suggest a penalized functional quantile regression model to investigate the impact of APVs on surface-level air pollution at locations in South Carolina and Florida. These authors conclude that surface-level air pollution is related to APVs at different pressure levels, and that these associations differ by location and season. As acute levels of surface-level O 3 may have a disproportionately higher impact on human health outcomes, we focus on modeling higher conditional quantiles of the O 3 response in this work.
The last 10 or 15 years have seen the proliferation of gridded data products, which offer climate and air pollution data over large spatial domains with excellent temporal coverage and no missing values. By its nature, the grid utilized in a gridded data product could be thought of as a regular spatial lattice. Others have considered approaches for analysis of data in these contexts Zhu et al., 2010;Zheng and Zhu, 2012;Reyes et al., 2015), although our research objectives and methods differ from these works. We emphasize that our primary goal is to better understand the relationship between the VTP and high surface-level O 3 in the summer in the US Southwest based on output from a gridded data product. That is, our objective is to model the relationship between a functional covariate and conditional quantiles of a scalar response at points on a regular lattice.
A simple analysis approach would be to fit independent functional linear regression models at each point on the lattice. Unfortunately, such an approach would likely be difficult to interpret due to the presence of excess noise. For our data, we believe that it is reasonable to assume that the functional regression relationships at adjacent points on the lattice are similar. For this reason, we consider an approach that institutes a penalty for the dissimilarity between estimated coefficient functions at adjacent locations on the lattice. This approach helps us with our objective of better understanding the relationship between VTP and key quantiles of surface-level O 3 at all points on the lattice.

Pointwise Functional Quantile Regression Modeling
For the random process Y, X f gwith iid copies Y t , X t f g t ∈ ℕ , assume the response variable of interest is Y t ∈ ℝ, and X t : 0, H ½ !ℝ is a functional covariate. Further, assume that X t f g t ∈ℕ are in L 2 , the separable Hilbert space determined by the set of measurable real-valued square-integrable functions on [0,H]. Given e21-2 Brook T. Russell and William C. Porter the value of the functional covariate, we directly model Q τ Y t jX t ð Þ, the conditional τth quantile of the scalar response variable for some We note that the intercept α τ ∈ℝ, and we call Þthe centered covariate function. Additionally, the twice differentiable function β τ is called the coefficient function for the τth quantile. In a typical analysis, estimation and inference regarding β τ in equation (1) is a primary aim, as portions of 0, H ½ over which the coefficient function is positive (negative), imply a positive (negative) relationship with that conditional quantile of the response. Additionally, we assume that β 00 τ ∈ L 2 , and that β τ and β 0 τ are both absolutely continuous.
As this is commonly done in FDA, we assume that β τ can be approximated by a linear combination of a finite set of known basis functions, ϕ k f g k¼1,…,K . This implies β τ s ð Þ≈ In reality, functional covariates x i f g i¼1,…,n are infinite dimensional and therefore not fully be observed. Instead, the analyst observes ,M is another set of known basis functions. Therefore, the conditional quantile in equation (1) can be expressed via Here, the M Â K dimensional matrix J is constructed such that the entry in the mth row and kth column is where ρ τ is the check-loss function commonly used in quantile regression (Koenker and Bassett Jr., 1978). The resulting optimum from equation (3) leads to the coefficient function estimator via the relationship

Functional Quantile Regression Modeling on a Grid
Climate scientists often make use of gridded data products, which can be thought of as producing output over the points on a regular spatial lattice. Assume that D is a spatial domain of interest, and that we observe data over the set of points on a regular lattice D 0 ⊂ ℤ 2 , such that D 0 ⊂ D. Despite the fact that the primary objective is to describe the association between the functional covariate and a key quantile of the scalar response everywhere in D, we assume that modeling this relationship ∀ s ∈D 0 will be useful. Additionally, we assume that the true coefficient functions at nearby locations are similar, which makes the approach of borrowing information from nearby cells a potentially promising way to improve coefficient function estimates. Assume that at time t ∈ 1, …, n f g , we observe realizations of a functional covariate and scalar response at locations s l f g l¼1,…,L such that s l ∈ D 0 ∀ l∈ f1,…,Lg, at a sequence of pressure levels 0 < h 1 < ⋯ < h U ≤ H. We propose a procedure to simultaneously estimate all L coefficient functions β τ,l È É l¼1,…,L , under the assumptions described above. We again assume that the true coefficient function at s l is well approximated by a finite linear combination of known basis functions, giving β τ,l h ð Þ≈ P K k¼1 b l,k,τ ϕ k h ð Þ ¼ ϕ T h ð Þb l,τ . Observations from the functional covariate are denoted by x l,t h ð Þ f g l ∈ 1, …, L f g ,t ∈ 1, …, n f g , and assume further that there exists a l,t ∈ ℝ M such that Environmental Data Science e21-3 x l,t h ð Þ≈ P M m¼1 a l,t,m ψ m h ð Þ ¼ ψ T h ð Þa l,t . We then approximate the conditional τth quantile of the response at time t and location s l by One could consider the estimatorÃ The estimator in equation (5) does not incorporate any sort of penalty for dissimilarity between neighboring estimated coefficient functions. As we believe that nearby coefficient functions exhibit similarity in our data application, we supplement the loss function in equation (5) by an additional penalty term.

Penalizing spatial dissimilarity
Our spatial region of interest is composed of lush forests, arid deserts, coastal regions, and high mountain terrain. Because of this geographical heterogeneity, some pairs of neighboring cells on the lattice may exhibit a higher (or lower) degree of spatial similarity in terms of their true underlying coefficient functions. For this reason, for neighboring locations s l and s l 0 (l 6 ¼ l 0 ), we propose the weighting function w θ,γ s l , s l 0 ð Þ¼θ exp Àγδ s l , s l 0 ð Þ ð Þ , where θ > 0 and γ ≥ 0 are unknown parameters and the function δ models geographic and or climatological dissimilarity between two locations. We propose the penalized estimator where Λ τ is defined in equation (6) and We note that equation (9) gives the sum of the spatial dissimilarity penalties over all pairs of adjacent cells in D 0 . Here, I Á f g denotes the indicator function, and s l $ s l 0 implies that s l and s l 0 are adjacent. The definition of adjacency is flexible and can be selected in a way that makes sense for a specific data application. Importantly, the b . The absolute loss penalty in equation (9) is used, because the absolute value function can be expressed in terms of the check-loss function, as seen in equation (10). This makes it straightforward to implement the penalty by augmenting the quantile regression design matrix.

Analysis of Surface-Level Ozone in the US Southwest
The primary research objective in this work is to gain a better level of understanding regarding the effects of VTP on acute daily surface- In order to more effectively penalize the dissimilarity between adjacent grid cells, we seek to identify a data product that is able to assess the degree to which two adjacent grid cells are similar from a geographical and climatological perspective. For this purpose, we use the PRISM data product (Daly et al., 1997) as it explicitly incorporates information about rainfall and implicitly incorporates information about geographical features such as elevation. Denote the 30-year average annual precipitation totals for PRISM cell Pr 30 i ð ÞI i∈ l f g ð Þ = P N Prism i¼1 I i∈ l f g, and i ∈l is the event that the midpoint of PRISM cell i is contained in cell l. Essentially, η s l ð Þ is the average of all PRISM 30-year annual precipitation totals for all PRISM locations in the grid cell l. The PRISM data product takes both geographical and climatological information into account, and therefore we find it useful to leverage it for this purpose. We implement our analysis procedure using B-spline basis functions to represent both the coefficient function as well as the functional covariates. We utilize 7 B-spline basis functions of order 4 with equally spaced interior knots to represent the functional covariates and estimated coefficient functions. The parameters θ and γ are estimated via a two-dimensional grid search using Bayesian information criterion (BIC) as the model comparison criterion. In keeping with our objective of modeling acute surface-level O 3 , we take τ ¼ 0:90. Figure 2 plots the resulting coefficient function estimates at all locations in the spatial domain for a sequence of eight equally spaced pressure levels: 825, 850, …, 975, 1,000. Recall that a pressure level of approximately 1,000 corresponds with the Earth's surface. At higher pressure levels (above 925 or 950), warmer temperatures are associated with larger high O 3 events. Around the pressure level 950, we begin to see this relationship disappear over large portions of this region. Instead, as we go lower in the atmosphere, larger high O 3 events are associated with cooler air over Nevada, Southern Utah, and Arizona, as well as coastal California. Interestingly, we see the opposite over the high mountains of California and in the far northern part of the state. This implies that air temperature inversions are associated with acute surface-level O 3 events over the majority of the US Southwest; in contrast, warmer air at the surface is linked to these types of air pollution events over the Sierra Nevada Mountains and in Northern California.

Discussion
In this work, at each location on a spatial lattice, we estimate the relationship between the VTP and high daily surface-level O 3 events during the summer in the US Southwest states of California, Nevada, Arizona, and Utah. We take a functional quantile regression approach, and as we believe that nearby coefficient functions are similar, we estimate at all points on the lattice simultaneously and penalize the spatial dissimilarity. In this work, we model acute surface-level O 3 events by estimating the conditional 0.90 quantile of the response distribution. Because of the geographic heterogeneity of this diverse region, we weight differences between coefficient function estimates at neighboring cells more if they are similar geographically and/or climatologically. Our analysis concludes that temperature inversions are a primary driver of acute O 3 events over Arizona, Utah, Nevada, and coastal California. In contrast, higher temperatures are associated with acute O 3 events over far Northern California and the Sierra Nevada Mountains. We determine two cells on the grid to be adjacent if they are connected to the N, S, E, or W. In the future, we hope to expand this definition to further outlying cells. Doing so would complicate estimation, but may improve estimates. Also, the model developed in this work does not incorporate temporal dependence. Although we believe that the results presented here would not change greatly, we hope to refine the modeling approach to account for temporal dependence in the future. We would also like to further refine our methodology to simultaneously consider other quantiles of interest, and to investigate the impact of alternative weighting functions. In the future, we believe that it would be interesting to consider combinations of other regions and seasons, to compare and contrast the results presented here. Similarly, it would be interesting to consider models other than GEOS-Chem model version 12.2.1. Other extensions to our modeling procedure include developing modeling procedures that are able to incorporate both gridded model output and in situ observations (O 3 and VTP). We hope to consider these ideas in future work.
We note that equation (9), with its absolute value penalty terms written in terms of finite differences, can also be expressed via the check-loss function, relying on the fact that ∀ τ ∈ ð0,1Þ and u∈ ℝ: Therefore, in practice, the optimization in equation (8) can be performed in a computationally efficient manner by constructing the design matrix determined by the loss function in equation (6), and augmenting it with the design matrix determined by the penalty Λ spat in equation (9), relying on the relationship in equation (10)). For this reason, functions designed for quantile regression with scalar covariates, such as R's quantreg package, may be utilized for optimization. With a large number of adjacent cells, the resulting design matrix will be very large, but also very sparse. This sparsity can be leveraged to make estimation reasonable.