Generalized Kernel Regularized Least Squares

Kernel Regularized Least Squares (KRLS) is a popular method for flexibly estimating models that may have complex relationships between variables. However, its usefulness to many researchers is limited for two reasons. First, existing approaches are inflexible and do not allow KRLS to be combined with theoretically motivated extensions such as random effects, unregularized fixed effects, or non-Gaussian outcomes. Second, estimation is extremely computationally intensive for even modestly sized datasets. Our paper addresses both concerns by introducing generalized KRLS (gKRLS). We note that KRLS can be re-formulated as a hierarchical model, thereby allowing easy inference and modular model construction where KRLS can be used alongside random effects, splines, and unregularized fixed effects. Computationally, we also implement random sketching to dramatically accelerate estimation while incurring a limited penalty in estimation quality. We demonstrate that gKRLS can be fit on datasets with tens of thousands of observations in under one minute. Further, state-of-the-art techniques that require fitting the model over a dozen times (e.g., meta-learners) can be estimated quickly.


Introduction
Designing models that can correctly estimate complex interactions between covariates or non-linear effects of continuous predictors is an important but challenging problem. These models are increasingly popular not only as a robustness test to check the impact of functional form assumptions, but also as key constituent components to a variety of increasingly popular machine learning algorithms designed to estimate causal effects.
One popular method in political science to estimate a highly flexible model while maintaining good out-of-sample predictive performance is "Kernel Regularized Least Squares" (KRLS; Hainmueller and Hazlett 2014), also known as "kernel ridge regression" (e.g., Yang, Pilanci and Wainwright 2017). This method provides a flexible approach to estimate a possibly complex underlying function and can easily capture interactions between covariates or non-linear effects of certain predictors. It is simple to use as it only requires the researcher to provide a matrix of relevant predictors. Hainmueller and Hazlett (2014) describe other attractive features. However, it has two noticeable drawbacks that have likely limited its more widespread adoption. First, traditional approaches to estimating KRLS are rather inflexible as they require that all variables be included in a single kernel and regularized.1 This prevents common extensions such as (unregularized) fixed effects, random effects, or multiple kernels for different sets of predictors from being included; further, it is challenging to estimate models with non-Gaussian outcomes (e.g., binary, ordered, or categorical outcomes) and difficult to implement alternative standard errors (e.g., cluster-robust standard errors). In many applied settings, researchers desire a "modular" approach like that found when using hierarchical models where different variables can be included in the model in different ways based on the researcher's theoretical beliefs.
Second, and equally importantly, traditional versions of KRLS are highly computationally expensive: the cost of estimation is dominated by the cube of the number of observations (Yang, Pilanci and Wainwright 2017; Mohanty and Shaffer 2019). Without additional modification, it is difficult to fit these models with more than 10,000 observations, and even this may take many hours.

1 KSPM (Schramm et al., 2020) is an exception, although it has some limitations discussed in Appendix F.
We introduce "generalized KRLS" (hereafter gKRLS) to tackle these issues. Our solution has two parts. First, existing literature shows that (regular) KRLS can be reformulated as a carefully chosen hierarchical model (e.g., Liu, Lin and Ghosh 2007; Zhang, Dai and Jordan 2011). Theoretically, this reformulation facilitates a modular model-building strategy that can contain multiple kernels in addition to random effects, other smooth terms, and unpenalized fixed effects. However, rich modular models can considerably complicate estimation with existing approaches given the need to tune multiple regularization parameters. Fortunately, the hierarchical perspective also facilitates estimation techniques for fast tuning of the regularization parameters without expensive grid searches or cross-validation. These techniques also immediately extend to non-Gaussian outcomes and provide well-calibrated standard errors on key quantities of interest (Wood 2017). This reformulation alone, however, is insufficient to make gKRLS practical on large datasets given the cubic cost noted previously. We address this by using the popular "sub-sampling sketching" approach, which reduces the cost of estimation by building the kernel from a random sample of the original dataset (Drineas and Mahoney 2005; Yang, Pilanci and Wainwright 2017).
Our paper proceeds as follows. Sections 2 and 3 describe gKRLS. Section 4 provides two simulations to illustrate its advantages; first, we examine the scalability of gKRLS. 2 While maintaining accurate estimates, gKRLS takes around six seconds for a dataset with 10,000 observations and two covariates and around two minutes with 100,000 observations without any parallelization and only 8GB of RAM. This compares with hours needed for existing approaches. Our second simulation shows the importance of having a flexible modular approach. We consider a data generating process that includes fixed effects for a group membership outside of the kernel. Traditional KRLS includes the fixed effects in the kernel which assumes the effect of all covariates can vary by group. We find this model is too flexible for modestly-sized datasets and performance can be improved by including the fixed effects as unregularized terms "outside" the kernel.
Finally, we conduct two empirical analyses. Section 5 reanalyzes Newman (2016)'s study of gender and beliefs in meritocracy. Building on theory from the original paper, we use the modular nature of gKRLS to estimate a logistic regression that includes three hierarchical terms (random effects, splines, and KRLS) as well as unpenalized covariates (fixed effects).
Estimation takes around ten minutes with 8GB of RAM. Section 6 explores Gulzar, Haas and Pasquale (2020)'s study of the implications of political affirmative action for development in India. This is a larger dataset (around 30,000 observations), and our preferred model includes many unpenalized covariates and a single kernel. To address regularization bias, we also use gKRLS in algorithms that require fitting gKRLS between 10 and 15 times (e.g., double/debiased machine learning; Chernozhukov et al. 2018). Estimation takes a few minutes.

Generalizing KRLS
There are many different approaches to presenting KRLS (Hainmueller and Hazlett 2014).
We focus on the penalized regression presentation to build connections with hierarchical models. In this view, KRLS creates covariates that measure the similarity of observations (e.g., the transformed distance between covariate vectors) while penalizing the estimated coefficients to encourage estimation of conditional expectation functions that are relatively smooth and penalize excessively "wiggly" functions where the outcome would vary dramatically given small changes in the predictors (Hainmueller and Hazlett, 2014). This is a common goal for smoothing methods, and different underlying models lead to different design matrices and penalty terms (Wood 2017, Ch. 5). KRLS is especially useful when there are multiple variables that could interact in complex and possibly non-linear ways as it does not require the explicit formulation of which interactions or non-linearities may be relevant. This differs from sparsity-based frameworks such as the LASSO that require creating a set of possibly relevant interactions and bases before deciding which ones are relevant.
Formally, assume the dataset has N observations with covariate vectors w_i. We assume that {w_i}_{i=1}^N has been standardized (as our software does automatically) to ensure different covariates are comparable in scale. This prevents arbitrary changes (e.g., changing units from meters to feet) from affecting the analysis. Hainmueller and Hazlett (2014) center each covariate to have mean zero and variance one. We use the Mahalanobis distance to also address potentially correlated input covariates; we thus assume that a mean-centering and whitening transformation has been applied to {w_i}_{i=1}^N such that the covariance of the stacked w_i equals the identity matrix.
Given this standardized data, we create an N × N kernel matrix K that contains the similarity between each pair of observations. We use the popular Gaussian kernel, but our method can be used with other kernels. Equation 1 defines K via a transformation of the squared Euclidean distance between observations, scaled by the kernel bandwidth, which we fix to P, the number of covariates in w_i, following Hainmueller and Hazlett (2014).3

K_ij = k(w_i, w_j) = exp( -||w_i - w_j||^2 / P )    (1)

3 If the design matrix of stacked w_i is not full rank, we use its rank instead of P and use a generalized inverse in the whitening transformation.

In traditional KRLS, K becomes the design matrix in a least-squares problem with parameters α to predict the outcome y_i with error variance σ^2. To prevent overfitting, KRLS includes a term that penalizes the wiggliness of the estimated function, where a parameter λ determines the strength of the penalty. As λ grows very large, all observations are predicted the same value (i.e., there is no effect of any covariate on the outcome). As λ approaches zero, the function becomes increasingly wiggly, and predicted values might change dramatically for small changes in the covariates. Equation 2 presents the KRLS objective, where k_i denotes row i of the kernel K; maximizing Equation 2 for a fixed λ gives coefficient estimates α̂_λ (the subscript denoting the dependence on λ) that are identical to those of Hainmueller and Hazlett (2014).

α̂_λ = argmax_α  -(1/(2σ^2)) Σ_{i=1}^N (y_i - k_i'α)^2 - (λ/(2σ^2)) α'Kα    (2)

We start by viewing the problem from a more Bayesian perspective and choose a Gaussian prior for α that implies a posterior mode on α, conditional on σ^2 and λ, identical to the penalized objective (see also Appendix 2 of Hainmueller and Hazlett 2014).
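As a concrete illustration of the standardization and kernel construction described above, the following is a minimal Python sketch (our software performs these steps internally; the function names are ours for illustration):

```python
import numpy as np

def whiten(W):
    """Mean-center and whiten covariates so their covariance equals the
    identity matrix (a Mahalanobis-style standardization)."""
    Wc = W - W.mean(axis=0)
    cov = np.cov(Wc, rowvar=False)
    # inverse matrix square root via an eigendecomposition
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return Wc @ inv_sqrt

def gaussian_kernel(W, bandwidth=None):
    """K_ij = exp(-||w_i - w_j||^2 / P), with the bandwidth fixed to the
    number of covariates P by default."""
    if bandwidth is None:
        bandwidth = W.shape[1]
    sq = ((W[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / bandwidth)

rng = np.random.default_rng(0)
W = whiten(rng.normal(size=(100, 2)))
K = gaussian_kernel(W)   # 100 x 100, symmetric, ones on the diagonal
```

Note that each observation has similarity one with itself, so K has ones on its diagonal.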
This prior, sometimes known as the "Silverman g-prior", can also be derived from an independent and identically distributed Gaussian prior on each of the coefficients from the underlying feature space associated with the kernel K (Zhang, Dai and Jordan 2011).
Thus, KRLS can be viewed as a traditional random effects model (or ridge regression) on the feature space associated with K. Equation 3 displays this generative view of KRLS, where K^- denotes the pseudo-inverse of K in the case of a non-invertible kernel.

y_i | α ~ N( k_i'α, σ^2 ),   α ~ N( 0, (σ^2/λ) K^- )    (3)
A key advantage of this Bayesian view is that KRLS becomes simply a hierarchical model with a particular choice of design and prior. This leads to the idea of "modular" model construction where different priors are used for different components of the model.
For example, it is common to have unpenalized terms (e.g., "fixed effects") alongside the regularized terms. Alternatively, theory may call for the inclusion of more traditional random effects for a geographic unit such as county. We define generalized KRLS, therefore, as a hierarchical model with at least one KRLS term on some covariates. Equation 4 presents the general model: fixed effects (β) have design x_i for each observation, and there are J penalized terms, indexed by j ∈ {1, ..., J}, with parameters α_j and designs z_ij.

y_i | β, α_1, ..., α_J ~ N( x_i'β + Σ_{j=1}^J z_ij'α_j, σ^2 ),   α_j ~ N( 0, (σ^2/λ_j) S_j^- )    (4)

As is standard for hierarchical models, each α_j has a multivariate normal prior with precision proportional to S_j (where S_j^- denotes its pseudo-inverse). Each hierarchical term j has its own parameter λ_j that governs the amount of regularization.
Specific choices of design and prior give well-known models. If z_ij is a vector of group-membership indicators and S_j is an identity matrix, this is a traditional random intercept. If z_ij = k_i and S_j = K, we recover KRLS from Equation 3. S_λ represents the block-diagonal concatenation of each penalty term λ_j S_j.
A key difficulty in using (generalized) KRLS is choosing the appropriate amount of regularization, i.e., calibrating {λ_1, λ_2, ..., λ_J}. In the case of a single KRLS term (i.e., J = 1) and a Gaussian likelihood, Hainmueller and Hazlett (2014) use an efficient method where the leave-one-out cross-validated error can be computed as a function of λ using only a single decomposition of the kernel K. One could employ K-fold cross-validation to tune λ if a non-Gaussian likelihood were used (Sonnet and Hazlett, 2018).
However, existing strategies encounter considerable challenges when there are multiple hierarchical terms (J > 1). Since Hainmueller and Hazlett (2014)'s method may not be available, a popular alternative (grid searches across possible values of each λ_j to minimize some criterion, e.g., cross-validated error) is very costly even for modest J.
Our hierarchical and Bayesian perspective provides a different strategy for tuning the λ_j for any choice of J: restricted maximum likelihood (REML).4 This approach observes that β has a flat (improper) prior and considers the marginal likelihood after integrating out β and all α_j. A REML strategy estimates λ and σ^2 by maximizing the log of this marginal likelihood (Equation 6); this is also referred to as an empirical Bayes approach (Wood 2017, p. 263). In Equation 6, |S_λ|_+ denotes the product of the non-zero eigenvalues of S_λ and M_p is the dimension of its null space. After finding λ̂ and σ̂^2, point estimates for β and α are obtained by plugging the estimated λ̂ into Equation 5. Liu, Lin and Ghosh (2007) use the REML approach for a single KRLS hierarchical term (i.e., J = 1); we push that intuition further by noting that KRLS can be part of a general-J approach to hierarchical and generalized additive models.

4 Wood (2017) discusses other criteria, e.g., generalized cross-validation, that could be employed.
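To illustrate the marginal-likelihood idea in the simplest case, the following Python sketch (illustrative only; our software relies on mgcv) profiles σ^2 out of the log marginal likelihood for a single Gaussian KRLS term with no fixed effects, where the ML and REML criteria coincide, and then selects λ on a log-spaced grid. The function name and grid are assumptions for illustration:

```python
import numpy as np

def profiled_loglik(y, K, lam):
    """Log marginal likelihood of y ~ N(0, sigma^2 (I + K / lam)), which
    follows from integrating alpha out of the KRLS model, with sigma^2
    profiled out at its maximizing value."""
    d, U = np.linalg.eigh(K)
    d = np.clip(d, 0.0, None)        # guard against tiny negative eigenvalues
    yt = U.T @ y                     # rotate y into the eigenbasis of K
    w = 1.0 + d / lam                # eigenvalues of I + K / lam
    n = len(y)
    sigma2_hat = np.mean(yt ** 2 / w)
    return -0.5 * (n * np.log(2 * np.pi * sigma2_hat) + np.log(w).sum() + n)

# toy data: smooth signal plus noise, then a grid search over lambda
rng = np.random.default_rng(0)
w_cov = np.sort(rng.uniform(-2, 2, size=60))
K = np.exp(-(w_cov[:, None] - w_cov[None, :]) ** 2)
y = np.sin(2 * w_cov) + rng.normal(scale=0.2, size=60)
grid = 10.0 ** np.linspace(-3, 3, 25)
lam_hat = grid[np.argmax([profiled_loglik(y, K, lam) for lam in grid])]
```

The single eigendecomposition of K makes evaluating the criterion cheap across the whole grid, mirroring the efficiency argument in the text.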
In practical terms, Wood (2017) summarizes the extensive research into numerically stable and efficient approaches to optimizing Equation 6 and describes well-established and high-quality software (mgcv in R). For very large problems (in terms of the number of observations or parameters), further acceleration may be needed. Appendix A.2 discusses a set of less stable but faster estimation techniques implemented in the same software.
The final piece of inference is quantifying uncertainty. The Bayesian perspective on hierarchical models suggests using the inverse of the Hessian of the log-posterior on {β, α} as the estimated variance matrix (Wood 2017).

Extensions to Generalized KRLS
The above presentation focused on a Gaussian outcome with arbitrary J and homoskedastic errors. We discuss four important extensions that our hierarchical perspective facilitates. First, the preceding exposition is easily generalized to non-Gaussian likelihoods: one changes the likelihood in Equation 4 (e.g., to a binomial likelihood with a logistic link) and adjusts the objective in Equation 6. This is justified using a Laplace approximation for evaluating the requisite integral; β̂_λ and α̂_λ are obtained using penalized iteratively re-weighted least squares (Wood 2017, Ch. 3).
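A bare-bones Python sketch of penalized iteratively re-weighted least squares for a logistic outcome (an illustration of the general idea, not mgcv's stabilized implementation; the toy data and function name are ours):

```python
import numpy as np

def pirls_logistic(y, Z, S, lam, max_iter=100, tol=1e-8):
    """Penalized IRLS: iteratively solve weighted least-squares problems to
    maximize the Bernoulli log-likelihood minus (lam / 2) * a' S a."""
    a = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        eta = Z @ a
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # IRLS weights
        z = eta + (y - mu) / w                      # working response
        a_new = np.linalg.solve(Z.T @ (w[:, None] * Z) + lam * S,
                                Z.T @ (w * z))
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a

# toy example: ridge-penalized logistic regression (intercept + slope)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
Z = np.c_[np.ones(500), x]
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
y = rng.binomial(1, p)
coef = pirls_logistic(y, Z, np.eye(2), lam=1.0)
```

With a modest penalty and 500 observations, the fitted coefficients land near the generating values (0.5, 1.0).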
Second, the hierarchical perspective also justifies robust and/or clustered standard errors. Appendix A.1 provides a detailed justification of the typical "sandwich" formula with slight modifications. We also show that existing standard errors for KRLS (Hainmueller and Hazlett 2014) differ from those derived using the Bayesian perspective discussed above. A simple example suggests that using the Bayesian perspective results in considerably better coverage.
Third, a key use for gKRLS is in machine learning techniques such as stacking or double/debiased machine learning. We provide a software integration of gKRLS (and mgcv) into popular packages for both methods; Appendix F provides details.
Finally, we provide new software for easily calculating marginal effects and predicted outcomes for a variety of likelihoods (e.g., Gaussian, binomial, multinomial, etc.). Among other quantities, this allows users to calculate the "average marginal effect" (i.e., the partial derivative of the prediction with respect to a specific covariate averaged across all observations in the data; Hainmueller and Hazlett 2014). Appendix A.3 provides details.
We are able to properly incorporate uncertainty for both fixed and random effects for these quantities.
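Our software computes these quantities while propagating uncertainty; as a bare-bones illustration of the average marginal effect itself, here is a central-finite-difference Python sketch (the function name and toy prediction function are ours for illustration):

```python
import numpy as np

def average_marginal_effect(predict, W, j, h=1e-5):
    """Approximate the average marginal effect of covariate j: the partial
    derivative of the prediction with respect to W[:, j], averaged over all
    observations, via central finite differences."""
    hi, lo = W.copy(), W.copy()
    hi[:, j] += h
    lo[:, j] -= h
    return np.mean((predict(hi) - predict(lo)) / (2.0 * h))

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 2))
predict = lambda M: M[:, 0] ** 2 + M[:, 1]     # toy prediction function
ame0 = average_marginal_effect(predict, W, 0)  # ~ mean of 2 * W[:, 0]
ame1 = average_marginal_effect(predict, W, 1)  # ~ 1.0
```

For non-Gaussian likelihoods, `predict` would return predictions on the response scale (e.g., probabilities), and the same averaging applies.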

Improving Scalability of Generalized KRLS
The optimism of the above discussion, however, elides a critical limitation of gKRLS as currently proposed. We focus on the traditional KRLS case (J = 1, no fixed effects) to illustrate the problem. Recall that the model has N observations but requires the estimation of N coefficients. Estimation is extremely time- and memory-intensive as the computational cost is roughly cubic in the number of observations and requires storing a possibly huge N × N matrix (Hainmueller and Hazlett 2014; Yang, Pilanci and Wainwright 2017). While some work in political science has focused on reducing this cost, the fundamental problem remains and, in practice, limits applicability to around 10,000 observations with 8GB of memory (Mohanty and Shaffer 2019), with estimation possibly taking hours, as Section 4 shows. Thus, using gKRLS without modification is simply impractical for most applied settings. Further, if one needs to fit the model repeatedly (e.g., for cross-validation), it is prohibitively expensive.
Fortunately, there is a large literature on how to approximately estimate kernel methods on large datasets. We employ "random sketching", focusing on "sub-sampling sketching" or "uniform sampling" (e.g., Drineas and Mahoney 2005; Yang, Pilanci and Wainwright 2017; Lee and Ng 2020) to dramatically accelerate the estimation; other methods could be explored in future research (e.g., random features; Rahimi and Recht 2007).6 The sub-sampling sketching method takes a random sample of M data points and uses them to build the kernel, reducing the size of the design to N × M. If M is much smaller than N, this reduces the cost of estimation considerably. Formally, define the M sampled observations as w*_m, m ∈ {1, ..., M}. If k(w_i, w_m) is the function used to evaluate the kernel (e.g., Equation 1), define the sketched kernel K* as an N × M matrix whose (i, m)-th element is K*_im = k(w_i, w*_m). Equivalently, one can define K* by multiplying K by an N × M sketching matrix whose columns indicate the sampled observations.
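A minimal Python sketch of sub-sampling sketching (illustrative only; our software handles this internally), including the default sketching dimension M = δN^(1/3) discussed below:

```python
import numpy as np

def sketched_kernel(W, M, rng=None):
    """Sub-sampling sketch: build the N x M kernel K* against M rows of W
    sampled uniformly without replacement (bandwidth fixed to P)."""
    rng = np.random.default_rng(rng)
    P = W.shape[1]
    idx = rng.choice(W.shape[0], size=M, replace=False)
    sq = ((W[:, None, :] - W[idx][None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / P), idx

N = 1000
delta = 5                                  # default sketching multiplier
M = int(np.ceil(delta * N ** (1 / 3)))     # M = delta * N^(1/3) -> 50
W = np.random.default_rng(0).normal(size=(N, 2))
K_star, idx = sketched_kernel(W, M, rng=0)
```

Estimation then proceeds as before with K* (N × M) replacing K (N × N), so only M coefficients are estimated.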

Calibrating the Sketched Kernel
We note two key points to consider when using sub-sampling sketching. First, the sketching dimension M clearly affects performance. As M increases, the model will likely perform better (see Appendix C.5). Inspired by the literature on the Laplace approximation for standard hierarchical models (e.g., Shun and McCullagh 1995), the default setting in our software sets M = δN^(1/3), i.e., M grows at a rate of N^(1/3) times a constant sketching multiplier δ; this can be manually increased by the researcher as appropriate. We show that δ = 5 often provides good performance, although one could use a larger multiplier (e.g., δ = 15); Section 4 shows both can be fit quite rapidly.
Even if M is relatively large, the sub-sampling sketching method may sometimes fail to provide a good representation of the original data (Yang, Pilanci and Wainwright 2017).
We also find some evidence of this when the kernel is complex (see Appendix C.5). Lee and Ng (2020) review the literature on how to improve these methods; future research could explore these techniques.
Second, sub-sampling sketching will not generate identical estimates if the model is re-estimated with a different sketching matrix. While this randomness is common to some statistical methods (e.g., random forests), researchers should carefully examine the sensitivity of their results to the specific sketching matrix chosen. Exactly characterizing the impact of this variability is outside the scope of this paper, although it may often be relatively small, especially when δ = 15. Appendix C.5 and Appendix E examine this for our simulations and applied examples. Corroborating the above discussion about potential limitations of the sub-sampling sketching method, we find that when the kernel is relatively simple, there is a high degree of stability. When the kernel is complex, a larger multiplier may be needed to ensure stable estimates. Assuming it is computationally feasible, a researcher might fit the model multiple times with different sketching matrices to show robustness. If the quantity of interest varies considerably, we suggest increasing the sketching dimension.

Evaluating the Performance of Generalized KRLS
We evaluate the scalability of gKRLS when performing the tasks used in standard applications: estimating the model, calculating average marginal effects, and generating predictions on a new dataset of the same size as the training data. We compare gKRLS against popular existing implementations: KRLS (Hainmueller and Hazlett 2014) and bigKRLS (Mohanty and Shaffer 2019), where we examine truncating the eigenvalues to speed estimation ("bigKRLS (T)", using a truncation threshold of 0.001) and not doing so ("bigKRLS (NT)").7 Finally, to examine the role of the sketching multiplier, we fit gKRLS with δ ∈ {5, 15} ["gKRLS (5)" and "gKRLS (15)", respectively]. All numerical results in this paper are run on a single core with 8GB of RAM. We explore a range of sample sizes from 100 to 1,000,000, spaced evenly on the log-10 scale. For this initial examination, we rely on a generative model from Hainmueller and Hazlett (2014) ("Three Hills, Three Valleys"). We generate fifty datasets and calculate the average estimation time and accuracy across the simulations. We stop estimating methods once costs increase dramatically to limit computational burden. Figure 1 reports the estimation time: KRLS and bigKRLS (with truncation) can be estimated quickly when the number of observations is relatively small, but estimation time increases rapidly as the sample size grows (at around the rate of N^3). When there are more than 10,000 observations, even bigKRLS would take hours to estimate. By contrast, gKRLS is at least an order of magnitude faster.
The right panel of Figure 1 illustrates this more starkly by reporting the logarithm of time on the vertical axis. Even with the large multiplier ("gKRLS (15)"), gKRLS takes a few minutes for 100,000 observations. Appendix B.2 calculates an empirical estimate of the computational complexity of gKRLS and shows it is substantially lower than traditional methods. Even with one million observations, gKRLS (δ = 5) takes under one hour. Appendix A.2 discusses an alternative estimation technique (bam) that decreases this time to around three minutes with no decline in performance.

Kernels and Fixed Effects
Traditional KRLS usually requires that all covariates be included in a single kernel. This has the benefit of allowing the marginal effect of each variable to depend on all others. However, this can be too flexible and require enormous amounts of data to reliably learn the underlying relationship. This problem is likely especially severe when considering fixed effects for group membership, as allowing all marginal effects to vary by group is an extremely flexible specification. If one has theoretical reason to believe parts of the underlying model are additive (e.g., including fixed effects to address [additive] unobserved confounding), then including indicators for group membership outside the kernel (i.e., in β) will likely improve performance for modestly sized datasets. Since the group indicators are unregularized, this ensures that the usual "within-group" and "de-meaning" interpretation associated with fixed effects holds; this would not occur if they were included in the kernel.
We use a simulation environment that mimics traditional explorations of fixed effects (e.g., Bell and Jones 2015) but where the functional form of two continuous covariates is possibly non-linear. One of these covariates (x i,1 ) is correlated with the fixed effects and thus its estimation should be more challenging as the correlation increases. The data generating process is shown below.
• Assume there are J groups, each with T observations.
• Assign each observation i to a group j at random.
• Generate the covariates for each observation: first, draw a fixed effect µ_j and a group-level mean x̄_j for each group. ρ controls the amount of correlation between x_i,1 and µ_j; larger ρ implies "random effects" should perform less well.
• Generate the outcome for each specification m ∈ {linear, nonlinear}.

In our analysis, we set ρ ∈ {0, 0.3, 0.6, 0.9} to vary the degree of correlation between x_i,1 and µ_j.8 We assume a reasonable number of groups (J = 50) and ten observations per group (T = 10). We compare the following models: (linear) OLS, fixed effects, and random effects. We also examine two kernel methods: bigKRLS (without truncation; with all variables in the kernel) and gKRLS (with a multiplier of five). For gKRLS, we use a kernel on x_i,1 and x_i,2 and include indicators for group membership outside the kernel as unregularized fixed effects (β). We run each simulation 1,000 times. We expect that all kernel methods should incur some penalty versus linear fixed effects when the true data generating process is linear.

First considering the linear data generating process (left panel), the traditional estimators (OLS, random effects, and fixed effects) behave as expected: OLS and random effects perform increasingly poorly as ρ increases. In the non-linear data generating process, the same pattern holds, although all three linear models perform less well as they cannot capture the true underlying non-linearity.
When we compare the kernel methods used in the linear data generating process, both perform worse than fixed effects-i.e. a correctly specified model-and neither method is affected much by ρ. However, gKRLS consistently outperforms bigKRLS by a considerable margin. In the non-linear case, we see that both kernel methods perform well versus the linear alternatives, although gKRLS still has a considerable and constant advantage over bigKRLS. Appendix C.4 shows that including the two covariates as fixed effects (β) in addition to their inclusion in the kernel improves performance considerably on the linear data generating process but incurs some penalty for the non-linear case.
Appendix C provides additional simulations. Appendix C.1 considers alternative metrics for assessing the performance of the methods, e.g., out-of-sample predictive accuracy. The results show a similar story: gKRLS is either close to bigKRLS or beats it by a considerable margin. Appendix C.2 also explores performance when estimating the effect of the second covariate (x_i,2): gKRLS outperforms bigKRLS. Appendix C.3 considers an increasing number of observations per group (T). As T grows, both kernel methods improve, although gKRLS continues to perform better even when T = 50. To better understand why gKRLS improves upon bigKRLS, Appendix C.4 shows that the improvement can be attributed solely to including the fixed effects outside the kernel, not to additional changes such as how the smoothing parameter is selected, the use of Mahalanobis distance when creating the kernel, or sub-sampling sketching.
Finally, Appendix C.5 explores the impact of sketching in this more complex case. It estimates models with different sketching matrices for a fixed multiplier δ to understand the impact on the RMSE relative to the unsketched estimates. It finds that sketching incurs some penalty on the accuracy of the estimated average marginal effect, although this penalty declines as the sketching multiplier increases. When fixed effects are included in the kernel, this decline is considerably slower. When fixed effects are not included in the kernel, virtually any sketching multiplier recovers estimates nearly identical in accuracy to the corresponding unsketched procedure.

Generalized KRLS for Observational Data
Our first empirical application examines an observational study by Newman (2016). The paper focuses on the contextual effects of gender-based earnings inequality on women's belief in meritocracy. The key theoretical discussion concerns how gender inequality in earnings in the local area where a woman lives affects her rejection of a belief in meritocracy (e.g., "hard work and determination are no guarantee of success for most people").
gKRLS's modularity allows us to more robustly test Newman (2016)'s argument. Our first hierarchical term is a kernel including all covariates to capture possible interactions or non-linearities omitted by the original (additive) model, thereby improving the robustness of the reported results. We also include a random intercept for county, following Newman (2016), to address the nested nature of the data.
However, Section 4 illustrated that relying exclusively on gKRLS given limited data may be undesirable as it could be too flexible. An additional risk of relying exclusively on KRLS is that if the estimated λ were very large, that would effectively exclude all covariates and mimic an intercept-only model. A more modular approach uses a KRLS term to flexibly estimate interactions or non-linear effects while additionally including "primary" covariates of interest.
We include additional terms following Newman (2016). First, we include all controls in the fixed effects (β). Second, we perform a more robust examination of the effect of earnings inequality. Rather than assuming the relationship is quadratic, we additively include a thin plate regression spline (Wood, 2017, p. 216) on earnings inequality. This does not impose a specific functional form and allows the data to reveal whether the relationship is quadratic or has some other shape. This expanded model ensures that we include a specification that is comparable to Newman (2016) while also allowing for extra interactions using KRLS.
Overall, we estimate a logistic regression with four parts: unpenalized fixed effects, a random intercept for county, a spline on earnings inequality, and a kernel on all covariates. The results partially support Newman (2016): the point estimates from gKRLS show a non-linear inverted "u-shaped" relationship that is similar to the original results ("Newman"), although the curve is noticeably flatter for extreme values of earnings inequality.

Generalized KRLS with Machine Learning
Our second empirical replication considers a geographic regression discontinuity analysis in Gulzar, Haas and Pasquale (2020). They focus on the effects of improving political representation using quotas on the economic welfare of various groups in society. They examine how electoral quotas for members of Scheduled Tribes affect the economic welfare of members of that group, members of a different historically disadvantaged group not affected by the quota (members of Scheduled Castes), members in neither group ("Non-Minorities"), as well as the total population.
We focus on their analysis of three economic outcome variables from the National Rural Employment Guarantee Scheme, which offers one hundred days of employment to rural households (Gulzar, Haas and Pasquale, 2020, p. 1231). The outcomes we consider are "(log) jobcards" (the total number of documents issued to prospective workers under the program), "(log) households" (the number of households who participated in the program), and "(log) workdays" (the total number of days worked by individuals in the program).
The treatment is whether a village is part of a scheduled area that imposes an electoral quota. Across the three outcomes, the key findings from Gulzar, Haas and Pasquale (2020) are that (i) there is no effect on the total economic welfare, (ii) the targeted minorities (Scheduled Tribes) see increases in economic welfare; (iii) the non-targeted minority groups (Scheduled Castes) do not see any significant changes; and (iv) non-minority groups see decreases in economic outcomes.
gKRLS can improve the original analysis in two ways. First, Gulzar, Haas and Pasquale (2020) include the interaction of fourth-order polynomials on latitude and longitude following previous work on geographic regression discontinuity designs (replicated as "GHP" in Figure 5). gKRLS enables a more flexible solution, even on this larger dataset (32,461 observations), by using a kernel on the geographic coordinates (in this specification only, we rely on raw Euclidean distance without standardization) while including the treatment and other covariates linearly as unpenalized terms (β). We denote this model as "gKRLS (Geog.)". Second, Gulzar, Haas and Pasquale (2020, p. 1238) report some imbalance on certain pre-treatment covariates; they include controls additively and linearly to improve the robustness of their results. Including these variables (and treatment) in a KRLS term provides additional robustness. We use "gKRLS (All)" for this model that includes the KRLS term (J = 1) as well as all variables linearly in the fixed effects (β) to ensure their inclusion. In both models, we use cluster-robust standard errors following the original specification.
The use of penalized terms, however, raises a concern about regularization bias in the estimated treatment effect; we address this using double/debiased machine learning (DML), which removes such bias (Chernozhukov et al. 2018). We use the specification from "gKRLS (All)" (after removing the treatment indicator) for our machine learning model. Estimation with five folds requires fitting gKRLS ten or fifteen times depending on whether one uses the partially linear model ("DML-PLR") or the dedicated algorithm for estimating the ATE ("DML-ATE"), respectively. Both procedures estimate conditional expectation functions with Gaussian outcomes, while the latter (DML-ATE) also estimates a propensity score for being treated using a binomial outcome with a logistic link. Either procedure takes only a few minutes to estimate. To address clustering within the data, we use stratified sampling to create the folds for DML and produce the standard error on the treatment effect using an analogue to the usual cluster-robust estimator (Chiang et al. 2022). Appendix E examines variability across different sketching matrices; it finds relatively low variability of the point estimates relative to the magnitude of the estimated standard errors.
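The cross-fitted DML-PLR logic can be sketched as follows. This is an illustrative Python version in which scikit-learn's `KernelRidge` stands in for gKRLS; the fold structure and residual-on-residual regression follow Chernozhukov et al. (2018), but the function name `dml_plr` and the simulated data are ours.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

def dml_plr(y, d, X, folds=5):
    """Cross-fitted partially linear DML: residualize y and d on X out-of-fold,
    then regress the outcome residuals on the treatment residuals."""
    y_res, d_res = np.empty_like(y), np.empty_like(d)
    for train, test in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        # Out-of-fold predictions of E[y | X] and E[d | X].
        y_res[test] = y[test] - KernelRidge(kernel="rbf").fit(X[train], y[train]).predict(X[test])
        d_res[test] = d[test] - KernelRidge(kernel="rbf").fit(X[train], d[train]).predict(X[test])
    # Final OLS step on the residualized variables.
    return np.sum(d_res * y_res) / np.sum(d_res ** 2)

# Simulated check: the true treatment effect is 2.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
d = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
y = 2.0 * d + np.cos(X[:, 1]) + rng.normal(scale=0.5, size=n)
print(dml_plr(y, d, X))  # close to the true effect of 2
```

The same skeleton extends to DML-ATE by also fitting a propensity model for `d` and using the doubly robust score in place of the final regression.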
Appendix E.2 uses gKRLS with a machine learning algorithm to estimate heterogeneous treatment effects ("R-learner"; Nie and Wager 2021). Even though this method requires fitting gKRLS over a dozen times (with both Gaussian and binomial outcomes), estimation takes only a few minutes. We find that one state (Himachal Pradesh) has noticeably larger treatment effects than other states.

Conclusion
Our paper generalized KRLS in two meaningful directions by drawing together different existing literatures. First, we recast the original model into the modular framework of hierarchical and generalized additive models where adding a kernel on some variables can be thought of as simply adding one additional hierarchical term (i.e., increasing J by one).
This allows researchers using gKRLS to modularly build their model by including variables in different ways based on their substantive knowledge. For models with multiple hierarchical terms and/or non-Gaussian outcomes, a hierarchical perspective on KRLS allows for easy tuning of the regularization parameters, efficient estimation, and well-calibrated standard errors. Empirically, we show that in a stylized example with additive fixed effects, thinking carefully about how to include different terms in the model (e.g., unregularized fixed effects versus including them in the kernel) can be critically important to performance. The second generalization employed sub-sampling sketching to allow gKRLS to be easily scalable to most datasets encountered in social science. By breaking the requirement that the cost of the model depends on the cube of the number of observations, sub-sampling sketching allows the model to be estimated very quickly on tens or hundreds of thousands of observations. Even for methods that require repeated estimation of gKRLS (e.g., double/debiased machine learning), models can be estimated with limited computational cost.
Our paper and accompanying software therefore allow KRLS to become a more widely used part of the applied researcher's toolkit.

Supplementary Material for "Generalized Kernel Regularized Least Squares" A Additional Theoretical Results
This section contains additional theoretical results. First, we provide a more detailed exposition of the Gaussian model. It unifies our presentation in the main text with more classical presentations of hierarchical/multilevel models (e.g., Hazlett and Wainstein 2022).
It also allows us to derive our justification for robust/clustered standard errors as well as providing new standard errors for (traditional) KRLS. Second, we discuss the alternative estimation methods available in mgcv (gam vs bam). Finally, we explain how our software estimates average marginal effects.

A.1 Alternative Standard Errors
We consider the following Gaussian outcome model for our discussion of standard errors, following Hazlett and Wainstein (2022). Extensions to non-Gaussian outcomes can be made using standard arguments (see Wood 2006). We do not assume any special structure on Ω and are more general than the main text insofar as Σ need not be diagonal. We assume that Ω and Σ are known and full rank. A rank-deficient Ω can be addressed via an eigendecomposition and adjusting the fixed effect design matrix X. Setting Σ = σ²I and Ω = (1/σ²)S_λ recovers the model in the main text (Equation 4b).
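Concretely, one standard hierarchical form for this model is the following. This is a sketch following Wood (2006); treating the penalty matrix Ω as the prior precision of α is an assumption consistent with the Σ = σ²I, Ω = (1/σ²)S_λ special case noted above.

```latex
y = X\beta + Z\alpha + \varepsilon, \qquad
\varepsilon \sim \mathcal{N}(0, \Sigma), \qquad
\alpha \sim \mathcal{N}\!\left(0, \Omega^{-1}\right)
```

Under this formulation, penalizing α by the quadratic form α⊤Ωα is equivalent to placing the Gaussian prior above on α.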
The penalized maximum likelihood estimator is shown below, mirroring Equation 5.
Despite our different presentation, it can be easily verified (see also Wood 2017, p. 80-81) that β̂ and α̂ are identical to the standard multilevel estimators (e.g., Hazlett and Wainstein 2022, p. 51). In terms of appropriate estimators for the variance of β̂ and α̂, Wood (2006) suggests two options: a "Bayesian" estimator that is the inverse of the Hessian of the log-posterior, and a frequentist estimator. We consider each in detail. First, the Bayesian estimator V_B is shown below. This is justified based on the posterior distribution of {β, α} given y if Σ and Ω are known (Wood 2006).
Note that the upper-left block, corresponding to the estimated variance of β̂ and denoted V_B^β, is identical to the "standard" multilevel model estimator for β̂ (see Wood 2017, p. 80 and Hazlett and Wainstein 2022, p. 52). Second, Wood (2006) suggests a "frequentist" estimator where, for some choice of variance of y, denoted Var(y), the estimator V_F is shown below.
There are different possible choices for Var(y). Following Wood (2006), the model-based choice could be swapped with the usual "meat" of the squared residuals, possibly scaled by a finite-sample correction (Cameron and Miller 2015).
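For reference, a sketch of the standard forms of these quantities (following Wood 2006); the compact notation C and D is introduced here for brevity and does not appear elsewhere in the text:

```latex
\begin{pmatrix} \hat\beta \\ \hat\alpha \end{pmatrix}
  = \left( C^\top \Sigma^{-1} C + D \right)^{-1} C^\top \Sigma^{-1} y,
\qquad
C = \begin{bmatrix} X & Z \end{bmatrix},
\quad
D = \begin{pmatrix} 0 & 0 \\ 0 & \Omega \end{pmatrix}
```

```latex
V_B = \left( C^\top \Sigma^{-1} C + D \right)^{-1},
\qquad
V_F = V_B \, C^\top \Sigma^{-1} \,\mathrm{Var}(y)\, \Sigma^{-1} C \, V_B
```

Setting Var(y) = Σ in V_F gives the model-based frequentist estimator; substituting a residual-based "meat" gives robust and cluster-robust variants.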
This idea is not novel to our paper, although we think it is perhaps underappreciated in the literature: Hazlett and Wainstein (2022) consider it in the case of clustered standard errors and a hierarchical model with a single random effect. Our formula, with the same corresponding "meat", recovers an identical variance matrix on β̂ to their proposal in Equation 8 (p. 51) with some re-arrangement. However, a benefit of this formulation is that the joint uncertainty on β̂ and α̂ is quantified, so any quantity of interest that includes both terms (i.e., most marginal effects and predicted values) can be fully incorporated.
Our justification does not assume any particular structure on Ω or Σ, so it applies to generic generalized additive models, as do other types of robust standard errors (e.g., multiway clustering, Conley standard errors for spatial dependence, etc.; Cameron and Miller 2015).
Following the arguments in Wood (2006) that generalize V_F and V_B to non-Gaussian outcomes, one could apply our logic in those settings as well. Exploring these alternative standard errors in more detail is an interesting area for future research.

A.1.1 Implications for Traditional KRLS
We briefly consider the implications for traditional KRLS. Beyond justifying (cluster-)robust standard errors, there is a more subtle point about the appropriate "regular" standard errors for KRLS. Hainmueller and Hazlett (2014, p. 153) propose frequentist standard errors, i.e. V_F assuming Var(y) = σ²I. Noting that in their model β does not exist (and J = 1), Z = K, Ω = (λ/σ²)K, and Σ = σ²I, the possible variance estimators are shown below, where V_HH is the suggestion in the original paper and coincides with V_Wood. While exploring the coverage properties of these estimators in detail is outside the scope of this paper, we note that the suggested variance-covariance matrix from Hainmueller and Hazlett (2014) is not the standard recommendation for generalized additive models (V_B) and thus may have worse coverage properties than the Bayesian alternative. We examine this in a simple stylized example, using a bivariate smoothing example from mgcv, shown below. We draw N = 250 observations indexed by i ∈ {1, · · · , N}.
In this case, we compare coverage over the response surface itself. The table reports the size of the average standard error on a predicted value, the coverage averaged across all points and simulations, and the mean absolute error (MAE) in prediction. Finally, we use a simple example that illustrates the importance of clustered standard errors. Sticking to the above simulation, we assume that there is a group g drawn randomly from one of forty clusters (g ∈ {1, · · · , 40}). For two observations in the same cluster, our generative model assumes they are correlated with ρ = 0.6 and are otherwise uncorrelated. We re-estimate the same models from before, but also consider cluster-robust errors for gam and gKRLS.

A.2 Alternative Estimation Methods
The main text notes that mgcv provides two methods for estimation: gam and bam. gam is designed to be a highly stable numerical procedure. However, it is sometimes slow especially when the number of observations is very large (e.g., the simulations in Appendix B) or when the number of parameters is very large (e.g., the replication of Newman 2016). The latter may often occur when a random intercept with many levels is included. In these cases, some alternative estimation technique is needed. Wood, Goude and Shaw (2015) develop bam to address these limitations. The two main innovations are as follows (see Wood 2017, p. 290 for a concise summary): First, bam slightly alters the estimation procedure for λ.
In the initial exposition in the main paper, for a proposed λ, β̂_λ and α̂_λ are estimated to convergence, e.g. using penalized iteratively re-weighted least squares (PIRLS) with multiple iterations for a non-Gaussian outcome. This can be expensive, so bam optimizes λ after each step of PIRLS estimation. This is known as "performance oriented iteration" and can be less numerically stable than the main algorithm (Wood, Goude and Shaw, 2015).
Second, and perhaps more interestingly, bam never forms the entire design matrix.
Rather, it splits the data into chunks of size p (10,000 by default) and then builds the matrix iteratively as needed. This allows it to exploit multiple cores-although we do not use this in our paper. This can lower the memory footprint of the algorithm considerably.
One point of caution, however, is that the bases for the penalized terms (i.e., the hierarchical terms) are formed only on a random sample of chunk size p taken at the start of the algorithm. For our purposes, this means that bam freezes the size of the sketching matrix at δ·p^(1/3). If p = 10,000 (the default), this is around 21δ whenever N > 10,000. The chunk size can be modified using arguments to bam, although increasing it may slow down the algorithm somewhat.
Thus, while bam can be helpful for large-scale problems, one should be careful to check any potential impacts of chunk size p (perhaps by increasing δ).
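To make the arithmetic concrete, a quick check of the frozen sketching dimension under the default chunk size; the function name and the δ value are illustrative:

```python
# Sketching dimension grows as delta * N**(1/3) under gam, but bam freezes
# it at delta * p**(1/3), where p is the chunk size (10,000 by default).
def sketch_dim(delta, n, chunk=None):
    """Approximate sketching dimension, optionally frozen at the chunk size."""
    effective_n = min(n, chunk) if chunk is not None else n
    return delta * effective_n ** (1 / 3)

# With p = 10,000 the cube root is about 21.5, so the dimension stays near
# 21-22 * delta no matter how large N grows.
print(round(sketch_dim(5, 1_000_000, chunk=10_000)))  # frozen by bam
print(round(sketch_dim(5, 1_000_000)))                # keeps growing under gam
```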
We examine bam systematically throughout the appendices. The main places where the chunk size issue could cause different results are (i) the simulations in Section 4 where N > 10,000 and (ii) the analysis of Gulzar, Haas and Pasquale (2020) where N = 30,000 for the full-sample analysis (although note that N < 10,000 for the algorithms relying on sample-splitting). We find little evidence of difference between bam and gam.

A.3 Calculating Average Marginal Effects
Our accompanying software provides the ability to calculate average expected outcomes (e.g., average predicted probabilities) as well as "marginal effects" (e.g., first differences).
For continuous predictors, we also include the ability to calculate the average marginal effect, i.e., the average of the partial derivative of the prediction with respect to a single covariate (Hainmueller and Hazlett, 2014); Appendix D provides a specific example.
Following existing software, we do this using numerical differentiation; Leeper (2016) provides a detailed discussion. This is especially important for complex models where a single covariate could appear multiple times and an analytical approach is difficult to implement in a flexible fashion. We use the following formula, following Leeper (2016) and shown for a two-argument function for simplicity: the derivative of f(x, z) with respect to x is approximated by [f(x + h, z) − f(x − h, z)]/(2h) for some small h. The default setting is h = max(|x|, 1)·√ε, where max(|x|, 1) ranges over the data distribution that one is marginalizing over (following Leeper 2016) and ε is machine precision.
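A minimal sketch of this computation, with the step-size rule from the text; `predict` is a stand-in for any fitted model's prediction function, and the function name is ours:

```python
import numpy as np

EPS = np.finfo(float).eps  # machine precision

def average_marginal_effect(predict, X, j):
    """Average of the numerical partial derivative of predict(X) w.r.t. column j,
    using a central difference with step h = max(|x_j|, 1) * sqrt(eps)."""
    h = np.maximum(np.abs(X[:, j]), 1.0) * np.sqrt(EPS)
    X_hi, X_lo = X.copy(), X.copy()
    X_hi[:, j] += h
    X_lo[:, j] -= h
    return np.mean((predict(X_hi) - predict(X_lo)) / (2 * h))

# Sanity check on a known function: for f(x) = x^2, df/dx = 2x,
# so the AME should equal 2 * mean(x).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
ame = average_marginal_effect(lambda M: M[:, 0] ** 2, X, j=0)
print(ame, 2 * X[:, 0].mean())
```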

B.1 Assessment of Model Performance
We consider two ways of assessing the performance of gKRLS following Hainmueller and Hazlett (2014). First, the main text considers the out-of-sample predictive accuracy. We do this by generating a dataset of identical size to the estimation data using the same data generating process. We then calculate the prediction for each observation in the test data and summarize the performance using the root mean squared error (RMSE).
Alternatively, we compare the estimated average marginal effect to the true value. We also compare results for gam and bam and types of sketching (sub-sampling sketching or Gaussian), denoted by "method-sketching type". We see that Gaussian sketching is considerably slower and rather expensive after around 45,000 observations.
As noted in the main text, when using gam (sub-sampling sketching; multiplier δ = 5), estimation takes around 30-40 minutes for 1,000,000 observations. This figure illustrates that bam can help considerably. Estimation time is around three minutes, although as discussed in Appendix A.2 this is partially due to a freezing of the size of the sketching dimension.
We can also use the log-log plot to provide a rough estimate of the computational complexity of the various algorithms and sketching procedures; since the plot looks highly linear (above a certain sample size), we can use the slope p of that log-log line to estimate the complexity as roughly N^p for sufficiently large N. As expected, KRLS and bigKRLS are around N³, with an estimated slope of around 3 when the sample size is over 100 for KRLS and 1,000 for bigKRLS. For the sub-sampling sketched methods, if we focus on sample sizes above 40,000 (where the relationship appears clearly linear), the estimated slope is around 1.3, suggesting a cost that increases much more slowly than for the traditional methods. When using bam (discussed in Appendix A.2), the slope is around 0.95, presumably because the size of the sketching dimension does not grow after N > 10,000. With Gaussian sketching, the estimated slope (when N > 2,500) for gam is around 2.10: lower than the unsketched methods but considerably larger than the sub-sampling methods.

Note: This figure shows the root mean squared error (RMSE) of predicting the outcome on a dataset of the same size as the estimation data, averaged across the fifty simulations. The main text defines the abbreviation for each method. 95% confidence intervals are shown; they are calculated using a percentile bootstrap with 1,000 bootstrap samples.
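The slope calculation described above can be reproduced with a least-squares fit on log-transformed quantities; the timings here are synthetic, generated from an exact cubic cost, purely to illustrate the procedure:

```python
import numpy as np

def complexity_exponent(sample_sizes, times):
    """Estimate p in time ~ N**p from the slope of the log-log regression line."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(times), 1)
    return slope

# Synthetic timings with an exact cubic cost: the recovered exponent is 3.
n = np.array([1_000, 2_000, 5_000, 10_000, 20_000], dtype=float)
t = 1e-9 * n ** 3
print(complexity_exponent(n, t))  # ~3.0
```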

B.3 Estimation Time Disaggregation
When estimating gKRLS and using it for inference, there are three steps that the applied user may perform; it is useful to know which takes more time as sample size increases. First, the model must create the (sketched) kernel and estimate the parameters ("Estimation"). Second, one might wish to perform prediction on a new dataset; we use one of the same size as the estimation data ("Prediction"). Finally, one might wish to calculate marginal effects ("Marg. Effects"); this requires repeated predictions on counterfactual versions of the estimation data. The total time is reported in the main text; the figure here provides information on the impact of sketching in this more complicated scenario.

C.1 Alternative Performance Metrics
In addition to the RMSE, Figure A.6 shows the estimated bias of the methods in Figure 3 for the AME of the main covariate of interest (x i,1 ). It shows, as expected, that the fixed effects estimator is unbiased and that the bias of the OLS and RE methods increase considerably as ρ increases. In the linear model, the kernel methods are biased downwards although the bias for gKRLS is considerably smaller. In the non-linear case, both have a small bias, although bigKRLS's is slightly larger at small ρ.
We next compare the out-of-sample predictive accuracy of the methods, estimated using the procedure in Appendix B.1. The figure shows that, in the linear case, the kernel methods perform worse than fixed effects or random effects, although the differences are more modest. In the non-linear case, we see the kernel methods perform the best, although gKRLS out-performs bigKRLS by a considerable margin.

C.2 Performance on Other Covariate (x i,2 )
We replicate Figure 3 from the main text on the covariate x_{i,2}, i.e. one that is not correlated with the fixed effect. Figure A.8 shows that, in the linear case, all methods perform rather similarly at estimating this average marginal effect, including methods such as random effects or OLS. In the spline case, gKRLS clearly out-performs all other methods.

Note: The figure reports the bias of the estimated average marginal effect (AME) on the second covariate (x_{i,2}), averaged across 1,000 simulations, as ρ varies. The left panel shows the linear data generating process and the right shows the non-linear data generating process. 95% confidence intervals are shown; they are calculated using a percentile bootstrap with 1,000 bootstrap samples.

C.3 Varying Number of Observations Per Group
In the main text, we consider 50 groups (J = 50) and set the group size (i.e. the number of observations per group) at 10. We vary that here and consider T ∈ {5, 10, 25, 50}.

Note: This figure reports the RMSE of the estimated AME on the first covariate (x_{i,1}) at varying sample sizes for the linear data generating process. The shaded box indicates the values reported in the main analyses, i.e. T = 10. 95% confidence intervals are shown; they are calculated using a percentile bootstrap with 1,000 bootstrap samples.
Looking first at the linear case (Figure A.9), we see that as the cluster size increases, random effects and both kernel methods improve considerably. For larger cluster sizes, gKRLS is close to the RMSE of the fixed effects method, although bigKRLS remains noticeably worse. This provides further evidence of the limitations of adding the fixed effects directly into the kernel. In the non-linear case, the story is similar (random effects improves with increasing sample size toward fixed effects), although gKRLS and bigKRLS are much closer in performance (as in the main text).

C.4 Understanding Why gKRLS Does Better
This section explores in more depth why gKRLS provides an improvement on bigKRLS.
We first consider the model most similar to bigKRLS fit using mgcv (i.e. no sketching, standardized covariates, and fixed effects inside the kernel). We then vary each of these dimensions separately toward the default settings in gKRLS and see which affects performance.
This allows us to disaggregate what seems to improve performance in a more controlled setting. We consider the following models.
9. "gKRLS": The method shown in the main text. That is, "gKRLS (GCV)" but using REML for penalty parameter selection.

Figure A.11 reports the results, where the estimated RMSE on the AME is divided by the RMSE from bigKRLS without truncation. Values above 1 indicate worse performance; values below 1 indicate better performance. The first observation is that the "Baseline" gKRLS model that is as similar as possible to bigKRLS returns nearly equivalent performance. This corroborates the results in the first set of simulations (Section 4).
Next, consider the models that change only one feature of the "Baseline" model: "Mahal.", "FE", "Sketch(5)". With a linear data generating process, all three result in improved performance. In the non-linear data generating process, "Sketch(5)" results in worse performance. Upon further investigation, this is seemingly a function of using GCV (Generalized Cross-Validation) to select λ; the bar below ("Sketch(5)+REML") uses REML instead of GCV and finds a slight degradation in performance (see Appendix C.5), but nothing as catastrophic as with GCV. This may be due to some weaknesses of GCV, including a tendency to overfit for modestly sized datasets (see Wood 2011).
Of all of the simple modifications, note that "FE" is the only one that results in considerably improved performance. Thus, this provides evidence that putting the fixed effects outside of the kernel is what drives stronger performance of gKRLS.
Next, we examine the "Naive" specification; this simply swaps bigKRLS for gKRLS, using random sketching and Mahalanobis distance while keeping the fixed effects inside the kernel. In the non-linear setting, "Naive" incurs some penalty against a method with no sketching ("Baseline"). By contrast, "gKRLS (GCV)", which is the sketched, Mahalanobis version of "FE", incurs a very slight penalty relative to the unsketched version ("FE"). Methods that include fixed effects outside the kernel ("gKRLS (GCV)", "gKRLS", "gKRLS + Lin.") improve considerably upon the "Baseline" and "Naive" models, especially with a non-linear data generating process. In terms of penalty parameter selection after using sketching and Mahalanobis distance (REML vs. GCV; "gKRLS" and "gKRLS (GCV)", respectively), we see a small negative impact for REML in the linear model, although there is little difference in the non-linear case.
Finally, we consider the "gKRLS + Lin." specification that includes the group indicators and two covariates as unpenalized fixed effects β as well as a kernel on the two continuous covariates. In this model, there is one hierarchical term (J = 1) and as λ → ∞, this model converges to a linear additive model of the group indicators and two covariates.
By contrast, simply placing a kernel on the two variables (with no linear component) would converge to a model that predicts the mean for all observations as λ → ∞. We see that in the linear case, where this model is true, this specification has very strong performance. However, in the non-linear model, it incurs some penalty versus the "default" gKRLS. This suggests that the decision of whether to include linear terms in addition to the kernel should either be tested empirically or motivated by what theory suggests is a reasonable limiting case for the model as λ → ∞.

C.5 Impact of Sketching
Given the somewhat varying results of the sketching procedures depending on whether the fixed effects are included in the kernel, we conducted one additional set of simulations to more systematically understand the impact of sub-sampling sketching in this more complicated example. We focus on sub-sampling sketching and conduct the following set of simulations using the same data generating process as in Section 4.1.
• We consider two quantities of interest. First, what is the RMSE of the average marginal effect on x_{i,1} (the key quantity of interest in the main text) across the 100 repetitions for each δ? We define this as follows, where AME* is the true AME as calculated in our main simulations (see Appendix B.1) and AME_δ^{(m)} is the AME estimated for the m-th repetition with sketching multiplier δ:

RMSE_δ = √[(1/100) Σ_{m=1}^{100} (AME_δ^{(m)} − AME*)²]

Next, we also calculate the AME for the unsketched method; define this as AME_unsketch. We can calculate the corresponding RMSE without sketching as RMSE_unsketch, which corresponds to the absolute error because the unsketched model is deterministic for a single dataset:

RMSE_unsketch = |AME_unsketch − AME*| = √[(AME_unsketch − AME*)²]

• Given those quantities for a single simulated dataset, we repeat this process 150 times, as the quality of the unsketched estimates can vary across datasets. Define s ∈ {1, · · · , 150} as indexing this "outer" simulation. We average the estimated RMSEs across simulations and create a measure of the "relative impact" of sketching by comparing the averaged RMSE with sketching to the averaged RMSE without it.

As expected from the earlier simulations, including the fixed effects outside the kernel is highly amenable to sketching; at almost any multiplier, the RMSE is very close to the unsketched estimates, and this does not seem especially dependent on ρ or the standardization method.

Note: The RMSE for each sketching multiplier, ρ, and model specification is shown. 95% confidence intervals, using a percentile bootstrap over the 150 datasets with 1,000 bootstrap samples, are shown. The RMSE of the unsketched method is shown to the right of the vertical bar.
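A compact sketch of these two quantities, using fabricated AME draws in place of actual model refits; treating the relative impact as a proportional increase in RMSE follows the interpretation given in the text, and all names here are ours:

```python
import numpy as np

def rmse_sketched(ame_draws, ame_true):
    """RMSE of the sketched AME estimates across repetitions."""
    return np.sqrt(np.mean((np.asarray(ame_draws) - ame_true) ** 2))

def relative_impact(rmse_sketch_avg, rmse_unsketch_avg):
    """Proportional increase in RMSE from sketching vs. the unsketched fit."""
    return (rmse_sketch_avg - rmse_unsketch_avg) / rmse_unsketch_avg

# Illustration with fabricated numbers: 100 sketched estimates scattered
# around a true AME of 0.5, and a single unsketched estimate of 0.51.
rng = np.random.default_rng(1)
draws = 0.5 + rng.normal(scale=0.02, size=100)
r_sketch = rmse_sketched(draws, ame_true=0.5)
r_unsketch = abs(0.51 - 0.5)  # deterministic, so RMSE is the absolute error
print(relative_impact(r_sketch, r_unsketch))
```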
By contrast, including the fixed effects inside the kernel shows more nuanced results.
Before examining sketching, we note that including the fixed effects inside the kernel results in worse performance, on average, at every δ, ρ and standardization method than including them outside the kernel. However, given that choice of specification, there is more divergence between the standardization methods (especially with poor performance for no standardization as ρ grows). For this complex kernel, the convergence in the estimated RMSE to the unsketched method is much slower (i.e., a large multiplier δ is needed to closely approximate the RMSE of the unsketched method).
To show this more clearly, we report the standardized change in RMSE for sketching: RelativeImpact_δ, defined above. This can be interpreted as the proportional increase in RMSE that comes from using sketching with multiplier δ versus the unsketched method.

Note: This figure shows the relative impact of sketching (RelativeImpact_δ) for each sketching multiplier, ρ, and model specification (described in the main text). 95% confidence intervals are reported using a percentile bootstrap over the 150 datasets with 1,000 bootstrap samples.

Interestingly, scaled covariates (mean zero; variance one) seem to improve more quickly than Mahalanobis distance relative to the unsketched baseline, although the absolute performance of scaled vs. Mahalanobis standardization depends on ρ (see Figure A.12). Overall, while more work is needed to understand the impacts of sketching with these complex kernels, a likely cause is that the sketched kernel contains zero or only a few observations for each group, so estimation may be unreliable. This intuition is discussed further in Yang, Pilanci and Wainwright (2017, Example 4) regarding issues with sub-sampling sketching when there is considerable variation in the data. Future research might explore better methods for sketching to improve performance even with kernels that include fixed effects (e.g., those discussed in Lee and Ng 2020).
For example, stratified sampling based on the fixed effect may be helpful in improving performance.

D Additional Results: Newman (2016)
This appendix provides additional results for the analysis in Section 5. First, we provide information on the questions analyzed in Newman (2016). For the main analysis, the respondent is coded as "1", rejecting meritocracy, if they agreed with the statement presented to them (or with both statements if both were presented), shown below. Otherwise, they are coded "0" (Newman 2016, p. 1013).
• "Success in life is pretty much determined by forces outside our control"

• "Hard work and determination are no guarantee of success for most people"

We consider the following two quantities. First, the average predicted probability p̄(e) as a function of earnings inequality e, where x_i indicates all other covariates in the model.
For a secondary analysis of the lack of a statistically detectable non-linear effect of economic inequality, we considered the following tests. Define e*_min as the smallest value of earnings inequality considered (0.349), define e*_inf as the model-specific inflection point (or the closest point in the grid of values considered), and define e*_max as the largest value considered (1.119).
The first test examines the difference in the average marginal effect at the end points, i.e. AME(e*_max) − AME(e*_min). A second test examines the average second derivative: for a concave function (as posited by Newman 2016, p. 1011), the second derivative should be negative. Figure A.14 shows that, for the model used in Newman (2016), the average second derivative is usually negative and its confidence intervals do not contain zero except in the most extreme regions. For gKRLS, however, the confidence intervals contain zero at all points considered.

D.1 Additional Questions
The final test uses a secondary analysis in Newman (2016) where, on a different survey, the following five questions were analyzed. The text below is quoted directly from the supporting information in Newman (2016); each question is presented with our annotation. For the two ordinal questions, Newman (2016) focuses on the probability of "major reason".
Barriers to Women's Professional Advancement

Traits of Men and Women
"Now I would like to ask about some specific characteristics of men and women.
For each one I read, please tell me whether you think it is generally more true of men or more true of women": (1) Intelligent [Q11A], and (2) Arrogant [Q11G]. For "Intelligent," constructed variable is dichotomous, and coded "1" for respondents who believed the trait is "More true of women" and "0" otherwise. For "Arrogant," constructed variable is dichotomous, and coded "1" for respondents who believed the trait is "More true of men," and "0" otherwise.

Gender Equality in Nation
"Which of these two statements comes closer to your own views-even if neither is exactly right": "This country has made most of the changes needed to give women equal rights with men" OR "The country needs to continue making changes to give women equal rights with men." [Q13]. Constructed variable coded "1" if respondent selected the latter statement and "0" otherwise.

E Additional Results: Gulzar et al. (2020)

This appendix provides additional results for the analysis in Section 6. We examine the sensitivity of the estimates to (i) the size of the sketching multiplier (5 or 15) and (ii) the use of bam or gam (see Appendix A.2 for a discussion). For each of the twelve outcomes, we ran the model fifty times. Note that for the double/debiased machine learning methods (DML-PLR; DML-ATE), there is an additional source of randomness, the five folds that the data is split into, that is also accounted for in the uncertainty.

In terms of computational cost, Table A.3 reports the average estimation time for each method, averaged across all repeated estimations and outcome variables. It shows, as expected, that increasing the sketching multiplier can increase the cost considerably, especially for the double/debiased machine learning methods or a kernel with many covariates ("gKRLS (All)"). bam provides considerable increases in speed, especially in the case of the DML methods.

Note: This table reports the average estimation time in minutes on a computer with 8GB of RAM and 1 core, averaged across the simulations and outcome variables. "DML" reports the time of two methods (DML-PLR; DML-ATE) discussed in the main text. "gKRLS" reports the two kernel methods ("gKRLS (Geog.)"; "gKRLS (All)") discussed in the main text. "Method" indicates whether bam or gam was used along with the sketching multiplier ("-5"; "-15").

Note: This figure reports the estimated effects and 95% confidence intervals from fifty repetitions of each model for each outcome and group.
The horizontal axis is truncated as occasionally the confidence interval from a method is very large. The top panel shows the results of gKRLS that includes all covariates ("gKRLS (All)") and the bottom includes results from a method that only includes the geographic coordinates ("gKRLS (Geog.)" .

E.2 Additional Results: Heterogeneous Effects
We conduct an additional application of gKRLS on the Gulzar, Haas and Pasquale (2020) data to estimate heterogeneous effects. There is a large and active literature on how to use arbitrary machine learning algorithms to estimate heterogeneous effects (e.g., Künzel et al. 2019; Nie and Wager 2021). We focus on a recent method (the "R-learner") by Nie and Wager (2021) as representative of the current state of the art.
The R-learner is a type of meta-learner (see Künzel et al. 2019 for a general review) that allows the researcher to use an arbitrary machine learning algorithm to estimate heterogeneous treatment effects. The core innovation of the R-learner is to estimate a heterogeneous treatment effect function τ*(·) (a function of a vector of pre-treatment covariates X_i; note that this notation differs from that in the rest of our paper) by minimizing the following empirical objective on a dataset with N observations:

\hat{\tau}(\cdot) = \operatorname{argmin}_{\tau} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ \{Y_i - m^{*}(X_i)\} - \{W_i - e^{*}(X_i)\}\,\tau(X_i) \right]^2 + \Lambda_n\{\tau(\cdot)\} \right\}

where m*(x) and e*(x) represent the conditional mean outcome, E[Y_i | X_i = x], and the treatment propensity, P(W_i = 1 | X_i = x), respectively. Λ_n{τ(·)} regularizes the estimated τ(·) function.
One of the meanings of "R" in R-learner gestures at the fact that this objective depends on residualizing the observed outcome Y_i and the treatment W_i from their expected values (m*(X_i) and e*(X_i), respectively). This ensures a more robust objective function when estimating the heterogeneous treatment effects (Nie and Wager 2021). The key difficulty is that the two functions m*(·) and e*(·) are unknown, so τ̂(·) cannot be estimated directly. A second key innovation of Nie and Wager (2021, p. 301) is to first estimate m* and e* and use these "pilot estimates" to create a feasible version of the objective for estimating τ̂(·).
They propose obtaining estimates m̂ and ê using a procedure known as "cross-fitting." This process is similar in implementation to ensemble methods (e.g., stacking or SuperLearning) and starts by separating the data into K folds. Using K − 1 folds of the data, one estimates the conditional mean outcome (i.e., predicting Y_i with covariates X_i) and the propensity score (i.e., predicting treatment W_i with covariates) and then generates predictions on the held-out fold. By cycling through all of the folds, one obtains an out-of-sample prediction for each observation. These are combined as shown below to estimate the heterogeneous treatment effect function τ̂(·). Formally, the R-learner algorithm is sketched below (Nie and Wager, 2021, p. 301).
1. Split the data into K folds. Fit m̂ and ê using cross-fitting with some machine learning algorithm, i.e., hold out one fold and estimate the model using the other K − 1 folds.
2. Estimate τ(·) using the out-of-sample predictions for each observation. Define m̂^(−k(i))(X_i) as the estimate of m*(x) that does not use the fold k of which i is a member; define ê^(−k(i))(X_i) similarly. The objective is shown below, where some machine learning algorithm is used to estimate τ̂(·):

\hat{\tau}(\cdot) = \operatorname{argmin}_{\tau} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ \{Y_i - \hat{m}^{(-k(i))}(X_i)\} - \{W_i - \hat{e}^{(-k(i))}(X_i)\}\,\tau(X_i) \right]^2 + \Lambda_n\{\tau(\cdot)\} \right\}

While any machine learning algorithm can be used for the R-learner, there is an especially interesting reason to use gKRLS. Nie and Wager (2021) prove that if one uses traditional KRLS in estimating τ̂(·) (and under some weaker assumptions on the quality of the estimates of m̂ and ê), then the resulting estimator τ̂(·) has a "quasi-oracle" property.
Roughly speaking, this means that the accuracy of the estimate τ̂(·) is asymptotically equivalent to the accuracy one would obtain if the researcher knew m*(X_i) and e*(X_i) exactly.
Thus, there is no loss of information from having to use the estimated analogues. This provides a formal justification for using the "pilot estimates" m̂^(−k(i))(X_i) and ê^(−k(i))(X_i) in the final estimation of τ̂(·).
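The two steps above can be sketched in R. The snippet below is a minimal illustration, not the exact code used in the paper: it assumes the gKRLS package registers the bs = "gKRLS" smooth for mgcv (as in the formula syntax shown in Appendix F), simulates toy data, and uses the familiar weighted-least-squares form of the R-learner objective, i.e., regressing the pseudo-outcome (Y_i − m̂)/(W_i − ê) on X_i with weights (W_i − ê)^2. All object names are illustrative.

```r
library(mgcv)
library(gKRLS)  # registers the bs = "gKRLS" basis for mgcv smooths

set.seed(123)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$w <- rbinom(n, 1, plogis(0.5 * dat$x1))          # toy treatment
dat$y <- dat$x2 + dat$w * (1 + dat$x1) + rnorm(n)    # toy outcome

# Step 1: cross-fit the pilot estimates of m*(x) and e*(x)
K <- 5
folds <- sample(rep(1:K, length.out = n))
m_hat <- e_hat <- numeric(n)
for (k in 1:K) {
  train <- dat[folds != k, ]
  held  <- dat[folds == k, ]
  fit_m <- gam(y ~ s(x1, x2, bs = "gKRLS"), data = train)
  fit_e <- gam(w ~ s(x1, x2, bs = "gKRLS"), data = train, family = binomial())
  m_hat[folds == k] <- predict(fit_m, newdata = held)
  e_hat[folds == k] <- predict(fit_e, newdata = held, type = "response")
}

# Step 2: weighted-least-squares form of the R-learner objective
dat$pseudo <- (dat$y - m_hat) / (dat$w - e_hat)
dat$wt     <- (dat$w - e_hat)^2
fit_tau <- gam(pseudo ~ s(x1, x2, bs = "gKRLS"), data = dat, weights = wt)
tau_hat <- predict(fit_tau, newdata = dat)  # estimated heterogeneous effects
```

In practice one would also tune the sketching multiplier and consider robustness to the fold assignment, as discussed in Appendix E.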
In their numerical experiments, Nie and Wager (2021) consider problems of 500 or 1,000 observations. Thus, gKRLS allows the theoretical promise of the R-learner combined with KRLS to be scaled to much larger datasets.
We apply this to the Gulzar, Haas and Pasquale (2020) application, using gKRLS for all machine learning procedures, where we include the covariates linearly as well as a kernel that includes all of the covariates. 12 The formula in pseudo-code (see Appendix F) is shown below, where "x1 + x2 + ..." indicates the covariates:

y ~ x1 + x2 + ... + s(x1, x2, ..., bs = "gKRLS")

The procedure takes under three minutes on a machine with 8 GB of RAM and a single core. Our initial examinations of the estimated heterogeneous effects suggested a key role of geography.

12 After our first explorations of the Gulzar, Haas and Pasquale (2020) data, we noted that two controls (share of scheduled tribes in 1991 and 2001) are extremely highly correlated with the treatment indicator in Himachal Pradesh (0.87 and 0.94, respectively; standard errors of 0.039 and 0.027, respectively) but in no other state (correlations ranging from -0.05 to 0.25; standard errors ranging from 0.010 to 0.031). No other control variable's correlation with treatment within a state is higher than 0.47 (standard error of 0.070). Following supplemental analyses in the original paper, we find that this imbalance on "share of scheduled tribes in 2001" in Himachal Pradesh between treated and untreated units does not decrease as the geographic bandwidth decreases, although it does decline in all other states. The imbalance on "share of scheduled tribes in 1991" does decrease, although more slowly. Exploring this and its implications for Gulzar, Haas and Pasquale (2020) in more detail is outside the scope of this project. However, we exclude these two covariates from the heterogeneous effect analysis as their inclusion led to unstable estimates for Himachal Pradesh given minor changes in specification. Their exclusion has a limited effect on the results in the main text (e.g., Figure 5).
Figure A.19 shows that, in general, one state (Himachal Pradesh) exhibits quite different patterns of estimated treatment effects. It shows the distribution of effects for each observation in each state across the four groups and three outcomes. In total, their analysis includes nine states. We also include a symbol (■) to indicate the estimated treatment effect, without controls, for each variable inside each state. It is reassuring that the treatment effects obtained from the R-learner are similar to those calculated by simply taking the difference-in-means within each state.
Note (Figure A.19): This figure shows the distribution of heterogeneous treatment effects by state, outcome, and group using the R-learner approach described in the main text. The ■ indicates the treatment effect estimated within each state using a difference-in-means estimator.
In substantive terms, Himachal Pradesh has much larger effects for the targeted minority group (Scheduled Tribes). We also find slightly negative effects for the non-targeted minority group (Scheduled Castes) and large negative effects for other individuals (non-minorities). While fully exploring the reason for the differential effects in Himachal Pradesh is outside the scope of this paper, we note that it is distinct among the states considered in that it (i) has the smallest population, (ii) has the largest share of the non-targeted minority group (Scheduled Castes; 25% of the population in 2001), and (iii) has the smallest share of the targeted group (4% of the population in 2001). This suggests an interesting qualification of the effect of an electoral quota scheme designed to benefit a small minority group when there is another larger, though still historically disadvantaged, group that is not affected by the scheme. We found similar distributions of estimates by state if we used different multipliers (e.g., 5 vs. 15) and gam instead of bam.

F Software
This appendix briefly discusses software packages for estimating KRLS and provides a brief demonstration of how to use the core function in the gKRLS package.

F.1 Existing Software Packages
In the main paper, we consider KRLS and bigKRLS, the two packages for estimating KRLS best known to political scientists. Both assume a Gaussian outcome and must include all predictors in the kernel. An additional package, KSPM (Schramm et al. 2020), provides additional flexibility by allowing multiple kernels and additive terms. It is still considerably less flexible than mgcv (e.g., it lacks other methods of penalization, non-Gaussian outcomes, sketching, and selection of the penalty parameter by criteria other than LOO-CV). It does have some novel features (e.g., alternative kernels besides the Gaussian one; interactions between kernels) that would be straightforward to include in future development of gKRLS. Unfortunately, preliminary experiments also suggested that it was considerably slower than either KRLS or mgcv, and thus it was not explored further in our empirical analyses.

F.2 Software Demonstration
We demonstrate how to estimate gKRLS, compute average marginal effects, and perform double/debiased machine learning.

fit_gKRLS <- gam(y ~ x1 + x2 + s(x1, x2, bs = "gKRLS"),
    data = data)
We can also integrate gKRLS with double/debiased machine learning methods.
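As a minimal illustration (not the exact code used in the paper), the partially linear DML estimator can be assembled by hand from cross-fitted gKRLS fits: residualize the outcome and the treatment using out-of-fold predictions, then regress one residual on the other. The snippet assumes the bs = "gKRLS" smooth is available through mgcv, and all object names and the simulated data are illustrative.

```r
library(mgcv)
library(gKRLS)  # registers the bs = "gKRLS" basis for mgcv smooths

set.seed(456)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$w <- rbinom(n, 1, plogis(dat$x1))   # toy binary treatment
dat$y <- 2 * dat$w + dat$x2 + rnorm(n)  # toy outcome, true effect = 2

# Cross-fit E[Y|X] and E[W|X] with gKRLS, predicting on held-out folds
K <- 5
folds <- sample(rep(1:K, length.out = n))
y_hat <- w_hat <- numeric(n)
for (k in 1:K) {
  train <- dat[folds != k, ]
  held  <- dat[folds == k, ]
  y_hat[folds == k] <- predict(gam(y ~ s(x1, x2, bs = "gKRLS"), data = train),
                               newdata = held)
  w_hat[folds == k] <- predict(gam(w ~ s(x1, x2, bs = "gKRLS"),
                                   data = train, family = binomial()),
                               newdata = held, type = "response")
}

# Partialling-out estimate of the treatment effect (DML-PLR)
res_y <- dat$y - y_hat
res_w <- dat$w - w_hat
theta <- coef(lm(res_y ~ res_w - 1))  # point estimate of the treatment effect
```

Cluster-robust or heteroskedasticity-robust standard errors for theta can then be obtained with standard tools for linear models.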