1 Introduction
Spatial regression models are popular in the social sciences for modeling spatial dependencies within data. At the core of these models is the spatial weights matrix $\boldsymbol{W}$, an $n\times n$ matrix with zeroes along the diagonal that, in the cross-sectional context, contains the set of spatial relationships in the data in its off-diagonal elements: each element $w_{ij}$ represents the relative influence unit $j$ has on unit $i$. More extreme values of $w_{ij}$ indicate a stronger influence of unit $j$ on unit $i$. Likewise, unit $j$ has no influence on unit $i$ when $w_{ij}$ equals zero.
There are many approaches to specifying $\boldsymbol{W}$, both theoretical and empirical (Bhattacharjee and Jensen-Butler 2013; Neumayer and Plümper 2016). There are also a number of challenges to specifying $\boldsymbol{W}$, including choosing the correct conception of space, choosing the correct means of expressing that conception of space, and measuring $w_{ij}$ without error. Potential misspecification creates a problem for scholars, as the estimates from spatial regression models are conditional upon $\boldsymbol{W}$. A misspecification of $\boldsymbol{W}$, even by a constant, will cause inconsistent estimates (Neumayer and Plümper 2016). This is not the case in other regression contexts, where inflation of a variable by a constant is absorbed into the intercept. Scholars largely agree that the misspecification of $\boldsymbol{W}$ creates problems for inference, though this view is not universal (Corrado and Fingleton 2012; Gibbons and Overman 2012; but see LeSage and Pace 2014).
Despite the risks of misspecifying $\boldsymbol{W}$, few techniques exist for evaluating a particular specification. Some existing techniques adjudicate between competing $\boldsymbol{W}$ matrices (Leenders 2002; Stakhovych and Bijmolt 2009). But these approaches are limited in that they can only help judge the best $\boldsymbol{W}$ among a set of choices, leading to the possibility of picking the best out of a set of badly misspecified matrices (Harris, Moffat and Kravtsova 2011). Neumayer and Plümper offer a test to determine whether the nondiagonal elements of $\boldsymbol{W}$ that are set to zero are valid (2016, p. 123). The test, however, is only theoretically described and gives no indication about the potential misspecification of nonzero elements of $\boldsymbol{W}$.
This letter highlights the problematic bias resulting from the misspecification of $\boldsymbol{W}$ and provides a test and control for some misspecifications. Specifically, each $w_{ij},i\neq j,$ within $\boldsymbol{W}$ is thought to be measured with some amount of error. From this approach, I develop a theoretically appealing regression-based test of the specification of $\boldsymbol{W}$ that is tractable when using panel data. I call this the K test, which is simply another spatial lag where all nondiagonal elements of $\boldsymbol{W}$ equal one. The test is able to detect when $\boldsymbol{W}$ is inflated by a constant in expectation. Inclusion of the K test in the regression model controls for part of the endogeneity resulting from this type of misspecification, at least in SLX models, much like how including an intercept in a regression model controls for the inflation of a variable by a constant.
2 The Specification of W as a Measurement Error Problem
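To begin, I examine classical measurement error in a simple regression context. Consider the following data-generating process:
$$\boldsymbol{y}=\boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{\varepsilon}, \tag{1}$$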
in which $\boldsymbol{\varepsilon}$ is a well-behaved error term and $\boldsymbol{\iota}$ is a vector of ones associated with the intercept parameter $\alpha$. Suppose a scholar would like to model the relationship between $\boldsymbol{x}^{\ast }$ and $\boldsymbol{y}$ but does not have access to $\boldsymbol{x}^{\ast }$. Instead, the scholar has access to $\boldsymbol{x}$, a function of $\boldsymbol{x}^{\ast }$:
$$\boldsymbol{x}=\boldsymbol{x}^{\ast }+\boldsymbol{u}^{\ast }, \tag{2}$$
in which $\boldsymbol{u}^{\ast }$ is an error variable with mean $c$ that is uncorrelated with $\boldsymbol{x}^{\ast }$. It will be useful for our purposes to decompose $\boldsymbol{u}^{\ast }$ into mean-zero measurement error and a constant:
$$\boldsymbol{x}=\boldsymbol{x}^{\ast }+\boldsymbol{u}+\boldsymbol{c}, \tag{3}$$
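in which $\boldsymbol{u}$ has a mean of zero but is otherwise similarly distributed to $\boldsymbol{u}^{\ast }$ and $\boldsymbol{c}$ is a vector of the constant $c$. By substituting equation (3) into equation (1), we observe the problems associated with measurement error in the independent variable:
$$\begin{aligned}\boldsymbol{y} &= \boldsymbol{\iota}\alpha +(\boldsymbol{x}-\boldsymbol{u}-\boldsymbol{c})\beta +\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}\beta -\boldsymbol{u}\beta -\boldsymbol{c}\beta +\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}\beta +(\boldsymbol{\varepsilon}-\boldsymbol{u}\beta -\boldsymbol{c}\beta ).\end{aligned} \tag{6}$$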
In equation (6), there are two sources of endogeneity in the model. The first results because $\boldsymbol{u}\beta$ is contained within the error term. The terms $\boldsymbol{u}$ and $\boldsymbol{x}$ are necessarily correlated given the data-generating process in equation (3). In the bivariate regression case, the estimate of $\beta$ will be biased toward zero; this bias is commonly known as attenuation bias. In the multiple-regression context, bias will permeate all independent variables in unknown directions in addition to the attenuation bias in the mismeasured variable. Scholars have developed a number of approaches to circumventing attenuation bias (see Buonaccorsi 2010). In general, however, the onerous requirements for these solutions mean that they are rarely implemented in the applied literature (Greene 2017).
The second source of endogeneity comes from $\boldsymbol{c}\beta$, which results in the expectation of the error term being nonzero. This endogeneity is transferred from the error term to the intercept parameter $\alpha$; as a consequence, only $\alpha$ is biased. This is rarely problematic for scholars, as the quantity of interest is usually not $\alpha$ but rather $\beta$. Thus, while inferences from regression models are not immune to measurement error, part of the endogeneity from nonzero measurement error is solved by including an intercept.
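The two sources of endogeneity can be illustrated numerically. The following is a minimal sketch rather than part of the original analysis; the sample size, parameter values, normal distributions, and the helper `ols` are all assumptions chosen for illustration. It shows that a constant shift in the regressor moves only the intercept, while mean-zero error attenuates the slope.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n, alpha, beta, c = 100_000, 1.0, 2.0, 3.0  # illustrative values only

x_star = rng.normal(size=n)                  # true regressor
y = alpha + beta * x_star + rng.normal(size=n)
ones = np.ones(n)

# Case 1: x is off by a constant only (u = 0).
x_shift = x_star + c
a_shift, b_shift = ols(np.column_stack([ones, x_shift]), y)

# Case 2: x carries mean-zero error u plus the constant c.
u = rng.normal(size=n)
x_noisy = x_star + u + c
a_noisy, b_noisy = ols(np.column_stack([ones, x_noisy]), y)

print(f"constant only:   intercept={a_shift:.2f}, slope={b_shift:.2f}")
print(f"constant + noise: intercept={a_noisy:.2f}, slope={b_noisy:.2f}")
```

In the first case the slope should stay close to $\beta$ while the intercept moves to roughly $\alpha - c\beta$. In the second case the slope shrinks toward zero by the usual reliability ratio, which is one half here because $\boldsymbol{u}$ and $\boldsymbol{x}^{\ast }$ are given equal variance.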
2.1 The Problem of Measurement Error Within W
As we have just seen, part of the problem associated with nonzero measurement error is remedied by including an intercept in the regression model. Unfortunately, this will not alleviate similar problems when misspecifying $\boldsymbol{W}$ in a spatial regression model. Consider the spatial lag of an independent variable (SLX) model:
$$\boldsymbol{y}=\boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}^{\ast }\boldsymbol{x}^{\ast }\theta +\boldsymbol{\varepsilon}, \tag{7}$$
in which $\boldsymbol{\varepsilon}$ is a well-behaved error term. Suppose a scholar would like to model the relationship between $\boldsymbol{x}^{\ast }$ and $\boldsymbol{y}$ but does not have access to $\boldsymbol{W}^{\ast }$. Instead, the scholar has access to $\boldsymbol{W}$, a representation of $\boldsymbol{W}^{\ast }$ measured with error:
$$\boldsymbol{W}=\boldsymbol{W}^{\ast }+\boldsymbol{U}+\boldsymbol{C}, \tag{8}$$
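in which $\boldsymbol{U}$ is a weights matrix where all nondiagonal elements are drawn from variable $\boldsymbol{u}$, which has a mean of zero, and $\boldsymbol{C}$ is a weights matrix where all nondiagonal elements are equal to a constant $c$. Note how equation (8) mirrors equation (3). By substituting equation (8) into equation (7), we observe unique problems created by measurement error within $\boldsymbol{W}$:
$$\begin{aligned}\boldsymbol{y} &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +(\boldsymbol{W}-\boldsymbol{U}-\boldsymbol{C})\boldsymbol{x}^{\ast }\theta +\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}\boldsymbol{x}^{\ast }\theta -\boldsymbol{U}\boldsymbol{x}^{\ast }\theta -\boldsymbol{C}\boldsymbol{x}^{\ast }\theta +\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}\boldsymbol{x}^{\ast }\theta +(\boldsymbol{\varepsilon}-\boldsymbol{U}\boldsymbol{x}^{\ast }\theta -\boldsymbol{C}\boldsymbol{x}^{\ast }\theta ).\end{aligned}$$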
As before, there are two sources of endogeneity. The first is from the mean-zero measurement error. Because of the correlation between the elements of $\boldsymbol{W}$ and $\boldsymbol{U}$, $\boldsymbol{W}\boldsymbol{x}^{\ast }$ and $\boldsymbol{U}\boldsymbol{x}^{\ast }$ will also be correlated. As mentioned before, this attenuation bias is unlikely to be solved due to the problems previously noted in solving mean-zero measurement error in simpler regression contexts. The second source of endogeneity is derived from the matrix $\boldsymbol{C}$ and its nondiagonal elements $c$. Unlike in the simple regression model, $\boldsymbol{C}$ is postmultiplied by $\boldsymbol{x}^{\ast }$ to create a new variable. Any change to the expectation of the error term resulting from $\boldsymbol{C}\boldsymbol{x}^{\ast }$ is dependent upon the expectation of $\boldsymbol{x}^{\ast }$. As in the previous context, the expectation of $-\boldsymbol{C}\boldsymbol{x}^{\ast }\theta$ will be absorbed into the constant.
The intercept will not absorb the entirety of $-\boldsymbol{C}\boldsymbol{x}^{\ast }\theta$, however, because of the variance associated with $\boldsymbol{x}^{\ast }$. Further, this variation will necessarily be correlated with $\boldsymbol{x}^{\ast }$, since $-\boldsymbol{C}\boldsymbol{x}^{\ast }\theta$ is a linear transformation of the total of $\boldsymbol{x}^{\ast }$ within a cross-section minus the influence of $x_{i}^{\ast }$. Thus, there will be additional bias in $\beta$ and the coefficients of all other variables correlated with $\boldsymbol{x}^{\ast }$ that is unique to the spatial regression context.
It is important here to note that measurement error in $\boldsymbol{W}$ is not unique to the SLX context. Spatial autoregressive and spatial error models (SAR and SEM, respectively) also use $\boldsymbol{W}$ to account for spatial dependence. If $\boldsymbol{W}$ is measured with error, it is highly likely that estimates from these models will also exhibit bias in some way. The additional econometric difficulties associated with these models, however, prevent an analytic demonstration of this bias at present.
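A small numerical sketch may make this concrete. It assumes a single cross-section, no mean-zero error ($\boldsymbol{U}=\boldsymbol{0}$), a uniformly drawn true weights matrix, and illustrative parameter values; none of these choices comes from the original analysis, and `ols` is a hypothetical helper. Under these assumptions $\boldsymbol{C}\boldsymbol{x}$ equals $c$ times the cross-sectional total minus $c\,x_{i}$, so it is perfectly negatively correlated with $\boldsymbol{x}$ and shifts the coefficient on $\boldsymbol{x}$ by $c$ times the spatial-lag coefficient.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(42)
n, alpha, beta, theta, c = 500, 1.0, 5.0, 0.75, 0.5  # illustrative values

K = np.ones((n, n)) - np.eye(n)           # ones off the diagonal
W_true = rng.uniform(0, 1, (n, n))        # assumed form of the true W
np.fill_diagonal(W_true, 0.0)
W_obs = W_true + c * K                    # W inflated by a constant

x = rng.uniform(-1, 1, n)
y = alpha + beta * x + theta * (W_true @ x) + rng.standard_normal(n)

# C x is the cross-sectional total of x minus x_i, scaled by c.
Cx = (c * K) @ x
corr_Cx_x = np.corrcoef(Cx, x)[0, 1]

ones = np.ones(n)
coef_true = ols(np.column_stack([ones, x, W_true @ x]), y)
coef_obs = ols(np.column_stack([ones, x, W_obs @ x]), y)

print("corr(Cx, x):", round(corr_Cx_x, 6))
print("beta-hat with true W:    ", round(coef_true[1], 3))
print("beta-hat with inflated W:", round(coef_obs[1], 3))
```

Because the regressors built from the inflated matrix span the same column space as those built from the true matrix, the estimated spatial-lag coefficient is unchanged in this sketch, while the coefficient on $\boldsymbol{x}$ moves by exactly $c\hat{\theta}$ and the intercept absorbs the remaining constant.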
2.2 A Test and Control for the Misspecification of $\mathbf{W}$
The misspecification of $\boldsymbol{W}$ introduces additional endogeneity that is not a problem in simpler regression contexts and will cause inconsistent estimates. This additional endogeneity, however, is also an opportunity to develop a specification test of $\boldsymbol{W}$ by attempting to model that endogeneity. For the SLX data-generating process in equation (7), the model is equivalent to:
$$\boldsymbol{y}=\boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}^{\ast }\boldsymbol{x}^{\ast }\theta +\boldsymbol{K}\boldsymbol{x}^{\ast }0+\boldsymbol{\varepsilon},$$
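in which $\boldsymbol{K}$ is a weights matrix where every nondiagonal element equals one. Because the new term $\boldsymbol{K}\boldsymbol{x}^{\ast }$ has a coefficient of zero, it does not influence $\boldsymbol{y}$, and the model is thus equivalent to the original. But if a scholar only had $\boldsymbol{W}$ instead of $\boldsymbol{W}^{\ast }$ and modeled this relationship, one would get the following result, since $\boldsymbol{C}=c\boldsymbol{K}$:
$$\begin{aligned}\boldsymbol{y} &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +(\boldsymbol{W}-\boldsymbol{U}-\boldsymbol{C})\boldsymbol{x}^{\ast }\theta +\boldsymbol{K}\boldsymbol{x}^{\ast }0+\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}\boldsymbol{x}^{\ast }\theta -\boldsymbol{U}\boldsymbol{x}^{\ast }\theta -c\boldsymbol{K}\boldsymbol{x}^{\ast }\theta +\boldsymbol{K}\boldsymbol{x}^{\ast }0+\boldsymbol{\varepsilon} \\ &= \boldsymbol{\iota}\alpha +\boldsymbol{x}^{\ast }\beta +\boldsymbol{W}\boldsymbol{x}^{\ast }\theta +\boldsymbol{K}\boldsymbol{x}^{\ast }(0-c\theta )+(\boldsymbol{\varepsilon}-\boldsymbol{U}\boldsymbol{x}^{\ast }\theta ).\end{aligned} \tag{16}$$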
A substantially changed result appears in equation (16). As before, there will still be bias resulting from mean-zero measurement error in $\boldsymbol{W}$. Now, however, the additional bias resulting from $-\boldsymbol{C}\boldsymbol{x}^{\ast }\theta$ is transferred to the coefficient of $\boldsymbol{K}\boldsymbol{x}^{\ast }$ rather than staying in the error term and affecting inferences in the rest of the model. Thus, scholars concerned about the misspecification of $\boldsymbol{W}$ should estimate a full model:
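The displayed full model is not reproduced in this text. One plausible way to write it, using an assumed coefficient symbol $\delta$ for the K test term (the symbol is an illustration and not necessarily the original notation), is:

```latex
\boldsymbol{y} = \boldsymbol{\iota}\alpha
  + \boldsymbol{x}^{\ast}\beta
  + \boldsymbol{W}\boldsymbol{x}^{\ast}\theta
  + \boldsymbol{K}\boldsymbol{x}^{\ast}\delta
  + \boldsymbol{\varepsilon},
\qquad H_{0}\colon\ \delta = 0 .
```

Under the derivation above, $\delta$ would equal $-c\theta$ when the nondiagonal elements of $\boldsymbol{W}$ are inflated by $c$ in expectation, and zero when they are not.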
$\boldsymbol{K}\boldsymbol{x}^{\ast }$, hereafter referred to as $\boldsymbol{K}\boldsymbol{x}$ or the K test, can serve as a specification test of $\boldsymbol{W}$ within the regression model. If one assumes that $\boldsymbol{K}\boldsymbol{x}$ (or its counterparts $\boldsymbol{K}\boldsymbol{y}$ and $\boldsymbol{K}\boldsymbol{\varepsilon}$ in other models) is not a part of the data-generating process, then the K test should be zero in expectation and distinguishable from zero only by chance.[Footnote 1] If the elements of $\boldsymbol{W}$ have been misspecified by a constant in expectation, then the K test should be nonzero and distinguishable from zero at rates greater than expected by chance. Thus, a K test that is statistically distinguishable from zero is evidence that $\boldsymbol{W}$ is misspecified and that the scholar has failed to adequately model spatial relationships within the data.[Footnote 2]
Beyond functioning as a specification test, the K test also controls for a part of the endogeneity resulting from the misspecification of $\boldsymbol{W}$. By absorbing the effect of $-\boldsymbol{C}\boldsymbol{x}^{\ast }\theta$, the only endogeneity remaining from the misspecification of $\boldsymbol{W}$ is $-\boldsymbol{U}\boldsymbol{x}^{\ast }\theta$. This is similar to how the intercept controls for part of the endogeneity resulting from measurement error in an additive variable. While the leftover endogeneity will result in bias in the rest of the regression model, it now makes spatial regression models comparable to other regression models in that nonzero measurement error is modeled explicitly in the regression.[Footnote 3]
While the K test can be used outside of the SLX case, it is unclear how the K test will perform in the SAR and SEM cases. It is highly possible that the properties of the K test are different given their nonrecursive nature. I do not have an analytic proof on the point, and the Monte Carlo results presented later provide evidence that the additional complexities associated with these models need to be taken into account. Future research should further examine the K test’s properties in these settings.
2.3 Limits and Opportunities in Cross-Sectional and Panel Data Settings
While the K test is theoretically appealing, it suffers from crippling limitations in cross-sectional settings. In a SLX model of a single cross-section, the sum of $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{x}$ results in a constant that will be perfectly multicollinear with the intercept. The researcher is left with an untenable choice between forgoing the test and omitting a variable, thereby causing bias. The problem is not limited to the SLX case; the use of the K test alongside a constant in the SAR case results in inconsistent estimates (Kelejian and Prucha 2002).[Footnote 4]
This limitation is mitigated when using panel data (Franzese and Hays 2007). If $\boldsymbol{x}$ varies over both time and space and $\boldsymbol{W}\boldsymbol{x}$ is a purely spatial lag,[Footnote 5] then the sum of $\boldsymbol{x}$ and $\boldsymbol{K}\boldsymbol{x}$ results in a variable that is constant across units within a period but varies over time. This allows for the estimation of $\boldsymbol{x}$, $\boldsymbol{K}\boldsymbol{x}$, and the constant in the same regression model. The same is true in the SAR setting (Kelejian, Prucha and Yuzefovich 2006).
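The contrast between the two settings can be checked with a rank calculation. The sketch below uses assumed dimensions (50 units, 10 periods) and a uniformly distributed $\boldsymbol{x}$ purely for illustration. It builds the design matrix containing the constant, $\boldsymbol{x}$, and $\boldsymbol{K}\boldsymbol{x}$ for a single cross-section and for a panel in which the spatial lag is taken within each period.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 50, 10               # illustrative dimensions
K = np.ones((n_units, n_units)) - np.eye(n_units)

# Single cross-section: x + Kx equals sum(x) for every unit.
x = rng.uniform(-1, 1, n_units)
X_cs = np.column_stack([np.ones(n_units), x, K @ x])
rank_cs = np.linalg.matrix_rank(X_cs)

# Panel: each row of x_panel is one period; the lag is purely spatial.
x_panel = rng.uniform(-1, 1, (n_periods, n_units))
Kx_panel = x_panel @ K.T                  # row t holds K @ x_t
X_panel = np.column_stack([
    np.ones(n_units * n_periods),
    x_panel.ravel(),
    Kx_panel.ravel(),
])
rank_panel = np.linalg.matrix_rank(X_panel)

print("cross-section rank:", rank_cs)     # three columns, deficient rank
print("panel rank:", rank_panel)          # three columns, full rank
```

In the cross-section the three columns are linearly dependent, so the constant, $\boldsymbol{x}$, and $\boldsymbol{K}\boldsymbol{x}$ cannot all be estimated. In the panel the period totals differ from one period to the next, which breaks the dependence.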
But the use of the test in these instances also has unique considerations associated with it. When a spatial lag is taken of $\boldsymbol{x}$, then the sum of $\boldsymbol{x}$ and $\boldsymbol{K}\boldsymbol{x}$ results in a new variable, $\boldsymbol{x}_{total}$, that only varies over time. This makes the test potentially sensitive to temporal unit effects, both those that are uncorrelated and those that are correlated with $\boldsymbol{x}_{total}$. I offer suggestions for each case in turn. If there are temporal unit effects that are uncorrelated with $\boldsymbol{x}_{total}$ but are not accounted for in the model, the K test may report false positives. Fortunately, it is relatively easy to model these effects using random effects.
More problematic is the possibility of temporal unit effects correlated with $\boldsymbol{x}_{total}$. A scholar could normally combat this problem using fixed effects. The lack of spatial variation in $\boldsymbol{x}_{total}$, however, prevents the inclusion of temporal fixed effects in SLX models. SAR models have similar problems: estimates are inconsistent with the inclusion of fixed effects and the K test (Kelejian, Prucha and Yuzefovich 2006). At present, there is no readily available solution to employ both techniques. Scholars must assume an otherwise well-specified regression model, at least on the temporal dimension, in order to make inferences about whether the spatial relationships specified in the data are valid. Alternatively, temporal fixed effects should control for the endogeneity resulting from $\boldsymbol{W}$ being inflated by a constant (though it would be impossible to infer whether $\boldsymbol{W}$ is misspecified).
3 Simulation
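To demonstrate the K test’s utility, I employ two sets of Monte Carlo simulations: one featuring cross-sectional data and the other panel data.[Footnote 6] One thousand data points are generated that follow either a SAR data-generating process or a SLX data-generating process:
$$\boldsymbol{y}=\rho \boldsymbol{W}\boldsymbol{y}+\boldsymbol{x}\beta +\boldsymbol{\varepsilon},$$
$$\boldsymbol{y}=\boldsymbol{x}\beta +\boldsymbol{W}\boldsymbol{x}\theta +\boldsymbol{\varepsilon},$$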
where $\beta =5$ and $\rho =\theta =0.75$. $\boldsymbol{\varepsilon}$ is drawn from a standard normal distribution, while $\boldsymbol{x}$ is generated from a uniform distribution between $-1$ and 1. $\boldsymbol{W}$ is randomly generated. As per the above discussion, the intercept is omitted in cross-sectional data. In panel data, an intercept $\boldsymbol{\iota}\alpha$ is included with $\alpha =10$.[Footnote 7]
A spatial regression model is then estimated on the data that also employs my test of the specification of $\boldsymbol{W}$, which includes either $\boldsymbol{K}\boldsymbol{y}$ or $\boldsymbol{K}\boldsymbol{x}$ depending on the regression model. OLS is used to estimate the SLX models, while 2SLS is used to estimate the SAR models using $\boldsymbol{W}\boldsymbol{x}$ and $\boldsymbol{K}\boldsymbol{x}$ as instruments. This process is repeated one thousand times. In the analysis, I see how the test fares when a model is estimated using the correct $\boldsymbol{W}$ matrix inflated by a constant. When the matrix is inflated by zero, this reduces to a model with a correctly specified $\boldsymbol{W}$. I subsequently report the average bias for all coefficient estimates and their coverage probabilities using 95% confidence intervals. These quantities are reported for two reasons. First, these quantities demonstrate both the size and power of the K test’s ability to detect misspecifications of $\boldsymbol{W}$. Second, these quantities show the K test’s ability to control for the bias caused by the misspecification of $\boldsymbol{W}$ when estimating a model. If the K test is a valid way of controlling for the misspecification of $\boldsymbol{W}$, then the bias of all other coefficients should be zero and their coverage probabilities should be 95%. This should be true regardless of whether $\boldsymbol{W}$ is correctly or incorrectly specified.
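The panel SLX part of this design can be sketched as follows. This is not the replication code; it is a reduced illustration under stated assumptions: 20 units observed over 25 periods (500 rather than 1,000 observations), 200 rather than 1,000 replications, a true $\boldsymbol{W}$ with uniform off-diagonal entries, classical OLS standard errors, and a normal critical value of 1.96. The function name `k_test_monte_carlo` and all argument names are hypothetical.

```python
import numpy as np

def k_test_monte_carlo(c, reps=200, n_units=20, n_periods=25, seed=0):
    """Panel SLX Monte Carlo with W inflated by the constant c.

    Returns the mean OLS estimates (alpha, beta, theta, K coefficient)
    and the share of replications in which the K coefficient is
    distinguishable from zero at the 5% level.
    """
    rng = np.random.default_rng(seed)
    alpha, beta, theta = 10.0, 5.0, 0.75
    K = np.ones((n_units, n_units)) - np.eye(n_units)
    coefs, rejections = [], 0
    for _ in range(reps):
        W_true = rng.uniform(0, 1, (n_units, n_units))  # assumed form of W
        np.fill_diagonal(W_true, 0.0)
        W_obs = W_true + c * K                           # misspecified W
        x = rng.uniform(-1, 1, (n_periods, n_units))     # rows are periods
        eps = rng.standard_normal((n_periods, n_units))
        y = alpha + beta * x + theta * (x @ W_true.T) + eps
        X = np.column_stack([
            np.ones(x.size),
            x.ravel(),
            (x @ W_obs.T).ravel(),   # spatial lag built from the observed W
            (x @ K.T).ravel(),       # the K test regressor
        ])
        b = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
        resid = y.ravel() - X @ b
        sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        rejections += abs(b[3] / se[3]) > 1.96
        coefs.append(b)
    return np.mean(coefs, axis=0), rejections / reps

mean_ok, size = k_test_monte_carlo(c=0.0)
mean_bad, power = k_test_monte_carlo(c=0.5)
print("c = 0.0:", np.round(mean_ok, 3), "rejection rate", size)
print("c = 0.5:", np.round(mean_bad, 3), "rejection rate", power)
```

Under these assumptions, the rejection rate with $c=0$ approximates the size of the test and the rate with $c=0.5$ approximates its power, while the mean estimates of $\beta$ and $\theta$ indicate whether including $\boldsymbol{K}\boldsymbol{x}$ keeps them centered on their true values. The SAR case, which requires 2SLS, is not covered by this sketch.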
Tables 1 and 2 present the results for misspecifying a $\boldsymbol{W}$ matrix by a constant for SLX and SAR processes, respectively. As we can see in Table 1, the K test works as a specification test in the SLX context. When $\boldsymbol{W}$ is correctly specified, the test’s coefficient is centered on zero and correctly sized (in that it is only statistically distinguishable from zero about five percent of the time). But when $\boldsymbol{W}$ is inflated by a constant, either positive or negative, the test is consistently biased in a single direction and almost always statistically distinguishable from zero. The K test, then, has the expected size and power in the SLX context.
The K test also works as a control for endogeneity in the SLX context. Across all model specifications, $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\theta}$ are centered on their true values and their confidence intervals regularly include them. In the instances in which $\boldsymbol{W}$ is misspecified, these estimates would normally be biased. The only reason they are not is the inclusion of the K test in the model, which demonstrates its utility as a control for this form of endogeneity.
In Table 2, the K test’s utility as a specification test in the SAR case is strong in applied settings. Using both cross-sectional and panel data, the K test has the same size in the SAR context as it does in the SLX context: it is centered around zero when $\boldsymbol{W}$ is correctly specified. The test has slightly lower power than its SLX counterpart when using panel data. When using it with cross-sectional data, however, the test has little power. When $\boldsymbol{W}$ is inflated by a positive constant, the test appears to have no power at all. While concerning on the surface, this limitation will have little relevance in applied work given that the K test cannot be consistently estimated in a single cross-section.
The K test’s utility as a control in the SAR case is not as strong as it is in the SLX case. In both the cross-sectional and panel case, $\hat{\beta}$ is not biased when the K test is estimated, even if $\boldsymbol{W}$ is misspecified. In contrast, though, $\hat{\rho}$ is biased when $\boldsymbol{W}$ is misspecified even when the K test is included. This result is likely due to the nonrecursive nature of SAR models compared to SLX models. Interestingly, the coefficient of the K test is biased in the exact opposite direction when $\boldsymbol{W}$ is misspecified. Additional analysis in the replication materials reveals that a linear combination of $\hat{\rho}$ and the K test’s coefficient is equal to $\rho$ and has the correct coverage probability of 95%. This curious result does not have a readily available explanation. Regardless, additional consideration needs to be given to the SAR case given its complexity.
4 Discussion
This letter presents the uncertainty surrounding the specification of $\boldsymbol{W}$ as a measurement error problem. The approach clearly demonstrates the bias that results when $\boldsymbol{W}$ is misspecified in regression models and reveals a theoretically appealing test and control for the misspecification of $\boldsymbol{W}$ by a constant, which I call the K test. But while the K test has its uses, it is not a silver bullet for measurement error in $\boldsymbol{W}$. The test’s utility as a control in the SAR case is limited. The test also does not identify or control for mean-zero measurement error. Finally, this letter does not consider the possibility that measurement error in $\boldsymbol{W}$ is correlated with measurement error in other variables in the model. Future research should consider these possibilities.
Scholars must exercise care when using the K test to adjudicate between multiple $\boldsymbol{W}$ matrices. If the test is used multiple times, the results must be appropriately weighted in order to avoid spurious inferences that result from repeated hypothesis testing. Alternatively, this problem can be avoided by using the test in conjunction with other methods of adjudicating between multiple $\boldsymbol{W}$ matrices. This would result in a two-step procedure: identify the best choice of $\boldsymbol{W}$ among a set of alternatives and then test the validity of that choice.
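The text does not name a particular weighting scheme. As one possible illustration, the sketch below applies the Holm step-down adjustment to the p-values of K tests run on several candidate matrices. The choice of Holm's method, the function `holm_adjust`, and the example p-values are all assumptions made for illustration, not results or recommendations from the analysis.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), cap at 1,
        # and enforce monotonicity across the ordered tests.
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted

# Hypothetical K test p-values for three candidate W matrices.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.3f}, adjusted p = {q:.3f}")
```

A candidate matrix would then be flagged as misspecified only if its adjusted p-value falls below the chosen significance level.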
Appendix
This letter focuses on the ability of the K test to control for measurement error resulting from the nondiagonal elements of $\boldsymbol{W}$ being inflated by a constant. It implicitly assumes that the K test will not be able to account for mean-zero measurement error. But as one reviewer noted, the K test, or $\boldsymbol{K}\boldsymbol{x}$, may in fact be correlated with mean-zero measurement error in $\boldsymbol{W}$, $\boldsymbol{U}\boldsymbol{x}$. If true, then the K test would be able to detect and possibly control for mean-zero measurement error. The implications of this correlation are too important to ignore, and for that reason I use Monte Carlo analysis to investigate whether these two quantities of interest are correlated.
I randomly generate one thousand observations of a variable $\boldsymbol{x}$, $\boldsymbol{x}\sim U(-a+c,a+c)$. I use this variable to create two more variables, $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$. I create $\boldsymbol{K}\boldsymbol{x}$ by premultiplying $\boldsymbol{x}$ by $\boldsymbol{K}$, a weights matrix where all the nondiagonal elements are equal to 1. I create $\boldsymbol{U}\boldsymbol{x}$ by premultiplying $\boldsymbol{x}$ by $\boldsymbol{U}$, a weights matrix where all the nondiagonal elements are randomly drawn from a variable $\boldsymbol{u}$, $\boldsymbol{u}\sim U(-b,b)$. Parameters $a$ and $b$ are identically distributed, $U(0,\,1{,}000)$, while parameter $c$ is uniformly distributed with mean zero, $U(-1{,}000,\,1{,}000)$. After both $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$ are generated, I calculate and record their correlation and its 95% confidence interval. I repeat this process one thousand times, with the underlying distributions of $\boldsymbol{x}$ and $\boldsymbol{u}$ varying as well as the set of observations drawn.
A density plot of these one thousand correlations is presented in Figure 1. The density appears to be normally distributed with a mean of zero and low variance. This is exactly what one would expect if $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$ were uncorrelated. In addition, I investigated what proportion of these correlations were statistically distinguishable from zero. The proportion is 0.06, close to the proportion of statistically significant results one would expect when repeatedly calculating a 95% confidence interval on truly uncorrelated data.
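The procedure described in the two preceding paragraphs can be sketched as follows. This is a reduced illustration, not the original code: it uses 200 observations and 300 replications instead of one thousand of each to keep the run short, and it judges significance with a Fisher z approximation, which is an assumption since the text does not say how the confidence intervals were computed. The function name `kx_ux_correlations` is hypothetical.

```python
import numpy as np

def kx_ux_correlations(reps=300, n=200, seed=0):
    """Simulate corr(Kx, Ux) under randomly drawn distributions for x and u."""
    rng = np.random.default_rng(seed)
    corrs = np.empty(reps)
    significant = 0
    for r in range(reps):
        a, b = rng.uniform(0, 1000, size=2)
        c = rng.uniform(-1000, 1000)
        x = rng.uniform(-a + c, a + c, size=n)
        U = rng.uniform(-b, b, size=(n, n))   # mean-zero error matrix
        np.fill_diagonal(U, 0.0)
        Kx = x.sum() - x                      # identical to K @ x
        Ux = U @ x
        corrs[r] = np.corrcoef(Kx, Ux)[0, 1]
        # Fisher z statistic for H0: correlation equals zero (assumed test).
        z = np.arctanh(corrs[r]) * np.sqrt(n - 3)
        significant += abs(z) > 1.96
    return corrs, significant / reps

corrs, share_significant = kx_ux_correlations()
print("mean correlation:", round(float(corrs.mean()), 4))
print("share distinguishable from zero:", share_significant)
```

If $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$ are uncorrelated, the mean correlation should sit near zero and the share of significant correlations should sit near 0.05, subject to simulation noise.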
There is a possibility that the elements of $\boldsymbol{U}$ could be correlated with measurement error in $\boldsymbol{x}$, should $\boldsymbol{x}$ be measured with error. In that case, $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$ may be correlated. Similarly, if the elements of $\boldsymbol{U}$ can be predicted by $\boldsymbol{x}$, then there is a possibility of correlation. But if one assumes that the error in $\boldsymbol{W}$ is orthogonal to everything else in the model, as I do throughout, then $\boldsymbol{K}\boldsymbol{x}$ and $\boldsymbol{U}\boldsymbol{x}$ will not be correlated.