This final chapter summarizes most of the material in this book in a range of concluding statements. It brings together the lessons learned, which can be viewed as guidelines for research practice.
We start with the recommendation to try to replicate earlier studies. Contact the authors and ask for the data, or retrieve the data from a publicly available database, and then use the same methods of analysis and the same estimation methods. You may also want to collect experimental data anew and apply the same methods. This is now often done in disciplines such as social psychology and economics (Footnote 1). You, as a researcher, should also allow others to replicate your studies.
Detected cases of scientific misconduct sometimes show that people “overdo it” when they misbehave. Knowledge of statistical methods and techniques is useful for detecting such misconduct.
When Certain Datapoints Are Influential
Anyone can make mistakes, and mistakes must be admitted and resolved. We have learned that influential observations can happen, do happen, and happen quite frequently. Influential observations are not always visually obvious. In a two-variable model they can be spotted in a scatter plot, but when the number of regressors increases, graphs will not help much. Diagnostic tests are needed, and these are easy to compute and easy to interpret.
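As a minimal sketch of such a diagnostic, the following Python code computes Cook's distance for every observation of an OLS regression; the statsmodels package, the simulated data, and the 4/n threshold are illustrative assumptions rather than prescriptions.

```python
# A minimal sketch (simulated data): computing influence diagnostics
# for an OLS regression with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                      # three regressors
y = 1.0 + X @ np.array([0.5, -0.3, 0.8]) + rng.normal(size=n)
y[10] += 8.0                                     # plant one aberrant observation

model = sm.OLS(y, sm.add_constant(X)).fit()
influence = model.get_influence()

cooks_d, _ = influence.cooks_distance            # Cook's distance per observation
threshold = 4.0 / n                              # common rule of thumb
flagged = np.where(cooks_d > threshold)[0]
print("Potentially influential observations:", flagged)
```

Observations with a Cook's distance well above such a threshold deserve a closer look before any decision about replacing or deleting them is made.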
It is best to analyze influential observations on the basis of model estimation results. This implies that you should not remove data prior to model building.
Once you have found such influential observations, make explicit how you dealt with them. You can replace the observations with newly computed values in your analysis (not in the raw database), or you can delete them.
When There Are More Models to Choose From
Trying to end up with one single final model has all kinds of drawbacks. You need various diagnostic tests, and sometimes these tests do not have much power (such as tests for unit roots). Moreover, different sequential strategies to arrive at a single final model can lead to quite different final models. Data mining has its downsides. Multiple testing has consequences for the reliability of the final-round t values and p values.
Hence, it seems better to maintain various models, both for inference and for forecasting. We could, for example, estimate the average loss in airline revenues owing to 9/11 based on forecasts from three distinct models. Final forecasts can also be based on a combination of forecasts from several models.
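A minimal sketch of forecast combination could look as follows; the three candidate models and the artificial series are hypothetical choices, and equal weights are only one of many possible weighting schemes.

```python
# A minimal sketch (artificial data, hypothetical models): combining
# forecasts from several models instead of selecting a single one.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(0.2, 1.0, size=120))    # artificial trending series
h = 12                                           # forecast horizon

# Model 1: ARIMA(1,1,0)
f1 = ARIMA(y, order=(1, 1, 0)).fit().forecast(h)

# Model 2: linear trend fitted by OLS
t = np.arange(len(y))
trend_fit = sm.OLS(y, sm.add_constant(t)).fit()
f2 = trend_fit.predict(sm.add_constant(np.arange(len(y), len(y) + h)))

# Model 3: random walk (no-change) forecast
f3 = np.repeat(y[-1], h)

combined = (f1 + f2 + f3) / 3.0                  # equal-weight combination
print(combined)
```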
On Data Collection
Measurement errors happen, and like other features of the data, such as simultaneity, such errors can require estimation methods other than simple OLS. This does not have to be dramatic, but if data collection makes modeling more complicated, one may want to spend more effort on collecting better data. A plea for better research designs is well made (Footnote 2), and laying out exactly how the data were collected is important.
When Data Are Missing
Data can be missing, and there are several reasons for that. We should be aware that interpolation has consequences, and we should record how any interpolation was carried out.
We should report how missing data were handled, whether by using an adapted estimation method or by constructing another model after aggregation.
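A minimal sketch of such record-keeping, assuming a short monthly series with gaps and simple linear interpolation in pandas, is the following; both the data and the interpolation method are illustrative only.

```python
# A minimal sketch (artificial data): interpolating missing values with
# pandas while recording exactly which observations were filled in.
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=12, freq="MS")
y = pd.Series([1.0, 1.2, np.nan, 1.5, np.nan, np.nan,
               2.0, 2.1, 2.3, np.nan, 2.6, 2.8], index=idx)

was_missing = y.isna()                       # keep a record of the gaps
y_filled = y.interpolate(method="linear")

report = pd.DataFrame({"value": y_filled, "interpolated": was_missing})
print(report)
```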
Note also that missing data can occur on purpose, to facilitate data collection and data analysis. There is no need to collect a few observations and contrast these with millions of observations. Furthermore, models for temporally aggregated data can still be informative about parameters in a higher-frequency process. The consequence of aggregation is that models and estimation methods can (and must) be adapted.
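A small simulation sketch can make this point concrete: if a monthly AR(1) process with coefficient rho is observed only every third month (a stock variable sampled quarterly), the sampled series is again AR(1) with coefficient rho cubed, so the monthly coefficient can be backed out from the aggregated data. The numbers below are purely illustrative.

```python
# A small simulation sketch (artificial data): a monthly AR(1) process
# observed only every third month. The quarterly series is again AR(1)
# with coefficient rho**3, so the monthly coefficient is recoverable.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(2)
rho = 0.8
n = 3000
y = np.zeros(n)
for t in range(1, n):
    y[t] = rho * y[t - 1] + rng.normal()

quarterly = y[::3]                               # keep every third observation
fit = AutoReg(quarterly, lags=1, trend="n").fit()
rho_q = fit.params[0]
print("implied monthly coefficient:", rho_q ** (1 / 3))   # close to 0.8
```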
When Spurious Results May Appear
Spurious relations can easily appear. Better models, and in particular properly taking care of autocorrelation, can prevent spurious results. We have seen that cohort studies are particularly prone to spurious findings when autocorrelation is neglected.
For time series data, low Durbin-Watson statistics combined with exceptionally large t values are signs of spuriousness.
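The classic illustration is a regression of two independent random walks on each other; the simulation sketch below (with an arbitrary sample size and seed) typically produces exactly this pattern.

```python
# A simulation sketch (artificial data): regressing two independent
# random walks on each other typically yields a large t value and a
# Durbin-Watson statistic far below 2, the classic signs of a spurious
# relation.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 500
x = np.cumsum(rng.normal(size=n))   # random walk, unrelated to y
y = np.cumsum(rng.normal(size=n))   # another, independent random walk

fit = sm.OLS(y, sm.add_constant(x)).fit()
print("t value on x: ", fit.tvalues[1])
print("Durbin-Watson:", durbin_watson(fit.resid))
```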
Moreover, with increasing numbers of regressors, spurious results can also appear (owing to data mining). This is the downside of including many regressors on the right-hand side of a model to explain or predict a dependent variable.
When Data Tell Different Stories
We have seen for key macroeconomic variables, such as inflation and unemployment, that outliers, different regimes, and ARCH can capture similar data features. This implies that we must be specific about what the target of our research is. Do we want to describe different regimes? Do we want to make outliers less influential in our forecasting model? Do we want to allow for time-varying prediction intervals? Or do we want to propose a model that most accurately describes the data?
We have learned that ignoring level shifts can falsely suggest the presence of unit roots. On the other hand, imposing various unit roots makes forecasts robust to changes in trends. And we have learned that ignoring additive outliers points us in the wrong direction in subsequent modeling.
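A small simulation sketch of the first point, assuming a stationary AR(1) series with one level shift halfway through the sample, shows how an augmented Dickey-Fuller test can then fail to reject a unit root; all numbers are illustrative.

```python
# A small simulation sketch (artificial data): a stationary AR(1) series
# with a single level shift often looks like a unit-root process to an
# augmented Dickey-Fuller test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
n = 200
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + rng.normal()
y[n // 2:] += 10.0                  # level shift halfway through the sample

stat, pvalue, *_ = adfuller(y)
print("ADF p-value:", pvalue)       # often well above 0.05 despite stationarity
```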
Honesty About Forecasts
A key message when it comes to out-of-sample forecasting is that we should report prediction intervals. This allows us and our advisees to understand the limits to predictability. It is best to report forecasts from models with and without imposing unit roots. And, again, it is commendable to work with combined forecasts.
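A minimal sketch of such reporting, assuming an artificial series and a simple ARIMA(1,1,0) model, prints 95% prediction intervals next to the point forecasts; the model and data are illustrative only.

```python
# A minimal sketch (artificial data): reporting prediction intervals
# alongside point forecasts rather than point forecasts alone.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(0.1, 1.0, size=150))   # artificial series

fit = ARIMA(y, order=(1, 1, 0)).fit()
forecast = fit.get_forecast(steps=8)

point = forecast.predicted_mean
interval = forecast.conf_int(alpha=0.05)        # 95% prediction intervals
for h in range(8):
    print(f"h={h + 1}: {point[h]:.2f}  [{interval[h, 0]:.2f}, {interval[h, 1]:.2f}]")
```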
We could show forecasts for a range of samples, where each sample starts at a different moment in time. We should also present forecasts from models with trends and from models with breaks, as we have seen that their longer-term point forecasts can be quite different.
In sum, we should be honest about the limits to predictability.
When There Are More Forecasts
Combined forecasts are useful and often successful. It is not the extent to which experts agree in their judgments, but the way in which they disagree, that can make the average forecast work well.
When There Are (Too) Many Data
For Big Data it holds even more strongly that we should document all steps taken and all choices made. We should not take available software or the outcomes of machine learning methods for granted. With available data, we can always run simulations to check how these tools work. To examine what certain methods do, prior to incorporating them in an analysis, we can create artificial data from a known data generating process and then check whether the method recovers that process.
We should not blindly adopt default options in programs, but should rather try variants. If computations depend on a random seed, then we should run the computations many times and take an average of the outcomes. Without the seed, we cannot replicate outcomes exactly. If we run, say, 100 trials and report the average, then someone else who does the same will obtain results close to ours.
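A minimal sketch of this practice, assuming a hypothetical seed-dependent computation (here a bootstrap standard error of a sample mean), runs the computation over 100 seeds and reports the average.

```python
# A minimal sketch (artificial data, hypothetical computation): when an
# outcome depends on a random seed, run it over many seeds and report
# the average so that others can come close to the same result.
import numpy as np

data = np.random.default_rng(6).normal(loc=2.0, scale=3.0, size=200)

def bootstrap_se_of_mean(seed: int, n_boot: int = 500) -> float:
    """Seed-dependent bootstrap standard error of the sample mean."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(data, size=data.size, replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))

results = [bootstrap_se_of_mean(seed) for seed in range(100)]
print("average over 100 seeds:", np.mean(results))
```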
When It Becomes (Too) Much for Humans
We have learned that algorithms can create misleading information, and that they can lead to destructive outcomes. We should not forget, though, that humans program those algorithms and that humans feed information to them. So yes, algorithms discriminate, for example by allocating people into clusters, but the variables that are used to cluster are delivered by humans.
Finally
When we give advice as econometricians, we must report all the choices that we have made (wherever possible). These choices concern the data, the sample, the method, and more.
In fact, we should quite literally report everything. And this constitutes ethics in econometrics.
Epilogue: This Is What You Do Not Want!
This article retracts the following:
Adriano A. Rampini, S. Viswanathan, Guillaume Vuillemey, “Risk Management in Financial Institutions”
First published: December 12, 2019
The authors hereby retract the above article, published in print in the April 2020 issue of The Journal of Finance. A replication study finds that the replication code provided in the supplementary information section of the article does not reproduce some of the central findings reported in the article. Upon reexamination of the work, the authors confirmed that the replication code does not fully reproduce the published results and were unable to provide revised code that does. Therefore, the authors conclude that the published results are not reliable and that the responsible course of action is to retract the article and return the Brattle Group Distinguished Paper Prize that the article received. The authors deeply regret the damage this caused to the journal and the scholarly community. The specific contributions of the authors to the article were as follows: the first and second author provided the theoretical hypothesis; all three authors jointly designed the empirical approach and identification strategy; the third author constructed and handled the data, implemented the empirical analysis, and provided the empirical results as well as the replication data and code. The third author states that the original data and code that produced the published results were lost. The first and second author were not notified of the loss of the original data and code at the time it occurred and had no prior knowledge of the issues with the replication data and code provided to the journal (Footnote 3).