This chapter deals with features of data that suggest a certain model or method, where that suggestion is erroneous. We highlight a few cases in which an econometrician could be led in the wrong direction, and at the same time we show how this can be prevented. These situations arise when there is no strong prior information on how the model should be specified, so the data are used to guide model construction, and this guidance can point in an inappropriate direction. We review a few empirical cases where some data features obscure a potentially proper view of the data and may suggest inappropriate models. We discuss spurious cycles and the impact of additive outliers on detecting ARCH and nonlinearity. We also focus on time series that may exhibit recessions and expansions, which make it tempting to (wrongly) interpret the recession observations as outliers. Finally, we deal with structural breaks, trends, and unit roots, and see how data with these features can look alike.
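To make the outlier point concrete, here is a minimal simulation sketch (our illustration, not an example from the chapter): a single additive outlier in an otherwise i.i.d. series can push Engle's ARCH LM test, available as het_arch in statsmodels, toward a spurious rejection of the no-ARCH null.

```python
# Sketch: a single additive outlier can make an i.i.d. series "look like" ARCH.
# Assumes statsmodels is available; het_arch is Engle's LM test.
import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(0)
y = rng.normal(size=500)            # i.i.d. noise: no ARCH by construction
print(f"clean series p-value:  {het_arch(y, nlags=4)[1]:.3f}")

y[250] += 10.0                      # one additive outlier
print(f"with outlier p-value:  {het_arch(y, nlags=4)[1]:.3f}")  # often far smaller
```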
This last chapter distills most of the material in this book into a range of concluding statements. It summarizes the lessons learned, which can be viewed as guidelines for research practice.
We first discuss a phenomenon called data mining. This can involve running many tests to decide which variables or correlations are relevant. If used improperly, data mining can shade into scientific misconduct. Next, we discuss one way to arrive at a single final model, involving stepwise methods. We see that various stepwise methods lead to different final models. Next, we see that various configurations in test situations, here illustrated for testing for cointegration, lead to different outcomes. It may be possible to see which configurations make the most sense and can be used for empirical analysis. However, we suggest that it is better to keep various models and somehow combine inferences. This is illustrated by an analysis of the losses in airline revenues in the United States owing to 9/11. We see that, of four different models, three estimate a similar loss, while the fourth suggests only 10 percent of that figure. We argue that it is better to maintain various models, that is, models that withstand various diagnostic tests, for inference and for forecasting, and to combine what can be learned from them.
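To illustrate why stepwise routes can disagree, here is a small sketch (hypothetical data, not the book's application): with nearly collinear regressors, forward and backward selection by AIC may end up with different final models.

```python
# Sketch: forward vs. backward stepwise selection by AIC on collinear data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)     # x3 is irrelevant
X = {"x1": x1, "x2": x2, "x3": x3}

def aic(cols):
    # AIC of an OLS fit on the chosen columns (constant-only if empty)
    Z = sm.add_constant(np.column_stack([X[c] for c in cols])) if cols else np.ones((n, 1))
    return sm.OLS(y, Z).fit().aic

def forward():
    chosen, rest = [], set(X)
    while rest:
        best = min(rest, key=lambda c: aic(chosen + [c]))
        if aic(chosen + [best]) >= aic(chosen):
            break                            # no further improvement
        chosen.append(best); rest.remove(best)
    return chosen

def backward():
    chosen = list(X)
    while chosen:
        worst = min(chosen, key=lambda c: aic([k for k in chosen if k != c]))
        if aic([k for k in chosen if k != worst]) >= aic(chosen):
            break                            # removal no longer helps
        chosen.remove(worst)
    return chosen

print("forward: ", sorted(forward()))
print("backward:", sorted(backward()))       # may differ under collinearity
```

Which route "wins" is not the point; the point is that the final model is an artefact of the search path, which is one reason to keep several models and combine inferences, as argued above.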
In practice it may happen that a first-try econometric model is not appropriate because it violates one or more of the key assumptions needed to obtain valid results. If there is something wrong with the variables, such as measurement error or strong collinearity, we may be better off modifying the estimation method or changing the model. In the present chapter we deal with endogeneity, which can, for example, be caused by measurement error, and which implies that one or more regressors are correlated with the unknown error term. This is of course not immediately visible, because the errors are not known beforehand and are estimated jointly with the unknown parameters. Endogeneity can thus arise when a regressor is measured with error and, as we see, when the data are aggregated at too low a frequency. Another issue is multicollinearity, which makes it difficult to disentangle (the statistical significance of) the separate effects of the regressors. This certainly holds for levels and squares of the same variable. Finally, we deal with the interpretation of model outcomes.
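A quick numerical sketch of the levels-and-squares point (our example, not the chapter's): for a strictly positive regressor, x and x squared are almost perfectly correlated, so their separate effects are hard to disentangle; centering x before squaring mitigates this.

```python
# Sketch: a regressor and its square are often strongly collinear,
# which inflates standard errors; centering before squaring helps.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 5.0, size=1000)   # strictly positive regressor

print(np.corrcoef(x, x**2)[0, 1])      # close to 1: hard to disentangle
xc = x - x.mean()
print(np.corrcoef(xc, xc**2)[0, 1])    # much smaller after centering
```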
This chapter opens with some quotes and insights on megaprojects. We turn to the construction and use of prediction intervals in a time series context. We see that, depending on the choice of the number of unit roots (stochastic trends) or the sample size (when does the sample start?), we can compute a wide range of prediction intervals. Next, we see that those trends, and breaks in levels and breaks in trend, can yield a wide variety of forecasts. Again, we reiterate that maintaining a variety of models and outcomes is useful, and that an equal-weighted combination of results can be most appropriate. Indeed, any specific choice leads to a different outcome. Finally, we discuss, for a simple first-order autoregression, how you can see what the limits to predictability are. We see that these limits are closer than we may think at the outset.
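A small worked illustration of that last point, using standard AR(1) algebra rather than anything chapter-specific: for a stationary AR(1), y_t = phi * y_{t-1} + eps_t, the share of the variance that is predictable h steps ahead is phi^(2h), so the h-step prediction interval approaches the unconditional one after only a few steps.

```python
# Sketch: predictable share of variance at horizon h for an AR(1) with phi = 0.8.
phi = 0.8
for h in (1, 2, 5, 10):
    print(h, round(phi ** (2 * h), 3))
# 1 0.64, 2 0.41, 5 0.107, 10 0.012 -> predictability dies out quickly
```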
This chapter uses a range of quotes and findings from the internet and the literature. The key premises of this chapter, each illustrated with examples, are as follows. First, Big Data requires the use of algorithms. Second, algorithms can create misleading information. Third, algorithms can lead to destructive outcomes. But we should not forget that humans program algorithms. With Big Data come algorithms to run many, often involved, computations. We cannot oversee all these data ourselves, so we need the help of algorithms to make computations for us. We might label these algorithms as Artificial Intelligence, but this might suggest that they can do things on their own. They can run massive computations, but they need to be fed with data. This feeding is usually done by us, by humans, and we also choose the algorithms to be used.
In practice we do not always have clear guidance from economic theory about specifying an econometric model. At one extreme, it may be said that we should “let the data speak.” It is good to know, when they “speak,” whether what they say makes sense. We must be aware of a particularly important phenomenon in empirical econometrics: the spurious relationship. If you encounter a spurious relationship but do not recognize it as such, you may wrongly use such a relationship for hypothesis testing or for creating forecasts. A spurious relationship appears when the model is not well specified. In this chapter, we see from a case study that people can draw strong but inappropriate conclusions if the econometric model is not well specified. We see that if you a priori hypothesize a structural break at a particular moment in time, and analyze the data based on that very assumption, then it is easy to draw inaccurate conclusions. As with influential observations, the lesson here is that one should first create an econometric model and, given that model, investigate whether there could have been a structural break.
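The classic demonstration, in the spirit of Granger and Newbold's simulations (our code, not the chapter's case study): regress one random walk on another, independent, random walk, and the t-statistic will routinely look highly "significant".

```python
# Sketch: spurious regression between two independent random walks.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = np.cumsum(rng.normal(size=n))   # random walk 1
y = np.cumsum(rng.normal(size=n))   # random walk 2, independent of x

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"t-stat on x: {fit.tvalues[1]:.1f}")  # often |t| >> 2 despite no relation
```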
This chapter deals with missing data and a few approaches to managing them. There are several reasons why data can be missing. For example, people can throw away older data, which can sometimes be sensible. It may also be the case that you want to analyze a phenomenon that occurs at an hourly level but only have data at the daily level; the hourly data are then missing. It may also be that a survey is simply too long, so people get tired and do not answer all questions. In this chapter we review various situations where data are missing and how we can recognize them. Sometimes we know how to manage the situation of missing data. Often there is no need to panic, and modifications of models and/or estimation methods can be used. We encounter a case in which data can be made missing on purpose, by selective sampling, to subsequently facilitate empirical analysis. Such analysis explicitly takes account of the missingness, and the impact of missing data can then become minor.
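One standard way such an analysis can account for deliberately created missingness (a textbook device we sketch here, not necessarily the chapter's own case) is outcome-based subsampling with logistic regression: keeping all "events" but only a fraction of "non-events" leaves the slope estimates consistent, and only the intercept needs a correction by the log of the sampling rate.

```python
# Sketch (hypothetical numbers): subsample non-events, then correct the intercept.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.0 * x)))      # true intercept -3, slope 1
yy = rng.binomial(1, p)

keep_rate = 0.05                                  # keep 5% of non-events only
keep = (yy == 1) | (rng.uniform(size=n) < keep_rate)
xs, ys = x[keep], yy[keep]

fit = sm.Logit(ys, sm.add_constant(xs)).fit(disp=0)
b0, b1 = fit.params
print(f"slope: {b1:.2f} (true 1.00)")             # unaffected by the sampling
print(f"intercept corrected: {b0 + np.log(keep_rate):.2f} (true -3.00)")
```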
Currently we may have access to large databases, sometimes referred to as Big Data, and for those large datasets simple econometric models will not do. When you have a million people in your database, as insurance firms, telephone providers, or charities may have, and you have collected information on these individuals for many years, you simply cannot summarize these data using a small econometric model with just a few regressors. In this chapter we address diverse options for handling Big Data. We kick off with a discussion of what Big Data is and why it is special. Next, we discuss a few options such as selective sampling, aggregation, nonlinear models, and variable reduction. Methods such as ridge regression, the lasso, the elastic net, and artificial neural networks are also addressed; these are nowadays described as machine learning methods. We see that with these methods the number of choices rapidly increases, and that reproducibility can suffer. The analysis of Big Data therefore comes at the cost of more analysis and of more choices to make and to report.
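A minimal sketch of the shrinkage methods named above, using scikit-learn (assumed available; the penalty weights are arbitrary illustrative choices):

```python
# Sketch: ridge keeps all coefficients (shrunk); lasso and elastic net
# set many exactly to zero, performing variable reduction.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.0            # only 5 regressors matter
y = X @ beta + rng.normal(size=n)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    nonzero = np.sum(np.abs(model.coef_) > 1e-8)
    print(type(model).__name__, "nonzero coefficients:", nonzero)
```

Note that each penalty weight is itself a choice to be made and reported, which is exactly how the number of choices, and the strain on reproducibility, grows.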
Originating from a unique partnership between data scientists (datavaluepeople) and peacebuilders (Build Up), this commentary explores an innovative methodology for overcoming key challenges in social media analysis by developing customized text classifiers through a participatory design approach that engages both peace practitioners and data scientists. It advocates for researchers to focus on developing frameworks that prioritize being usable and participatory in field settings, rather than perfect in simulation. Focusing on a case study investigating polarization within online Christian communities in the United States, we outline a testing process with a dataset consisting of 8,954 tweets and 10,034 Facebook posts to experiment with active learning methodologies aimed at enhancing the efficiency and accuracy of text classification. This commentary demonstrates that the inclusion of domain expertise from peace practitioners significantly refines the design and performance of text classifiers, enabling a deeper comprehension of digital conflicts. This collaborative framework seeks to transition from a data-rich, analysis-poor scenario to one where data-driven insights robustly inform peacebuilding interventions.
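For readers unfamiliar with the technique, the following is a generic sketch of pool-based active learning with uncertainty sampling, the broad family the commentary experiments with; the synthetic data and logistic classifier are placeholders, not the study's actual corpus or pipeline.

```python
# Sketch: each round, train on the labeled set, then query the pool items
# the model is least certain about (probability closest to 0.5) for labeling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
labeled = list(range(20))                 # small seed set labeled by experts
pool = list(range(20, len(y)))

clf = LogisticRegression(max_iter=1000)
for _ in range(5):                        # five query rounds
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    uncertainty = np.abs(proba - 0.5)     # closest to 0.5 = least certain
    query = [pool[i] for i in np.argsort(uncertainty)[:10]]
    # In the participatory setup, practitioners would label `query` here;
    # this simulation simply reveals the true labels.
    labeled += query
    pool = [i for i in pool if i not in query]

print("accuracy on remaining pool:", round(clf.score(X[pool], y[pool]), 3))
```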
This study presents surveillance data from 1 July 2003 to 30 June 2023 for community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) notified in the Kimberley region of Western Australia (WA) and describes the region’s changing CA-MRSA epidemiology over this period. A subset of CA-MRSA notifications from 1 July 2003 to 30 June 2015 was linked to inpatient and emergency department records. Episodes of care (EOC) during which a positive CA-MRSA specimen was collected within the first 48 hours of admission, and emergency presentations (EP) during which a positive CA-MRSA specimen was collected on the same day as presentation, were selected and analysed further. Notification rates of CA-MRSA in the Kimberley region of WA increased from 250 cases per 100,000 population in 2003/2004 to 3,625 cases per 100,000 in 2022/2023, peaking at 6,255 cases per 100,000 in 2016/2017. Since 2010, there has been an increase in notifications of Panton-Valentine leucocidin-positive (PVL+) CA-MRSA, predominantly due to the ‘Queensland Clone’. PVL+ CA-MRSA infections disproportionately affect younger, Aboriginal people and are associated with an increasing burden on hospital services, particularly emergency departments. It is unclear from this study whether PVL+ CA-MRSA is associated with more severe skin and soft-tissue infections, and further investigation is needed.
Alarm flood classification (AFC) methods are crucial in assisting human operators to identify and mitigate the overwhelming occurrences of alarm floods in industrial process plants, a challenge exacerbated by the complexity and data-intensive nature of modern process control systems. These alarm floods can significantly impair situational awareness and hinder decision-making. Existing AFC methods face difficulties in dealing with the inherent ambiguity in alarm sequences and the task of identifying novel, previously unobserved alarm floods. As a result, they often fail to accurately classify alarm floods. Addressing these significant limitations, this paper introduces a novel three-tier AFC method that uses alarm time series as input. In the transformation stage, alarm floods are subjected to an ensemble of convolutional kernel-based transformations (MultiRocket) to extract their characteristic dynamic properties, which are then fed into the classification stage, where a linear ridge regression classifier ensemble is used to identify recurring alarm floods. In the final novelty detection stage, the local outlier probability (LoOP) is used to determine a confidence measure of whether the classified alarm flood truly belongs to a known or previously unobserved class. Our method has been thoroughly validated using a publicly available dataset based on the Tennessee-Eastman process. The results show that our method outperforms two naive baselines and four existing AFC methods from the literature in terms of overall classification performance as well as the ability to optimize the balance between accurately identifying alarm floods from known classes and detecting alarm flood classes that have not been observed before.
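A structural sketch of the three-tier idea follows (our approximation, not the authors' code): the MultiRocket feature extraction is assumed to have been done elsewhere (e.g., with a time series library), the ridge classifier ensemble is reduced to a single RidgeClassifierCV, and scikit-learn's LocalOutlierFactor stands in for the paper's LoOP confidence measure.

```python
# Sketch: classify alarm floods into known classes, then gate out novel ones.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
F_train = rng.normal(size=(100, 300))   # placeholder kernel features per flood
labels = rng.integers(0, 4, size=100)   # known alarm-flood classes
F_new = rng.normal(size=(5, 300))       # incoming alarm floods

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(F_train, labels)
pred = clf.predict(F_new)               # tier 2: classify into known classes

nov = LocalOutlierFactor(novelty=True).fit(F_train)
is_known = nov.predict(F_new) == 1      # tier 3: flag previously unseen classes
for p, k in zip(pred, is_known):
    print(f"class {p}" if k else "novel alarm flood (unseen class)")
```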
Reducing antimicrobial use (AMU) in livestock may be one of the keys to limiting the emergence of antimicrobial resistance (AMR) in bacterial populations, including zoonotic pathogens. This study assessed the temporal association between AMU in livestock and AMR among Campylobacter isolates from human infections in the Netherlands between 2004 and 2020. Moreover, the associations between AMU and AMR in livestock, and between AMR in livestock and AMR in human isolates, were assessed. AMU and AMR data per antimicrobial class (tetracyclines, macrolides, and fluoroquinolones) for Campylobacter jejuni and Campylobacter coli from poultry, cattle, and human patients were retrieved from national surveillance programs. Associations were assessed using logistic regression and the Spearman correlation test. Overall, there was an increasing trend in AMR among human C. jejuni/coli isolates during the study period, which contrasted with a decreasing trend in livestock AMU. In addition, stable trends in AMR in broilers were observed. No significant associations were observed between AMU and AMR in domestically produced broilers. Moderate to strong positive correlations were found between the yearly prevalence of AMR in broiler and human isolates. Reducing AMU in Dutch livestock alone may therefore not be sufficient to tackle the growing problem of AMR in Campylobacter among human cases in the Netherlands. More insight is needed into the population genetics and the evolutionary processes involved in resistance and fitness among Campylobacter.
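For concreteness, the correlation step might look as follows, with made-up yearly prevalence series standing in for the surveillance data:

```python
# Sketch: Spearman rank correlation between yearly AMR prevalence in broiler
# and human Campylobacter isolates (hypothetical values, not the study's data).
from scipy.stats import spearmanr

broiler = [0.35, 0.38, 0.41, 0.45, 0.47, 0.52, 0.55]  # hypothetical
human   = [0.30, 0.33, 0.37, 0.40, 0.45, 0.49, 0.53]  # hypothetical

rho, p = spearmanr(broiler, human)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```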