This chapter draws on a range of quotes and findings from the internet and the literature. Its key premises, illustrated with examples, are as follows. First, Big Data requires the use of algorithms. Second, algorithms can create misleading information. Third, algorithms can lead to destructive outcomes. But we should not forget that humans program algorithms. With Big Data come algorithms to run many, and often involved, computations. We cannot oversee all these data ourselves, so we need algorithms to make computations for us. We might label these algorithms as Artificial Intelligence, but that label might suggest that they can do things on their own. They can run massive computations, but they need to be fed with data. This feeding is usually done by us, by humans, and we also choose the algorithms to be used.
In practice we do not always have clear guidance from economic theory about specifying an econometric model. At one extreme, it may be said that we should “let the data speak.” It is good to know that, when the data “speak,” what they say makes sense. We must be aware of a particularly important phenomenon in empirical econometrics: the spurious relationship. If you encounter a spurious relationship but do not recognize it as such, you may wrongly use that relationship for hypothesis testing or for creating forecasts. A spurious relationship appears when the model is not well specified. In this chapter, we see from a case study that people can draw strong but inappropriate conclusions if the econometric model is not well specified. We see that if you hypothesize a priori a structural break at a particular moment in time, and analyze the data based on that very assumption, then it is easy to draw inaccurate conclusions. As with influential observations, the lesson here is that one should first create an econometric model and, given that model, investigate whether there could have been a structural break.
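A classic illustration of this pitfall is regressing one independent random walk on another. The following minimal sketch is not taken from the chapter; the variable names and simulation settings are purely illustrative. It shows how such a regression routinely reports a “significant” slope even though the two series are unrelated.

```python
# Illustrative only: two independent random walks, when regressed on one another,
# often produce a "significant" slope even though there is no true relationship.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = np.cumsum(rng.standard_normal(n))  # independent random walk
y = np.cumsum(rng.standard_normal(n))  # another, unrelated random walk

ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.params)   # intercept and slope
print(ols.pvalues)  # the slope p-value is frequently far below 0.05
```

Such seemingly strong fits disappear once the model is better specified, for example by working with differenced series or by including the relevant dynamics.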
This chapter deals with missing data and a few approaches to managing them. There are several reasons why data can be missing. For example, people can throw away older data, which can sometimes be sensible. It may also be the case that you want to analyze a phenomenon that occurs at an hourly level but only have data at the daily level; thus, the hourly data are missing. It may also be that a survey is simply too long, so people get tired and do not answer all questions. In this chapter we review various situations where data are missing and how we can recognize them. Sometimes we know how to manage the situation of missing data. Often there is no need to panic, and modifications of models and/or estimation methods can be used. We encounter a case in which data are made missing on purpose, by selective sampling, to subsequently facilitate empirical analysis. Such analysis explicitly takes account of the missingness, and the impact of missing data can become minor.
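As a purely illustrative aside (not drawn from the chapter), the sketch below contrasts two simple ways of dealing with missing survey answers, listwise deletion and mean imputation, on a made-up table; which approach is appropriate depends on why the data are missing.

```python
# Illustrative only: two basic ways of handling missing values in a survey-like table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 51, np.nan, 28, 45],
    "income": [42.0, np.nan, 38.5, np.nan, 55.0],  # in thousands; some answers missing
})

dropped = df.dropna()           # listwise deletion: discard incomplete rows
imputed = df.fillna(df.mean())  # mean imputation: fill gaps with column means
print(len(df), len(dropped))    # 5 rows before, 2 complete rows after deletion
print(imputed)
```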
Currently we may have access to large databases, sometimes referred to as Big Data, and for those large datasets simple econometric models will not do. When you have a million people in your database, as insurance firms, telephone providers, or charities do, and you have collected information on these individuals for many years, you simply cannot summarize these data using a small econometric model with just a few regressors. In this chapter we address diverse options for handling Big Data. We kick off with a discussion of what Big Data is and why it is special. Next, we discuss a few options such as selective sampling, aggregation, nonlinear models, and variable reduction. Methods such as ridge regression, the lasso, the elastic net, and artificial neural networks are also addressed; these latter concepts are nowadays described as machine learning methods. We see that with these methods the number of choices rapidly increases, and that reproducibility can suffer. The analysis of Big Data therefore comes at the cost of more analysis and of more choices to make and to report.
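For readers who want a concrete starting point, the sketch below fits the shrinkage estimators mentioned above using scikit-learn on synthetic data with many regressors; the penalty strengths and data dimensions are arbitrary illustrations, not choices made in the chapter.

```python
# Illustrative shrinkage/variable-reduction methods on synthetic "wide" data.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=1_000, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    kept = int((abs(model.coef_) > 1e-6).sum())
    print(f"{type(model).__name__}: {kept} of {X.shape[1]} coefficients effectively non-zero")
```

The lasso and the elastic net shrink many coefficients exactly to zero, which is one way to reduce the number of variables that have to be estimated and reported.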
Originating from a unique partnership between data scientists (datavaluepeople) and peacebuilders (Build Up), this commentary explores an innovative methodology to overcome key challenges in social media analysis by developing customized text classifiers through a participatory design approach, engaging both peace practitioners and data scientists. It advocates for researchers to focus on developing frameworks that prioritize usability and participation in field settings over perfection in simulation. Focusing on a case study investigating polarization within online Christian communities in the United States, we outline a testing process with a dataset consisting of 8,954 tweets and 10,034 Facebook posts to experiment with active learning methodologies aimed at enhancing the efficiency and accuracy of text classification. This commentary demonstrates that the inclusion of domain expertise from peace practitioners significantly refines the design and performance of text classifiers, enabling a deeper comprehension of digital conflicts. This collaborative framework seeks to transition from a data-rich, analysis-poor scenario to one where data-driven insights robustly inform peacebuilding interventions.
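The commentary's own pipeline is not reproduced here; the generic uncertainty-sampling loop below, run on synthetic data, merely illustrates the active-learning idea of letting the current classifier pick which posts practitioners should annotate next. All names, sizes, and settings are hypothetical.

```python
# Generic active-learning loop (uncertainty sampling), illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=50, random_state=0)
labeled = list(range(20))                              # small seed set of annotated posts
pool = [i for i in range(len(y)) if i not in labeled]  # posts awaiting annotation

clf = LogisticRegression(max_iter=1_000)
for _ in range(5):                                     # five annotation rounds
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)              # least-confident sampling
    query = [pool[i] for i in np.argsort(uncertainty)[-10:]]
    labeled += query                                   # practitioners supply these labels
    pool = [i for i in pool if i not in query]

print("examples labeled after five rounds:", len(labeled))
```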
This study presents surveillance data from 1 July 2003 to 30 June 2023 for community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) notified in the Kimberley region of Western Australia (WA) and describes the region’s changing CA-MRSA epidemiology over this period. A subset of CA-MRSA notifications from 1 July 2003 to 30 June 2015 was linked to inpatient and emergency department records. Episodes of care (EOC) during which a positive CA-MRSA specimen was collected within the first 48 hours of admission and emergency presentations (EP) during which a positive CA-MRSA specimen was collected on the same day as presentation were selected and analysed further. Notification rates of CA-MRSA in the Kimberley region of WA increased from 250 cases per 100,000 population in 2003/2004 to 3,625 cases per 100,000 in 2022/2023, peaking at 6,255 cases per 100,000 in 2016/2017. Since 2010, there has been an increase in notifications of Panton-Valentine leucocidin positive (PVL+) CA-MRSA, predominantly due to the ‘Queensland Clone’. PVL+ CA-MRSA infections disproportionately affect younger, Aboriginal people and are associated with an increasing burden on hospital services, particularly emergency departments. It is unclear from this study whether PVL+ MRSA are associated with more severe skin and soft-tissue infections, and further investigation is needed.
Alarm flood classification (AFC) methods are crucial in assisting human operators to identify and mitigate the overwhelming occurrences of alarm floods in industrial process plants, a challenge exacerbated by the complexity and data-intensive nature of modern process control systems. These alarm floods can significantly impair situational awareness and hinder decision-making. Existing AFC methods face difficulties in dealing with the inherent ambiguity in alarm sequences and the task of identifying novel, previously unobserved alarm floods. As a result, they often fail to accurately classify alarm floods. Addressing these significant limitations, this paper introduces a novel three-tier AFC method that uses alarm time series as input. In the transformation stage, alarm floods are subjected to an ensemble of convolutional kernel-based transformations (MultiRocket) to extract their characteristic dynamic properties, which are then fed into the classification stage, where a linear ridge regression classifier ensemble is used to identify recurring alarm floods. In the final novelty detection stage, the local outlier probability (LoOP) is used to determine a confidence measure of whether the classified alarm flood truly belongs to a known or previously unobserved class. Our method has been thoroughly validated using a publicly available dataset based on the Tennessee-Eastman process. The results show that our method outperforms two naive baselines and four existing AFC methods from the literature in terms of overall classification performance as well as the ability to optimize the balance between accurately identifying alarm floods from known classes and detecting alarm flood classes that have not been observed before.
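To convey the overall shape of such a three-tier pipeline, the sketch below uses off-the-shelf scikit-learn stand-ins (crude summary features, RidgeClassifierCV, and LocalOutlierFactor) in place of the paper's MultiRocket transform, ridge classifier ensemble, and LoOP stage; it illustrates the structure only and is not the authors' implementation.

```python
# Three-tier structure with simple stand-ins: transform -> classify -> novelty check.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
floods = rng.standard_normal((60, 300))   # 60 toy alarm "time series" of length 300
labels = rng.integers(0, 3, size=60)      # 3 known alarm-flood classes


def summarize(series: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: per-series summary statistics instead of MultiRocket features."""
    return np.column_stack([series.mean(1), series.std(1), series.max(1), series.min(1)])


feats = summarize(floods)

# Stage 2 stand-in: a single linear ridge classifier instead of an ensemble.
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(feats, labels)

# Stage 3 stand-in: local outlier factor instead of LoOP as the novelty score.
novelty = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(feats)

new_flood = rng.standard_normal((1, 300)) * 5.0            # an unusually scaled new flood
new_feats = summarize(new_flood)
print(clf.predict(new_feats), novelty.predict(new_feats))  # -1 flags a likely novel class
```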
Reducing antimicrobial use (AMU) in livestock may be one of the keys to limiting the emergence of antimicrobial resistance (AMR) in bacterial populations, including zoonotic pathogens. This study assessed the temporal association between AMU in livestock and AMR among Campylobacter isolates from human infections in the Netherlands between 2004 and 2020. Moreover, the associations between AMU and AMR in livestock and between AMR in livestock and AMR in human isolates were assessed. AMU and AMR data per antimicrobial class (tetracyclines, macrolides, and fluoroquinolones) for Campylobacter jejuni and Campylobacter coli from poultry, cattle, and human patients were retrieved from national surveillance programs. Associations were assessed using logistic regression and the Spearman correlation test. Overall, there was an increasing trend in AMR among human C. jejuni/coli isolates during the study period, which contrasted with a decreasing trend in livestock AMU. In addition, stable trends in AMR in broilers were observed. No significant associations were observed between AMU and AMR in domestically produced broilers. Moderate to strong positive correlations were found between the yearly prevalence of AMR in broiler and human isolates. Reducing AMU in Dutch livestock alone may therefore not be sufficient to tackle the growing problem of AMR in Campylobacter among human cases in the Netherlands. More insight is needed regarding the population genetics and the evolutionary processes involved in resistance and fitness among Campylobacter.
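By way of illustration only, the snippet below runs the kind of yearly Spearman correlation test described above on made-up AMU and AMR series; the numbers are invented and are not from the Dutch surveillance programs.

```python
# Illustrative Spearman correlation between yearly livestock AMU and human AMR (made-up data).
from scipy.stats import spearmanr

amu_livestock = [100, 98, 97, 95, 90, 85, 70, 60, 55, 50, 48, 45, 44, 43, 42, 41, 40]
amr_human_pct = [20, 21, 22, 22, 24, 25, 27, 28, 30, 31, 33, 34, 35, 36, 37, 38, 40]

rho, p = spearmanr(amu_livestock, amr_human_pct)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # negative rho: AMR rises while AMU falls
```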
In this paper, we use an information-theoretic approach called cumulative residual extropy (CRJ) to compare mixed used systems. We establish mixture representations for the CRJ of mixed used systems and then explore the measure and comparison results among these systems. We compare the mixed used systems based on stochastic orders and stochastically ordered conditional coefficient vectors. Additionally, we derive bounds for the CRJ of mixed used systems with independent and identically distributed components. We also propose the Jensen-cumulative residual extropy (JCRJ) divergence to calculate the complexity of systems. To demonstrate the utility of these results, we calculate and compare the CRJ and JCRJ divergence of mixed used systems in the exponential model. Furthermore, we determine the optimal system configuration based on signature under a criterion function derived from JCRJ in the exponential model.
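(For reference, and assuming the standard definition rather than quoting the paper: the cumulative residual extropy of a non-negative random variable $X$ with survival function $\bar{F}(x)=P(X \gt x)$ is commonly written $\xi J(X)=-\tfrac{1}{2}\int _{0}^{\infty }\bar{F}^{2}(x)\,dx$.)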
We study the design of insurance contracts when the insurer faces arson-type risks. We show that the optimal contract must be manipulation-proof. Therefore, it is continuous, has a bounded slope, and satisfies the no-sabotage condition when arson-type actions are free. Any contract that mixes a deductible, coinsurance, and an upper limit is manipulation-proof. A key feature of our models is that we provide a simple, general, and entirely elementary proof of manipulation-proofness that is easily adapted to different settings. We also show that the ability to perform arson-type actions reduces the insured’s welfare, as less coverage is offered in equilibrium.
We study the mixing time of the single-site update Markov chain, known as the Glauber dynamics, for generating a random independent set of a tree. Our focus is on obtaining optimal convergence results for arbitrary trees. We consider the more general problem of sampling from the Gibbs distribution in the hard-core model, where independent sets are weighted by a parameter $\lambda \gt 0$; the special case $\lambda =1$ corresponds to the uniform distribution over all independent sets. Previous work of Martinelli, Sinclair and Weitz (2004) obtained optimal mixing time bounds for the complete $\Delta$-regular tree for all $\lambda$. However, Restrepo, Stefankovic, Vera, Vigoda, and Yang (2014) showed that for sufficiently large $\lambda$ there are bounded-degree trees where optimal mixing does not hold. Recent work of Eppstein and Frishberg (2022) proved a polynomial mixing time bound for the Glauber dynamics for arbitrary trees, and more generally for graphs of bounded tree-width.
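(For context: the hard-core Gibbs distribution referred to here is the standard one, assigning each independent set $I$ of the graph probability $\mu (I)=\lambda ^{|I|}/Z$, where the normalizing constant $Z=\sum _{I}\lambda ^{|I|}$ sums over all independent sets; setting $\lambda =1$ recovers the uniform distribution.)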
We establish an optimal bound on the relaxation time (i.e., inverse spectral gap) of $O(n)$ for the Glauber dynamics for unweighted independent sets on arbitrary trees. We stress that our results hold for arbitrary trees and there is no dependence on the maximum degree $\Delta$. Interestingly, our results extend (far) beyond the uniqueness threshold, which is of order $\lambda =O(1/\Delta )$. Our proof approach is inspired by recent work on spectral independence. In fact, we prove that spectral independence holds with a constant independent of the maximum degree for any tree, but this does not imply mixing for general trees, as the optimal mixing results of Chen, Liu, and Vigoda (2021) only apply to bounded-degree graphs. We instead utilize the combinatorial nature of independent sets to directly prove approximate tensorization of variance via a non-trivial inductive proof.
We answer the following question: if the occupied (or vacant) set of a planar Poisson Boolean percolation model contains a crossing of an $n\times n$ square, how wide is this crossing? The answer depends on whether we consider the critical, sub-, or super-critical regime, and is different for the occupied and vacant sets.
Assisted and automated driving functions will rely on machine learning algorithms, given their ability to cope with real-world variations, e.g. vehicles of different shapes, positions, colors, and so forth. Supervised learning needs annotated datasets, and several automotive datasets are available. However, these datasets are tremendous in volume, and labeling accuracy and quality can vary across different datasets and within dataset frames. Accurate and appropriate ground truth is especially important in the automotive domain, as “incomplete” or “incorrect” learning can negatively impact vehicle safety when these neural networks are deployed. This work investigates the ground truth quality of widely adopted automotive datasets, including a detailed analysis of KITTI MoSeg. Based on the errors identified and classified in the annotations of different automotive datasets, this article provides three collections of criteria for producing improved annotations. These criteria are enforceable and applicable to a wide variety of datasets. The three annotation sets are created to (i) remove dubious cases; (ii) annotate to the best of the human visual system’s ability; and (iii) remove clearly erroneous bounding boxes (BBs). KITTI MoSeg has been reannotated three times according to the specified criteria, and three state-of-the-art deep neural network object detectors are used to evaluate the reannotations. The results clearly show that network performance is affected by ground truth variations, and that removing clear errors is beneficial for predicting real-world objects only for some networks. The relabeled datasets still present some cases with “arbitrary”/“controversial” annotations, and therefore this work concludes with some guidelines related to dataset annotation, metadata/sublabels, and specific automotive use cases.
Many physical systems exhibit limit-cycle oscillations that can typically be modeled as stochastically driven self-oscillators. In this work, we focus on a self-oscillator model in which the nonlinearity is in the damping term. In various applications, it is crucial to determine the nonlinear damping term and the noise intensity of the driving force. This article presents a novel approach that employs a deep operator network (DeepONet) for parameter identification of self-oscillators. We build our work upon a system identification methodology based on the adjoint Fokker–Planck formulation, which is robust to finite sampling interval effects. We employ DeepONet as a surrogate model for the operator that maps the first Kramers–Moyal (KM) coefficient to the first and second finite-time KM coefficients. The proposed approach can directly predict the finite-time KM coefficients, eliminating the intermediate computation of the solution field of the adjoint Fokker–Planck equation. Additionally, the differentiability of the neural network readily facilitates the use of gradient-based optimizers, further accelerating the identification process. The numerical experiments demonstrate that the proposed methodology can recover the desired parameters with a significant reduction in time while maintaining an accuracy comparable to that of the classical finite-difference approach. The low computational time of the forward path enables Bayesian inference of the parameters. The Metropolis-adjusted Langevin algorithm is employed to obtain the posterior distribution of the parameters. The proposed method is validated against numerical simulations and experimental data obtained from a linearly unstable turbulent combustor.
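As a generic architectural sketch only (layer sizes, names, and data are placeholders, and this is not the authors' surrogate for the adjoint Fokker–Planck operator), a minimal branch/trunk DeepONet in PyTorch looks as follows: the branch network encodes samples of the input function, the trunk network encodes a query point, and the prediction is their inner product.

```python
# Minimal DeepONet-style skeleton (generic sketch, not the authors' model).
import torch
import torch.nn as nn


class TinyDeepONet(nn.Module):
    def __init__(self, n_sensors: int = 64, width: int = 64, p: int = 32):
        super().__init__()
        # Branch net: encodes the input function from its values at n_sensors points.
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(),
                                    nn.Linear(width, p))
        # Trunk net: encodes the location at which the output function is evaluated.
        self.trunk = nn.Sequential(nn.Linear(1, width), nn.Tanh(),
                                   nn.Linear(width, p))

    def forward(self, u_samples: torch.Tensor, x_query: torch.Tensor) -> torch.Tensor:
        # u_samples: (batch, n_sensors) samples of the input function
        # x_query:   (batch, 1) evaluation points
        return (self.branch(u_samples) * self.trunk(x_query)).sum(dim=-1, keepdim=True)


model = TinyDeepONet()
u = torch.randn(8, 64)    # 8 input functions, each sampled at 64 sensor points
x = torch.rand(8, 1)      # one query location per function
print(model(u, x).shape)  # torch.Size([8, 1])
```

Because the whole map is differentiable, gradients with respect to inputs and parameters are available directly, which is the property that makes gradient-based optimization of the identification problem straightforward.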
This study focuses on the practicalities of establishing and maintaining AI infrastructure, as well as the considerations for responsible governance, by investigating the integration of a pre-trained large language model (LLM) with an organisation’s knowledge management system via a chat interface. The research adopts the concept of “AI as a constituted system” to emphasise the social, technical, and institutional factors that contribute to AI’s governance and accountability. Through an ethnographic approach, this article details the iterative processes of negotiation, decision-making, and reflection among organisational stakeholders as they develop, implement, and manage the AI system. The findings indicate that LLMs can be effectively governed and held accountable to stakeholder interests within specific contexts, namely when clear institutional boundaries facilitate innovation while navigating the risks related to data privacy and AI misbehaviour. Effective constitution and use can be attributed to distinct policy-creation processes that guide AI’s operation, clear lines of responsibility, and localised feedback loops that ensure accountability for actions taken. This research provides a foundational perspective to better understand algorithmic accountability and governance within organisational contexts. It also envisions a future where AI is not universally scaled but consists of localised, customised LLMs tailored to stakeholder interests.
Open data promises various benefits, including stimulating innovation, improving transparency and public decision-making, and enhancing the reproducibility of scientific research. Nevertheless, numerous studies have highlighted myriad challenges related to preparing, disseminating, processing, and reusing open data, with newer studies revealing similar issues to those identified a decade prior. Several researchers have proposed the open data ecosystem (ODE) as a lens for studying and devising interventions to address these issues. Since actors in the ecosystem are individually and collectively impacted by the sustainability of the ecosystem, all have a role in tackling the challenges in the ODE. This paper asks what the contributions of open data intermediaries may be in addressing these challenges. Open data intermediaries are third-party actors providing specialized resources and capabilities to (i) enhance the supply, flow, and/or use of open data and/or (ii) strengthen the relationships among various open data stakeholders. They are critical in ensuring the flow of resources within the ODE. Through semi-structured interviews and a validation exercise in the European Union context, this study explores the potential contribution of open data intermediaries and the specific ODE challenges they may address. This study identified 20 potential contributions, addressing 27 challenges. The findings of this study pave the way for further inquiry into the internal incentives (viable business models) and external incentives (policies and regulations) to direct the contributions of open data intermediaries toward addressing challenges in the ODE.