This paper identifies the potential benefits of data sharing and open science, supported by artificial intelligence tools and services, and examines the challenges of making data open and findable, accessible, interoperable, and reusable (FAIR).
Machine-generated artworks are now part of the contemporary art scene: they attract significant investment and are exhibited alongside works created by human artists. These artworks are mainly based on generative deep learning (GDL) techniques, which have seen formidable development and remarkable refinement in recent years. Given the inherent characteristics of these techniques, a series of novel legal problems arises. In this article, we consider a set of key questions in the area of GDL for the arts, including the following: is it possible to use copyrighted works as a training set for generative models? How can copies of such works be legally stored in order to perform the training process? Who (if anyone) will own the copyright on the generated data? We try to answer these questions under the law in force in both the United States and the European Union, and consider potential future alternatives. We then extend our analysis to code generation, an emerging area of GDL. Finally, we formulate a set of practical guidelines for artists and developers working on deep learning generated art, as well as some suggestions for policymakers.
Open innovation programmes related to data and artificial intelligence have interested European policy-makers as a means of supporting startups and small and medium-sized enterprises to succeed in the digital economy. We discuss the objectives behind the typical service offerings of such programmes and make the case for exploring how they align with the motivations of the individual companies targeted by these initiatives. Using a qualitative analysis of 50 startup applications from the Data Market Services Accelerator programme, we find that applicants wrote most frequently about fundraising, acceleration and data skills. A smaller number of startups expressed interest in services related to standardization or legal guidance on the General Data Protection Regulation and intellectual property rights, which are among the ongoing priority areas for the European Commission. We discuss how the value propositions of these less sought-after offerings can be amplified by appealing to the existing business motivations of data-driven startups.
In the past 10–15 years, the government of China has made various efforts to tackle excessive antibiotic use. Yet little is known about their effects in rural primary care settings. This study aimed to determine the impact of government policies and the COVID-19 pandemic on antibiotic prescribing practices in such settings, utilizing data from separate studies carried out before and during the pandemic (in 2016 and 2021) in Anhui province, China, using identical sampling and survey approaches. Data on antibiotics prescribed, diagnoses, socio-demographics, etc. were obtained through non-participative observation and a structured exit survey. Data analysis comprised mainly descriptive comparisons of the 1153 and 762 patients with respiratory infections recruited in 2016 and 2021, respectively. The overall antibiotic prescription rate decreased from 89.6% in 2016 to 69.1% in 2021, and the proportion of prescriptions containing two or more classes of antibiotics was estimated at 35.9% in 2016 and 11.0% in 2021. There was a statistically significant decrease between the two years in the number of days from symptom onset to clinic visit. In conclusion, measures to constrain excessive prescription of antibiotics have led to some improvements at the rural primary care level, and the COVID-19 pandemic has had varying effects on antibiotic use.
A graph $H$ is common if the number of monochromatic copies of $H$ in a 2-edge-colouring of the complete graph $K_n$ is asymptotically minimised by the random colouring. Burr and Rosta, extending a famous conjecture of Erdős, conjectured that every graph is common. The conjectures of Erdős and of Burr and Rosta were disproved by Thomason and by Sidorenko, respectively, in the late 1980s. Little progress had been made since then in collecting new examples of common graphs, although very recently a few more graphs were verified to be common by the flag algebra method or by recent progress on Sidorenko's conjecture. Our contribution here is to provide several new classes of tripartite common graphs. The first example is the class of so-called triangle trees, which generalises two theorems by Sidorenko and answers a question of Jagger, Šťovíček, and Thomason from 1996. We also prove that, somewhat surprisingly, given any tree $T$, there exists a triangle tree such that the graph obtained by adding $T$ as a pendant tree is still common. Furthermore, we show that adding arbitrarily many apex vertices to any connected bipartite graph on at most $5$ vertices yields a common graph.
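For readers less familiar with the terminology, commonality is often stated in homomorphism-density form; the following is a standard formulation from the literature, not quoted from the paper: $H$ is common if and only if
$$t_H(W) + t_H(1-W) \;\geq\; 2^{1-e(H)} \qquad \text{for every graphon } W,$$
where $t_H$ denotes the homomorphism density of $H$ and $e(H)$ its number of edges; the right-hand side is the value attained by the quasi-random colouring $W \equiv \tfrac{1}{2}$.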
There has been growing interest among pension plan sponsors in envisioning how the mortality experience of their active and deferred members may turn out if a pandemic similar to COVID-19 occurs in the future. To address their needs, we propose in this paper a stochastic model for simulating future mortality scenarios with COVID-like effects. The proposed model encompasses three parameter levels. The first level includes parameters that capture the long-term pattern of mortality, whereas the second level contains parameters that gauge the excess age-specific mortality due to COVID-19. Parameters in the first and second levels are estimated using the penalised quasi-likelihood maximisation method originally proposed for generalised linear mixed models. Finally, the third level includes parameters that draw on expert opinions concerning, for example, how likely a COVID-like pandemic is to occur in the future. We illustrate our proposed model with data from the United States and a range of expert opinions.
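One illustrative way to write such a three-level structure (a sketch under assumed notation; the paper's exact specification is not given in the abstract) is
$$\log m_{x,t} \;=\; \underbrace{a_x + b_x k_t}_{\text{level 1: long-term pattern}} \;+\; \underbrace{J_t\, c_x}_{\text{level 2: pandemic excess}}, \qquad J_t \sim \mathrm{Bernoulli}(p),$$
where $m_{x,t}$ is the death rate at age $x$ in year $t$, $J_t$ indicates a pandemic year, and level-3 quantities such as the arrival probability $p$ would be elicited from expert opinion.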
The duration of immunity after first severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and the extent to which prior immunity prevents reinfection are uncertain and remain important questions in the context of new variants. In this retrospective population-based matched observational study, we identified the first polymerase chain reaction (PCR)-positive test of each primary SARS-CoV-2 infection case between 1 March 2020 and 30 September 2020. Each case was matched by age, sex, upper tier local authority of residence and testing route to one individual testing negative by PCR in the same week (control). After a 90-day pre-follow-up period for cases and controls, any subsequent positive tests up to 31 December 2020, and deaths within 28 days of testing positive, were identified; this follow-up encompassed an essentially vaccine-free period. We used conditional logistic regression to analyse the results. There were 517 870 individuals in the matched cohort, with 2815 reinfection cases and 12 098 first infections. The protective effect of a prior SARS-CoV-2 PCR-positive episode was 78% (odds ratio (OR) 0.22, 0.21–0.23). Protection rose to 82% (OR 0.18, 0.17–0.19) after a sensitivity analysis excluded 933 individuals with a first test between March and May and a subsequent positive test between June and September 2020. Amongst individuals testing positive by PCR during follow-up, reinfection cases had 77% lower odds of symptoms at the second episode (adjusted OR 0.23, 0.20–0.26) and 45% lower odds of dying in the 28 days after reinfection (adjusted OR 0.55, 0.42–0.71). Prior SARS-CoV-2 infection offered protection against reinfection in this population. There was some evidence that reinfections increased with the alpha variant compared with the wild-type SARS-CoV-2 variant, highlighting the importance of continued monitoring as new variants emerge.
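As a minimal sketch of the analysis style described here (not the authors' code; variable names and the simulated data are illustrative assumptions), a matched-pair conditional logistic regression can be fitted with statsmodels:

```python
# Sketch: matched case-control analysis via conditional logistic regression.
# Each stratum is one matched pair; exposure is prior PCR-positivity,
# outcome is a positive test during follow-up. Data here are simulated.
import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
n_pairs = 5000

strata = np.repeat(np.arange(n_pairs), 2)
prior_positive = np.tile([1, 0], n_pairs)             # case vs. matched control
p_infect = np.where(prior_positive == 1, 0.02, 0.09)  # assumed strong protection
infected = rng.binomial(1, p_infect)

df = pd.DataFrame({"stratum": strata,
                   "prior_positive": prior_positive,
                   "infected": infected})

model = ConditionalLogit(df["infected"], df[["prior_positive"]],
                         groups=df["stratum"])
result = model.fit()
print(np.exp(result.params))  # odds ratio; ~0.2 given the simulated protection
```

Pairs with identical outcomes drop out of the conditional likelihood automatically, which is why this estimator suits a matched design.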
The choice of a copula model from limited data is a hard but important task. Motivated by the visual patterns that different copula models produce in smoothed density heatmaps, we consider copula model selection as an image recognition problem. We extract image features from heatmaps using the pre-trained AlexNet and present workflows for model selection that combine image features with statistical information. We employ dimension reduction via Principal Component and Linear Discriminant Analyses and use a Support Vector Machine classifier. Simulation studies show that the use of image data improves the accuracy of the copula model selection task, particularly in scenarios where sample sizes and correlations are low. This finding indicates that transfer learning can support statistical procedures of model selection. We demonstrate application of the proposed approach to the joint modelling of weekly returns of the MSCI and RISX indices.
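A minimal sketch of such a pipeline (an assumed implementation, not the paper's code; the choice of layer, preprocessing and classifier settings are illustrative):

```python
# Sketch: extract pre-trained AlexNet features from a copula-density heatmap
# rendered as an RGB image, then classify with PCA + SVM.
import torch
from torchvision import models, transforms
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.eval()
# Everything up to the last hidden layer acts as a fixed 4096-dim extractor.
feature_extractor = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def heatmap_features(pil_image):
    """4096-dim AlexNet features for one heatmap image (PIL.Image, RGB)."""
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)
        return feature_extractor(x).squeeze(0).numpy()

# X: feature matrix over simulated heatmaps; y: copula-family labels
# (e.g. 0 = Gaussian, 1 = Clayton, 2 = Gumbel), built elsewhere.
# clf = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
# clf.fit(X, y); clf.predict(X_new)
```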
In this paper, we first introduce SPLICE (Synthetic Paid Loss and Incurred Cost Experience), a simulator of case estimates of incurred losses. In three modules, case estimates are simulated in continuous time, and a record is output for each individual claim. Revisions of the case estimates are also simulated as a sequence over the lifetime of the claim in a number of different situations. Furthermore, some dependencies in relation to case estimates of incurred losses are incorporated, particularly recognising certain properties of case estimates that are found in practice. For example, the magnitude of revisions depends on ultimate claim size, as does the distribution of the revisions over time. Some of these revisions occur in response to the occurrence of claim payments, and so SPLICE requires input of simulated per-claim payment histories. The claim data can be summarised by accident and payment “periods” whose duration is an arbitrary choice (e.g. month, quarter, etc.) available to the user. SPLICE is built on an existing simulator of individual claim experience called SynthETIC (introduced in Avanzi et al. 2021b, Insurance: Mathematics and Economics, 100, 296–308), which offers flexible modelling of occurrence and notification, as well as the timing and magnitude of individual partial payments. The incurred loss estimates, by contrast, constitute the additional contribution of SPLICE; their inclusion provides a facility that almost no other simulator does. SPLICE is a fully documented R package that is publicly available and open source (on CRAN). SPLICE, combined with SynthETIC, provides 11 modules (occurrence, notification, etc.), any one or more of which may be re-designed according to the user’s requirements. It comes with a default version that is loosely calibrated to resemble a specific (but anonymous) Auto Bodily Injury portfolio, as well as data generation functionality that outputs alternative data sets under a range of hypothetical scenarios differing in complexity. The general structure is suitable for most lines of business, with some reparameterisation.
We describe a (nonparametric) prediction algorithm for spatial data, based on a canonical factorization of the spectral density function. We provide theoretical results showing that the predictor has desirable asymptotic properties. Finite sample performance is assessed in a Monte Carlo study that also compares our algorithm to a rival nonparametric method based on the infinite $AR$ representation of the dynamics of the data. Finally, we apply our methodology to predict house prices in Los Angeles.
This textbook for students in the health and social sciences covers the basics of linear model methods with a minimum of mathematics, assuming only a pre-calculus background. Numerous examples, drawn from the news and current events with an emphasis on health issues, illustrate the concepts in an immediately accessible way. Methods covered include linear regression models, Poisson regression, logistic regression, proportional hazards regression, survival analysis, and nonparametric regression. The author emphasizes interpretation of computer output in terms of the motivating example. All of the R code is provided and carefully explained, allowing readers to quickly apply the methods to their own data. Plenty of exercises help students think about the issues involved in the analysis and its interpretation. Code and datasets are available for download from the book's website at www.cambridge.org/zelterman.
We consider an extreme renewal process with no-mean heavy-tailed Pareto(II) inter-renewals and shape parameter $\alpha$, where $0\lt\alpha \leq 1$. Two steps, both requiring extensive computations, are needed to derive integral expressions for the analytic probability density functions (pdfs) of the excess, age, and total life at fixed finite time $t$. Step 1 creates and solves a Volterra integral equation of the second kind for the limiting pdf of a basic underlying regenerative process defined in the text, which is used for all three fixed finite time $t$ pdfs. Step 2 builds the aforementioned integral expressions from the limiting pdf of the basic underlying regenerative process. The limits of the fixed finite time $t$ pdfs as $t\rightarrow \infty$ do not exist. To reasonably observe the large $t$ pdfs in the extreme renewal process, we approximate them using the limiting pdfs, which have simple well-known formulas, in a companion renewal process whose inter-renewals are right-truncated Pareto(II) variates with finite mean; this involves no computations. The distance between the approximating limiting pdfs and the analytic fixed finite time large $t$ pdfs is given by an $L_{1}$ metric taking values in $(0,1)$, where “near $0$” means “close” and “near $1$” means “far”.
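For concreteness (the abstract does not fix a parameterisation), a common form of the Pareto(II), or Lomax, density with shape $\alpha$ and scale $\sigma$ is
$$f(x) \;=\; \frac{\alpha}{\sigma}\left(1+\frac{x}{\sigma}\right)^{-(\alpha+1)}, \qquad x \geq 0,$$
whose mean $\sigma/(\alpha-1)$ is finite only when $\alpha > 1$; this is why the regime $0 \lt \alpha \leq 1$ above has no mean.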
Machine learning has recently entered the mortality literature in order to improve the forecasts of stochastic mortality models. This paper proposes using two pure tree-based machine learning models, random forests and gradient boosting, trained on differenced log-mortality rates, to produce more accurate mortality forecasts. These forecasts are compared with forecasts from traditional stochastic mortality models and with forecasts from random forest and gradient boosting variants of those models. The comparisons are based on the Model Confidence Set procedure. The results show that the pure tree-based models significantly outperform all other models in the majority of cases considered. To address the lack of interpretability associated with machine learning models, we demonstrate how to extract information about the relationships uncovered by the tree-based models. For this purpose, we consider variable importance, partial dependence plots, and variable split conditions. Results from the in-sample fit suggest that tree-based models can be very useful tools for detecting patterns within and between variables that are not commonly identifiable with traditional methods.
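A minimal sketch of the forecasting setup described here (an assumed implementation; the feature construction with age and lagged differences is an illustrative choice, not the paper's exact design):

```python
# Sketch: fit a random forest to differenced log-mortality rates and
# forecast one year ahead, then undo the differencing.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forecast_next_year(rates, n_lags=3):
    """rates: array of shape (n_years, n_ages) of central death rates m_{x,t}."""
    d = np.diff(np.log(rates), axis=0)  # differenced log-mortality
    X, y = [], []
    for t in range(n_lags, d.shape[0]):
        for age in range(d.shape[1]):
            X.append([age, *d[t - n_lags:t, age]])  # age + recent differences
            y.append(d[t, age])
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(np.array(X), np.array(y))

    # Predict next year's differences, then invert the transformation.
    X_new = np.array([[age, *d[-n_lags:, age]] for age in range(d.shape[1])])
    return np.exp(np.log(rates[-1]) + model.predict(X_new))
```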
Data-driven analysis of complex networks has been a focus of research for decades. An important line of research is to study how well real networks can be described with a small selection of metrics, and how well network models can capture the relations between graph metrics observed in real networks. In this paper, we apply machine-learning techniques to investigate these problems. We study 500 real-world networks along with 2000 synthetic networks generated by four frequently used network models with previously calibrated parameters, chosen to make the generated graphs as similar to the real networks as possible. This paper unifies several branches of data-driven complex network analysis, such as the study of graph metrics and their pairwise relationships, network similarity estimation, model calibration, and graph classification. We find that the correlation profiles of the structural measures differ significantly across network domains and that the domain can be efficiently determined using a small selection of graph metrics. The structural properties of the network models with fixed parameters are robust enough to permit parameter calibration. The goodness-of-fit of the network models depends strongly on the network domain. By solving classification problems, we find that the models lack the capability of generating a graph with a high clustering coefficient and a relatively large diameter simultaneously. On the other hand, the models are able to capture the degree-distribution-related metrics exactly.
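As a minimal sketch of this kind of workflow (illustrative assumptions throughout: the metric set, classifier and labels are not taken from the paper):

```python
# Sketch: compute a small vector of graph metrics with networkx and
# classify the network's domain with a random forest.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def metric_vector(G):
    degrees = [d for _, d in G.degree()]
    return [
        G.number_of_nodes(),
        nx.density(G),
        nx.average_clustering(G),
        nx.degree_assortativity_coefficient(G),
        np.mean(degrees),
        np.max(degrees) / np.mean(degrees),  # degree heterogeneity proxy
    ]

# graphs: list of nx.Graph; domains: labels such as "social", "brain", "road"
# X = np.array([metric_vector(G) for G in graphs])
# clf = RandomForestClassifier(n_estimators=300).fit(X, domains)
```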
In this paper, we treat a nonlinear and unbalanced $2$-color urn scheme, subject to two different nonlinear drawing rules depending on the color withdrawn. We prove a central limit theorem as well as a law of large numbers for the urn composition. We also give estimates of the mean and variance of the number of balls of each type.
In November 2019, an outbreak of Shiga toxin-producing Escherichia coli O157:H7 was detected in South Yorkshire, England. Initial investigations established consumption of milk from a local dairy as a common exposure. A sample of pasteurised milk tested the next day failed the phosphatase test, indicating contamination of the pasteurised milk by unpasteurised (raw) milk. The dairy owner agreed to immediately cease production and initiate a recall. Inspection of the pasteuriser revealed a damaged seal on the flow divert valve. Ultimately, there were 21 confirmed cases linked to the outbreak, of which 11 (52%) were female, and 12/21 (57%) were either <15 or >65 years of age. Twelve (57%) patients were treated in hospital, and three cases developed haemolytic uraemic syndrome. Although the outbreak strain was not detected in the milk samples, it was detected in faecal samples from the cattle on the farm. Outbreaks of gastrointestinal disease caused by milk pasteurisation failures are rare in the UK. However, such outbreaks are a major public health concern as, unlike unpasteurised milk, pasteurised milk is marketed as ‘safe to drink’ and sold to a larger, and more dispersed, population. The rapid, co-ordinated multi-agency investigation initiated in response to this outbreak undoubtedly prevented further cases.
Repeated serosurveys are an important tool for understanding trends in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and vaccination. During 1 September 2020–20 March 2021, the NYC Health Department conducted a population-based SARS-CoV-2 antibody prevalence survey of 2096 NYC adults who either provided a blood specimen or self-reported the results of a previous antibody test. The serosurvey, the second in a series of surveys conducted by the NYC Health Department, aimed to estimate SARS-CoV-2 antibody prevalence across the city and for different groups at higher risk for adverse health outcomes. Weighted citywide prevalence was 23.5% overall (95% confidence interval (CI) 20.1–27.4) and increased from 19.2% (95% CI 14.7–24.6) before coronavirus disease 2019 vaccines were available to 31.3% (95% CI 24.5–39.0) during the early phases of vaccine roll-out. We found no differences in antibody prevalence by age, race/ethnicity, borough, education, marital status, sex, health insurance coverage, self-reported general health or neighbourhood poverty. These results show an overall increase in population-level seropositivity in NYC following the introduction of SARS-CoV-2 vaccines and highlight the importance of repeated serosurveys in understanding the pandemic's progression.
Heavy tails – extreme events or values more common than expected – emerge everywhere: the economy, natural events, and social and information networks are just a few examples. Yet after decades of progress, they are still treated as mysterious, surprising, and even controversial, primarily because the necessary mathematical models and statistical methods are not widely known. This book, for the first time, provides a rigorous introduction to heavy-tailed distributions accessible to anyone who knows elementary probability. It tackles and tames the zoo of terminology for models and properties, demystifying topics such as the generalized central limit theorem and regular variation. It tracks the natural emergence of heavy-tailed distributions from a wide variety of general processes, building intuition. And it reveals the controversy surrounding heavy tails to be the result of flawed statistics, then equips readers to identify and estimate with confidence. Over 100 exercises complete this engaging package.
The Chow–Robbins game is a classical, still partly unsolved, stopping problem introduced by Chow and Robbins in 1965. You repeatedly toss a fair coin, and after each toss you decide whether to take the fraction of heads so far as your payoff or to continue. As a more general stopping problem this reads $V(n,x) = \sup_{\tau}\mathbb{E}\left[\frac{x + S_\tau}{n+\tau}\right]$, where $S$ is a random walk. We give a tight upper bound for $V$ when $S$ has sub-Gaussian increments, by using the analogous continuous-time problem with a standard Brownian motion as the driving process. For the Chow–Robbins game we also give a tight lower bound and use these bounds to calculate, on the integers, the complete continuation and stopping sets of the problem for $n\leq 489\,241$.
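To illustrate how such values can be computed in practice, here is a minimal sketch (an illustration, not the authors' method): finite-horizon backward induction with the crude terminal approximation $V(N,h)\approx\max(h/N, 1/2)$, which is exactly the kind of truncation error that tight upper and lower bounds control.

```python
# Sketch: approximate the Chow-Robbins value V(n, h) (h heads after n tosses)
# by backward induction over a finite horizon N. The terminal layer uses the
# crude approximation V(N, h) ~ max(h / N, 1/2); an illustrative assumption.
import numpy as np

def chow_robbins_value(n_tosses, n_heads, horizon=5000):
    N = horizon
    V = np.maximum(np.arange(N + 1) / N, 0.5)        # layer m = N
    for m in range(N - 1, n_tosses - 1, -1):
        cont = 0.5 * (V[1:m + 2] + V[:m + 1])        # value of one more toss
        V = np.maximum(np.arange(m + 1) / max(m, 1), cont)
    return V[n_heads]

# Example: with 1 head after 2 tosses, the value exceeds the payoff 0.5,
# so the optimal strategy continues tossing.
# print(chow_robbins_value(2, 1))
```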