Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down algorithm to reduce the hierarchically structured risk factor to its essence by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers’ compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies.
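As a rough illustration of the grouping step, the sketch below clusters per-category feature vectors with k-means. The features, cluster count, and data are hypothetical stand-ins, not the paper's actual algorithm or case-study data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per industry category at a given
# hierarchy level; columns play the role of engineered features such as
# observed damage rate, claim frequency, and a text-embedding coordinate.
rng = np.random.default_rng(0)
low_risk = rng.normal(loc=[0.01, 0.05, -1.0], scale=0.01, size=(20, 3))
high_risk = rng.normal(loc=[0.08, 0.30, 1.0], scale=0.01, size=(20, 3))
features = np.vstack([low_risk, high_risk])

# Group similar categories; in practice the number of clusters would be
# chosen data-dependently (e.g. via a validation criterion).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Categories generated from the same risk profile should share a cluster.
print(len(set(labels[:20])), len(set(labels[20:])))
```

The grouped labels would then replace the original high-cardinality category codes in the downstream predictive model.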
Lymph node tuberculosis is particularly common in regions with a high tuberculosis burden, and it carries a high risk of rupture. This study aims to investigate the utility of ultrasound multimodal imaging in predicting the rupture of cervical tuberculous lymphadenitis (CTL). A total of 128 patients with unruptured CTL confirmed by pathology or laboratory tests were included. Various ultrasonic image features, including long-to-short-axis ratio (L/S), margin, internal echotexture, coarse calcification, Color Doppler Flow Imaging (CDFI), perinodal echogenicity, elastography score, and non-enhanced area proportion in contrast-enhanced ultrasound (CEUS), were analyzed to determine their predictive value for CTL rupture within a one-year follow-up period. As a result, L/S (P < 0.001), margin (P < 0.001), internal echotexture (P < 0.001), coarse calcification (P < 0.001), perinodal echogenicity (P < 0.001), and the area of non-enhancement in CEUS (P < 0.001) were identified as significant imaging features for predicting CTL rupture. The prognostic prediction showed a sensitivity of 89.29%, a specificity of 100%, and an accuracy of 95.31%. Imaging findings such as L/S < 2, unclear margin, heterogeneous internal echotexture, changed perinodal echogenicity, and a non-enhancement area in CEUS > 1/2 are indicative of CTL rupture, while coarse calcification in the lymph nodes is associated with a favorable prognosis.
In this paper, we compare the entropy of the original distribution and that of its corresponding compound distribution. Several results are established based on the convex order and the relative log-concave order. The necessary and sufficient condition for a compound distribution to be log-concave is also discussed, covering the compound geometric, compound negative binomial, and compound binomial distributions.
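A toy numerical sketch of the kind of comparison involved: compute the Shannon entropy of a geometric distribution and of a compound geometric built from it via truncated convolutions. The severity distribution here is an assumption for illustration only, not taken from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a pmf given as an array."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Geometric primary distribution N on {0, 1, 2, ...}: P(N=k) = (1-q) q^k.
q, K = 0.5, 60                      # K truncates the support for computation
pN = (1 - q) * q ** np.arange(K)

# Compound geometric sum S = X_1 + ... + X_N with i.i.d. severities;
# here X takes the values 1 or 2 with equal probability (an assumption).
pX = np.zeros(K)
pX[1] = pX[2] = 0.5

# pmf of S via P(S=s) = sum_k P(N=k) P(X_1+...+X_k = s), truncated at K.
conv = np.zeros(K)
conv[0] = 1.0                       # pmf of a sum of zero severities
pS = np.zeros(K)
for k in range(K):
    pS += pN[k] * conv
    conv = np.convolve(conv, pX)[:K]  # add one more severity

print(round(entropy(pN), 3), round(entropy(pS), 3))
```

For the geometric with q = 0.5 the entropy is exactly 2 ln 2 ≈ 1.386 nats, which the truncated computation recovers.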
Introduction of African swine fever (ASF) to China in mid-2018 and the subsequent transboundary spread across Asia devastated regional swine production, affecting live pig and pork product-related markets worldwide. To explore the spatiotemporal spread of ASF in China, we reconstructed possible ASF transmission networks using nearest neighbour, exponential function, equal probability, and spatiotemporal case-distribution algorithms. From these networks, we estimated the reproduction numbers, serial intervals, and transmission distances of the outbreak. The mean serial interval between paired units was around 29 days for all algorithms, while the mean transmission distance ranged from 332 to 456 km. The reproduction numbers for each algorithm peaked during the first two weeks and steadily declined through the end of 2018 before hovering around the epidemic threshold value of 1 with sporadic increases during 2019. These results suggest that (1) swine husbandry practices and production systems that lend themselves to long-range transmission drove ASF spread, and (2) outbreaks went undetected by the surveillance system. Efforts by China and other affected countries to control ASF within their jurisdictions may be aided by the reconstructed spatiotemporal model. Continued support for strict implementation of biosecurity standards and improvements to ASF surveillance is essential for halting transmission in China and spread across Asia.
Statistics Using Stata uses a highly accessible and lively writing style to seamlessly integrate the learning of the latest version of Stata (17) with an introduction to applied statistics using real data in the behavioral, social, and health sciences. The text is comprehensive in its content coverage and is suitable at undergraduate and graduate levels. It requires knowledge of basic algebra, but no prior coding experience. It is uniquely focused on the importance of data management as an underlying and key principle of data analysis. It includes a .do-file for each chapter that was used to generate all figures, tables, and analyses for that chapter. These files are intended as models to be adapted and used by readers in conducting their own research. Additional teaching and learning aids include solutions to all end-of-chapter exercises and PowerPoint slides to highlight the important takeaways of each chapter.
Social network analysis is known to provide a wealth of insights relevant to many aspects of policymaking. Yet, the social data needed to construct social networks are not always available. Furthermore, even when they are, interpreting such networks often relies on extraneous knowledge. Here, we propose an approach to infer social networks directly from the texts produced by actors and the terminological similarities that these texts exhibit. This approach relies on fitting a topic model to the texts produced by these actors and measuring topic profile correlations between actors. This reveals what can be called “hidden communities of interest,” that is, groups of actors sharing similar semantic contents but whose social relationships with one another may be unknown or underlying. Network interpretation follows from the topic model. Diachronic perspectives can also be built by modeling the networks over different time periods and mapping genealogical relationships between communities. As a case study, the approach is deployed over a working corpus of academic articles (domain of philosophy of science; N=16,917).
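A minimal sketch of the pipeline described above, assuming a toy corpus and scikit-learn's LDA as a stand-in for whatever topic model the authors fit; the author names and documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpora for three hypothetical authors; the first two write on
# similar themes, the third on a different one.
docs = {
    "author_a": ["induction confirmation evidence theory",
                 "evidence theory confirmation bayesian"],
    "author_b": ["confirmation bayesian evidence induction",
                 "theory evidence bayesian induction"],
    "author_c": ["measurement quantum experiment apparatus",
                 "apparatus experiment quantum measurement"],
}
texts = [t for ts in docs.values() for t in ts]
X = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
theta = lda.transform(X)            # per-document topic distributions

# Topic profile of an actor = mean topic distribution over their documents.
profiles = np.array([theta[i:i + 2].mean(axis=0) for i in range(0, 6, 2)])

# Pairwise correlations between topic profiles: strongly correlated actors
# form a "hidden community of interest" even without observed social ties.
corr = np.corrcoef(profiles)
print(np.round(corr, 2))
```

Thresholding the correlation matrix would then yield the edges of the inferred network, and repeating the procedure per time period supports the diachronic analysis mentioned above.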
It is common to assume in empirical research that observables and unobservables are additively separable, especially when the former are endogenous. This is because it is widely recognized that identification and estimation challenges arise when interactions between the two are allowed for. Starting from a nonseparable IV model, where the instrumental variable is independent of unobservables, we develop a novel nonparametric test of separability of unobservables. The large-sample distribution of the test statistics is nonstandard and relies on a Donsker-type central limit theorem for the empirical distribution of nonparametric IV residuals, which may be of independent interest. Using a dataset drawn from the 2015 U.S. Consumer Expenditure Survey, we find that the test rejects the separability in Engel curves for some commodities.
We consider De Finetti’s control problem for absolutely continuous strategies with control rates bounded by a concave function and prove that a generalized mean-reverting strategy is optimal in a Brownian model. In order to solve this problem, we need to deal with a nonlinear Ornstein–Uhlenbeck process. Despite the level of generality of the bound imposed on the rate, an explicit expression for the value function is obtained up to the evaluation of two functions. This optimal control problem has, as special cases, those solved in Jeanblanc-Picqué and Shiryaev (1995) and Renaud and Simard (2021) when the control rate is bounded by a constant and a linear function, respectively.
Current research on data in policy has primarily focused on street-level bureaucrats, neglecting the changes in the work of policy advisors. This research fills this gap by presenting an explorative theoretical understanding of the integration of data, local knowledge and professional expertise in the work of policy advisors. The theoretical perspective we develop builds upon Vickers’s (1995, The Art of Judgment: A Study of Policy Making, Centenary Edition, SAGE) judgments in policymaking. Empirically, we present a case study of a Dutch law enforcement network for preventing and reducing organized crime. Based on interviews, observations, and documents collected in a 13-month ethnographic fieldwork period, we study how policy advisors within this network make their judgments. In contrast with the idea of data as a rationalizing force, our study reveals that how data sources are selected and analyzed for judgments is very much shaped by the existing local and expert knowledge of policy advisors. The weight given to data is highly situational: we found that policy advisors welcome data in scoping the policy issue, but for judgments more closely connected to actual policy interventions, data are given limited value.
Deep reinforcement learning (DRL) is promising for solving control problems in fluid mechanics, but it is a new field with many open questions. Possibilities are numerous and guidelines are rare concerning the choice of algorithms or best formulations for a given problem. Moreover, DRL algorithms learn a control policy by collecting samples from an environment, which may be very costly when used with Computational Fluid Dynamics (CFD) solvers. Algorithms must therefore minimize the number of samples required for learning (sample efficiency) and generate a usable policy from each training (reliability). This paper aims to (a) evaluate three existing algorithms (DDPG, TD3, and SAC) on a fluid mechanics problem with respect to reliability and sample efficiency across a range of training configurations, (b) establish a fluid mechanics benchmark of increasing data collection cost, and (c) provide practical guidelines and insights for the fluid dynamics practitioner. The benchmark consists of controlling an airfoil to reach a target. The problem is solved with either a low-cost low-order model or with a high-fidelity CFD approach. The study found that DDPG and TD3 have learning stability issues highly dependent on DRL hyperparameters and reward formulation, therefore requiring significant tuning. In contrast, SAC is shown to be both reliable and sample efficient across a wide range of parameter setups, making it well suited to solve fluid mechanics problems and set up new cases without tremendous effort. In particular, SAC is resistant to small replay buffers, which could be critical if full-flow fields were to be stored.
We collected infant food samples from 714 households in Kisumu, Kenya, and estimated the prevalence and concentration of Enterococcus, an indicator of food hygiene conditions. In a subset of 212 households, we quantified the change in concentration in stored food between a morning and afternoon feeding time. In addition, household socioeconomic characteristics and hygiene practices of the caregivers were documented. The prevalence of Enterococcus in infant foods was 50% (95% confidence interval: 46.1–53.4), and the mean log10 colony-forming units (CFU) was 1.1 (SD 1.4). No risk factors were significantly associated with the prevalence and concentration of Enterococcus in infant foods. The mean log10 CFU of Enterococcus concentration was 0.47 in morning foods and 0.73 in afternoon foods, with a 0.64 log10 mean increase in matched samples during storage. Although no factors were statistically associated with the prevalence and the concentration of Enterococcus in infant foods, household flooring type was significantly associated with an increase in concentration during storage, with finished floors leading to 1.5 times higher odds of concentration increase compared to unfinished floors. Our study revealed a high prevalence but low concentration of Enterococcus in infant food in low-income Kisumu households, although concentrations increased during storage, implying potential increases in the risk of exposure to foodborne pathogens over a day. Further studies aimed at investigating contamination of infant foods with pathogenic organisms and identifying effective mitigation measures are required to ensure infant food safety.
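To make the log10 figures above concrete, the reported mean increase of 0.64 log10 CFU in matched samples corresponds to a multiplicative change in concentration; a quick check using the values from the abstract:

```python
# Values taken from the abstract (log10 CFU of Enterococcus).
morning, afternoon = 0.47, 0.73   # mean log10 CFU, morning vs. afternoon foods
matched_increase = 0.64           # mean log10 increase in matched samples

# A log10 increase is multiplicative: 10**0.64 is roughly a 4.4-fold rise
# in concentration during storage.
fold_change = 10 ** matched_increase
print(round(fold_change, 1))
```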
This study assessed the efficacy of ThinPrep cytologic test and human papillomavirus (HPV) co-test in cervical cancer screening during pregnancy. A cohort of 8,712 pregnant women from Ren Ji Hospital participated in the study. Among them, 601 (6.90%) tested positive for high-risk HPV (HR-HPV) and 38 (0.44%) exhibited abnormal cytology results (ASCUS+). Following positive HR-HPV findings, 423 patients underwent colposcopy, and 114 individuals suspected of having high-grade squamous intraepithelial lesion and cervical cancer (HSIL+) underwent cervical biopsy. Histological examination revealed 60 cases of normal pathology (52.63%), 35 cases of low‐grade squamous intraepithelial lesion (30.70%), 17 cases of HSIL (14.91%), and 2 cases of cervical cancer (1.75%). The incidence of HSIL+ in HPV 16/18 group was significantly higher than that in non-HPV16/18 group (10.53% vs. 6.14%, P < 0.05). Subsequent evaluation of the clinical performance of cytology alone, primary HPV screening, and co-testing for HSIL+ detection revealed that the HSIL+ detection rate was lowest with cytology alone. These findings suggest that HPV testing, either alone or combined with cytology, presents an efficient screening strategy for pregnant women, underscoring the potential for improved sensitivity in cervical cancer screening during pregnancy. The significantly higher incidence of HSIL+ in the HPV16/18 group emphasizes the importance of genotype-specific considerations.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in 2019 in China and rapidly spread worldwide, leading to a pandemic. The threat of SARS-CoV-2 is subsiding as most people have acquired sufficient antibodies through vaccination and/or infection to prevent severe COVID-19. After the emergence of the omicron variants, the seroprevalence of antibodies against the N protein elicited by SARS-CoV-2 infection ranged from 44.4% to 80.2% in countries other than Japan. Here, we assessed the seroprevalence in Japan before and after the appearance of omicron variants. Serosurveillance of antibodies against N was conducted between December 2021 and March 2023 in Japan. In total, 7604 and 3354 residual serum or plasma samples were collected in the Tokyo metropolitan area and Sapporo, respectively. We found that the seroprevalence in representative regions of Japan increased from approximately 3% to 23% after the emergence of the omicron variants. We also found higher seroprevalence among the young compared with the elderly. Our findings indicate that, unlike in other countries, most of the Japanese population has not been infected, raising the possibility of future SARS-CoV-2 epidemics in Japan.
We show that stationary time series can be uniformly approximated over all finite time intervals by mixing, non-ergodic, non-mean-ergodic, and periodic processes, and by codings of aperiodic processes. A corollary is that the ergodic hypothesis—that time averages will converge to their statistical counterparts—and several adjacent hypotheses are not testable in the non-parametric case. Further Baire category implications are also explored.
Cash transfer programs are the most common anti-poverty tool in low- and middle-income countries, reaching more than one billion people globally. Benefits are typically targeted using prediction models. In this paper, we develop an extended targeting assessment framework for proxy means testing that accounts for societal sensitivity to targeting errors. Using a social welfare framework, we weight targeting errors based on their position in the welfare distribution and adjust for different levels of societal inequality aversion. While this approach provides a more comprehensive assessment of targeting performance, our two case studies show that bias in the data, particularly in the form of label bias and unstable proxy means testing weights, leads to a substantial underestimation of welfare losses, disadvantaging some groups more than others.
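One way to read the welfare-weighting idea is sketched below: exclusion errors are weighted by an Atkinson-style function of welfare position, with a parameter for societal inequality aversion. The data, weight function, and proxy rule are hypothetical illustrations, not the paper's exact framework.

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # welfare proxy
cutoff = np.quantile(income, 0.2)

poor = income < cutoff                                   # truly eligible
proxy = income * np.exp(rng.normal(0, 0.3, size=1000))   # noisy proxy-means score
predicted = proxy < cutoff                               # targeted by the model

excluded = poor & ~predicted                             # exclusion errors
losses = []
for eps in (0.0, 1.0, 2.0):                              # inequality aversion
    w = income ** (-eps)                                 # poorer -> larger weight
    # share of welfare-weighted eligible mass lost to exclusion errors
    losses.append(float(w[excluded].sum() / w[poor].sum()))
print([round(l, 3) for l in losses])
```

At eps = 0 this reduces to the unweighted exclusion-error rate; larger eps places more weight on errors affecting the poorest, which is the sensitivity the extended assessment framework is designed to capture.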
Lymphocytic choriomeningitis virus (LCMV) is one of the arenaviruses infecting humans. LCMV infections have been reported worldwide in humans with varying levels of severity. To detect arenavirus RNA and LCMV-reactive antibodies in different geographical regions of Finland, we screened human serum and cerebrospinal fluid (CSF) samples, taken from suspected tick-borne encephalitis (TBE) cases, using reverse transcriptase polymerase chain reaction (RT-PCR) and immunofluorescence assay (IFA). No arenavirus nucleic acids were detected, and the overall LCMV seroprevalence was 4.5%. No seroconversions were detected in paired serum samples. The highest seroprevalence (5.2%) was detected among individuals of age group III (40–59 years), followed by age group I (under-20-year-olds, 4.9%), while the lowest seroprevalence (3.8%) was found in age group IV (60 years or older). A lower LCMV seroprevalence in older age groups may suggest waning of immunity over time. The observation of a higher seroprevalence in the younger age group and the decreasing population size of the main reservoir host, the house mouse, may suggest exposure to another LCMV-like virus in Finland.
We consider inference for possibly misspecified GMM models based on possibly nonsmooth moment conditions. While it is well known that misspecified GMM estimators with smooth moments remain $\sqrt {n}$ consistent and asymptotically normal, globally misspecified nonsmooth GMM estimators are $n^{1/3}$ consistent when either the weighting matrix is fixed or when the weighting matrix is estimated at the $n^{1/3}$ rate or faster. Because the estimator’s nonstandard asymptotic distribution cannot be consistently estimated using the standard bootstrap, we propose an alternative rate-adaptive bootstrap procedure that consistently estimates the asymptotic distribution regardless of whether the GMM estimator is smooth or nonsmooth, correctly or incorrectly specified. Monte Carlo simulations for the smooth and nonsmooth cases confirm that our rate-adaptive bootstrap confidence intervals exhibit empirical coverage close to the nominal level.
To investigate the symptoms of SARS-CoV-2 infection, their dynamics, and their discriminatory power for the disease, we analysed longitudinally, prospectively collected information reported at the time of symptom occurrence, using data from a large phase 3 UK COVID-19 vaccine clinical trial. The alpha variant was the predominant strain. Participants were assessed for SARS-CoV-2 infection via nasal/throat PCR at recruitment, at vaccination appointments, and when symptomatic. Statistical techniques were implemented to infer estimates representative of the UK population, accounting for multiple symptomatic episodes associated with one individual. An optimal diagnostic model for SARS-CoV-2 infection was derived. The 4-month prevalence of SARS-CoV-2 was 2.1%, increasing to 19.4% (16.0%–22.7%) in participants reporting loss of appetite and 31.9% (27.1%–36.8%) in those with anosmia/ageusia. The model identified anosmia and/or ageusia, fever, congestion, and cough as significantly associated with SARS-CoV-2 infection. Symptom dynamics differed markedly between the two groups: in PCR-positive participants, symptoms started slowly, peaked later, and lasted longer, whereas in PCR-negative participants they declined consistently, with, on average, fewer than 3 days of symptoms reported. Anosmia/ageusia peaked late in confirmed SARS-CoV-2 infection (day 12), indicating low discriminatory power for early disease diagnosis.
In this paper, we develop methods for statistical inferences in a partially identified nonparametric panel data model with endogeneity and interactive fixed effects. Under some normalization rules, we can concentrate out the large-dimensional parameter vector of factor loadings and specify a set of conditional moment restrictions that involve only the finite-dimensional factor parameters along with the infinite-dimensional nonparametric component. For a conjectured restriction on the parameter, we consider testing the null hypothesis that the restriction is satisfied by at least one element in the identified set and propose a test statistic based on a novel martingale difference divergence measure for the distance between a conditional expectation object and zero. We derive a tight asymptotic distributional upper bound for the resultant test statistic under the null and show that it is divergent at rate N under the global alternative. To obtain the critical values for our test, we propose a version of the multiplier bootstrap and establish its asymptotic validity. Simulations demonstrate the finite sample properties of our inference procedure. We apply our method to study Engel curves for major nondurable expenditures in China by using a panel dataset from the China Family Panel Studies.
We study heterogeneously interacting diffusive particle systems with mean-field-type interaction characterized by an underlying graphon and their finite particle approximations. Under suitable conditions, we obtain exponential concentration estimates over a finite time horizon for both 1- and 2-Wasserstein distances between the empirical measures of the finite particle systems and the averaged law of the graphon system.
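The quantity controlled by the concentration estimates, the Wasserstein distance between an empirical measure and a reference law, is easy to illustrate numerically in one dimension. The example below is a generic toy (i.i.d. Gaussian particles, not the authors' graphon-interacting system), showing the distance shrinking as the particle number grows.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)

# Stand-in for the averaged law of the limiting system: a very large sample.
limit_law = rng.normal(0.0, 1.0, size=100_000)

# 1-Wasserstein distance between the empirical measure of an n-particle
# system and the reference law, for increasing n.
dists = [wasserstein_distance(rng.normal(0.0, 1.0, size=n), limit_law)
         for n in (10, 100, 1000)]
print([round(d, 3) for d in dists])
```

In one dimension the 1-Wasserstein distance reduces to the L1 distance between quantile functions, which is what SciPy computes here; the concentration results in the paper quantify how unlikely large deviations of such distances are over a finite time horizon.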