Women politicians report that social media abuse harms their personal and professional lives. However, prior text-based research finds that men receive more general online hostility than women – except among the most visible politicians. I hypothesize that backlash to perceived gender-role violations – such as public visibility – will include distinctly gendered content, such as slurs and references to appearance. Using a novel and replicable method, I analyze hostile and gendered language in three million social media mentions of US state representatives. I find that hostility towards visible women differs from that towards men in content, not volume. Visible women face similar volumes of generic hostility but twice as much gender-specific abuse as men. This pattern holds across two alternative measures of perceived conformity to traditional gender roles: legislator tone and the presence of women in the chamber. Incorporating gendered content into text-based analyses reconciles discrepancies between observational and self-reported data and validates women politicians’ reports.
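The content-versus-volume distinction above can be illustrated with a minimal lexicon-counting sketch. Both term lists and the example mentions below are hypothetical stand-ins, not the study's actual dictionaries or data:

```python
# Minimal lexicon-counting sketch separating generic hostility from
# gender-specific abuse in social media mentions. Term lists and example
# mentions are illustrative only.
GENERIC_HOSTILE = {"idiot", "corrupt", "liar"}
GENDERED_ABUSE = {"shrill", "hysterical"}  # slurs/appearance terms in practice

def abuse_profile(mentions):
    """Return (generic_count, gendered_count) summed over all mentions."""
    generic = gendered = 0
    for text in mentions:
        tokens = set(text.lower().split())
        generic += len(tokens & GENERIC_HOSTILE)
        gendered += len(tokens & GENDERED_ABUSE)
    return generic, gendered

mentions = ["what a corrupt liar", "so shrill and hysterical", "total idiot"]
print(abuse_profile(mentions))  # → (3, 2)
```

Comparing the two counts per legislator, rather than a single hostility total, is what lets content differences surface even when volumes are similar.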
The rise of health care AI raises concerns over whether patent disclosure supports reproducibility and legal validity. This study analyzes 865 granted medical AI patents (2015–2025) from the US, China, and the EU using a five-dimensional framework (algorithm transparency, training data accessibility, model reproducibility, result verifiability, and mathematical support) implemented through NLP-assisted expert scoring. Results suggest limited technical transparency; approximately 40% of patents score zero in at least two dimensions. Performance varies significantly: algorithm transparency is relatively strong (>60% score 2), while training data accessibility is less prevalent (4.6% score 2) and mathematical support is frequently omitted (39.4% score 0). Statistical testing indicates US patents significantly outperform Chinese patents (p < 0.001), while EU results remain exploratory (N = 31, mean 6.2). These patterns appear associated with institutional factors, strategic applicant behaviour, and technical complexity. Such limitations may pose risks to enforceability and market development, highlighting the need for targeted disclosure improvements. This study contributes a replicable framework for translating legal standards into measurable indicators, providing cross-jurisdictional evidence to guide examination, litigation, and policy refinement in medical AI governance.
Advances in content analysis present significant opportunities for social scientists who develop and analyze concepts. This chapter introduces some basic approaches for formalizing and sharing conceptual frameworks (i.e., sets of terms, classes, properties, etc.) and demonstrates some dividends of such formalization for both scholars and their audiences in the field of comparative law. Specifically, the chapter describes an experiment in systematizing the concepts that represent ideas in national constitutions using a set of methods proposed for modern web design. In general, these machine-friendly approaches to concepts – which may be summarized as “digital semantics” – represent a natural extension of traditional concept analysis, much of which is focused on coordinating vocabulary among scholars. Since “concepts about concepts” can themselves be opaque, a glossary with key terms is appended.
Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful vision language models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets for comprehensively evaluating and further fine-tuning VLMs in construction safety inspection. Current applications of VLMs rely on small, supervised datasets, limiting their applicability to tasks they are not directly trained for. In this article, we propose ConstructionSite-10k, a dataset of 10,000 construction site images with annotations for three interconnected tasks: image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, although additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
This article examines the use of neural networks in electromechanical sound art and music, where sound is materially enacted through physical means such as motors, solenoids, and physical resonators. It begins with a survey of documented works, outlining a range of current strategies and discussing how technical, material, and performative factors influence their design. Identifying natural language processing as underexplored in this domain, the article then develops one such language-based approach through a practice-based case study, Seven Studies for Electric Motors. The project embeds a small language model for real-time sentence generation, extracts syntax structures, and translates these into patterns of motor-driven sound. Taken together, the survey and case study offer a picture of how machine learning has been integrated into electromechanical practices over the past decade and point to possible directions for further work.
Manual submission of clinical trial data to the ClinicalTrials.gov registry is labor-intensive and error-prone, contributing to variability in the completeness and consistency of registry entries. To explore whether recent advances in large language models could support this process, we developed ChatCT, a pilot retrieval-augmented system that drafts ClinicalTrials.gov registry elements.
Methods:
We evaluated ChatCT-generated registry elements across three dimensions: (1) semantic similarity to the public ClinicalTrials.gov record, (2) formatting compliance with ClinicalTrials.gov requirements, and (3) coverage of key trial biomedical concepts.
Results:
ChatCT-generated registry elements were highly semantically similar to human-authored ClinicalTrials.gov records (median BERTScore F1 ≈ 0.82). Formatting compliance was high for structured elements, including Study Design (91% of required fields present; mean completeness 0.897) and Arms/Interventions (75%; 0.772), while narrative sections showed greater variability, including Outcome Measures (79%; 0.929) and Study Description (57%; 0.784). Ontology-based concept extraction and matching demonstrated consistently high precision, with scores ranging from 90% to 100%.
Conclusions:
A retrieval-augmented large language model can generate ClinicalTrials.gov registry drafts that preserve essential protocol details and adhere to most formatting requirements. However, light post-processing (e.g., automated schema validation) remains necessary for full submission readiness. This proof-of-concept evaluation suggests that ChatCT-assisted drafting could support registry reporting by improving consistency between protocol documents and publicly reported trial information.
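The BERTScore-style similarity reported in the Results can be sketched with its core greedy-matching computation: precision is each candidate token's best cosine match against the reference, recall is the reverse, and F1 is their harmonic mean. Real BERTScore uses contextual BERT embeddings; the 2-D vectors below are illustrative stand-ins.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1 from greedy best-match precision and recall."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

cand = [(1.0, 0.0), (0.6, 0.8)]  # toy "candidate" token embeddings
ref = [(1.0, 0.0), (0.0, 1.0)]   # toy "reference" token embeddings
print(round(greedy_f1(cand, ref), 3))  # → 0.9
```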
Comorbid obsessive–compulsive disorder (OCD) or obsessive–compulsive symptoms (OCS) are common in people with severe mental illness (SMI; including schizophrenia, bipolar disorder and schizoaffective disorder), with little known about associations with smoking.
Aims
To estimate the association between OCD/OCS and smoking status among people with SMI, using a large electronic health records database.
Method
Using the Clinical Record Interactive Search (CRIS) platform for data on service users of the South London and Maudsley (SLaM) NHS Foundation Trust, tobacco smoking status was retrospectively detected from the clinical notes of SMI individuals during 2007–2015 using a natural language processing algorithm, and categorised as ‘current smoker’, ‘ex-smoker’ or ‘non-smoker’. A hierarchical assignment rule was applied, prioritising ‘current smoker’, then ‘ex-smoker’, then ‘non-smoker’ within an individual. Univariable and multivariable logistic regression was used to examine the association between smoking and OCS in people with SMI.
Results
We identified 15 479 SMI individuals (56% male; mean age 41 years), of whom 90.4% had ever smoked. Among them, 2320 (15%) had OCS (without OCD), while 2174 (14%) had a clinical diagnosis of comorbid OCD. After adjusting for demographics and functional status as confounders, SMI individuals with OCS only and those with an OCD diagnosis were both significantly more likely to have ever smoked (adj. odds ratio 1.47, 95% CI 1.23, 1.76 and adj. odds ratio 1.33, 95% CI 1.11, 1.60, respectively) compared with those without OCD/OCS.
Conclusions
In this large-scale analysis of people with SMI, we found that individuals with OCS or OCD were more likely to have ever smoked.
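The hierarchical assignment rule described in the Methods can be sketched as follows; the per-note labels are assumed to be the NLP algorithm's outputs, and the function itself is an illustrative reconstruction, not the study's code:

```python
# Sketch of a hierarchical assignment rule: any 'current smoker' mention in
# an individual's notes takes precedence over 'ex-smoker', which in turn
# takes precedence over 'non-smoker'.
PRIORITY = ["current smoker", "ex-smoker", "non-smoker"]

def assign_status(note_labels):
    """Collapse per-note smoking labels into one status per individual."""
    for status in PRIORITY:
        if status in note_labels:
            return status
    return None  # no smoking information recorded

print(assign_status(["non-smoker", "ex-smoker", "non-smoker"]))  # → ex-smoker
```

The rule is deliberately conservative: a single positive smoking mention outweighs any number of later 'non-smoker' entries, which is why ever-smoking prevalence is high under this scheme.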
Emphasizing how and why machine learning algorithms work, this introductory textbook bridges the gap between the theoretical foundations of machine learning and its practical algorithmic and code-level implementation. Over 85 thorough worked examples, in both Matlab and Python, demonstrate how algorithms are implemented and applied whilst illustrating the end result. Over 75 end-of-chapter problems empower students to develop their own code to implement these algorithms, equipping them with hands-on experience. Matlab coding examples demonstrate how a mathematical idea is converted from equations to code, and provide a jumping-off point for students, supported by in-depth coverage of essential mathematics including multivariable calculus, linear algebra, probability and statistics, numerical methods, and optimization. Accompanied online by instructor lecture slides, downloadable Python code and additional appendices, this is an excellent introduction to machine learning for senior undergraduate and graduate students in Engineering and Computer Science.
Technological similarity enables wine operators to share best practices, benchmark against industry standards, and identify new areas of innovation. Despite this, measuring similarity is notoriously challenging. In this paper, I use sentence embeddings on wine patent data to show how similarity compares across different models. I validate the results both internally and externally, showing large discrepancies in annual trends. The results underscore the importance of selecting suitable models for market assessment, providing a valuable primer for both wine operators and technologists.
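The model-dependence of similarity scores described above can be illustrated with a toy sketch: the same pair of patent abstracts receives different similarity values under different embedding schemes. The two "models" here are bag-of-words weightings standing in for real sentence-embedding models, and the abstracts are invented:

```python
import math
from collections import Counter

def embed(text, weights):
    """Toy sparse embedding: token counts scaled by model-specific weights."""
    counts = Counter(text.lower().split())
    return {w: counts[w] * weights.get(w, 1.0) for w in counts}

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

abstract_a = "fermentation vessel with temperature control"
abstract_b = "temperature controlled fermentation tank"
model1 = {}                    # uniform weights
model2 = {"temperature": 3.0}  # a "model" that emphasises one shared term

print(round(cosine(embed(abstract_a, model1), embed(abstract_b, model1)), 2))
print(round(cosine(embed(abstract_a, model2), embed(abstract_b, model2)), 2))
```

Because each model induces its own geometry, similarity rankings (and hence annual trends aggregated from them) can diverge, which is the discrepancy the validation exercise surfaces.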
Language impairments are common in affective and psychotic disorders, yet their patterns and underlying pathomechanisms remain insufficiently understood. A transdiagnostic perspective provides a framework for identifying shared and disorder-specific language alterations across diagnostic boundaries. Combining natural language processing (NLP) with network analysis enables the investigation of complex associations between linguistic, cognitive, and psychopathological features.
Methods
Spontaneous speech from N = 372 participants (119 with major depressive disorder (MDD), 27 with bipolar disorder (BD), 48 with schizophrenia spectrum disorder (SSD) and 178 healthy controls (HC)) was elicited using four Thematic Apperception Test pictures (~12 min per participant). NLP models were applied to extract latent linguistic variables across various levels, including lexical diversity, syntactic complexity, semantic coherence, and disfluencies. Network analysis was used to relate linguistic variables, psychopathology (SAPS, SANS, HAM-A, HAM-D, YMRS, TLI, GAF), and cognitive performance (attention, verbal memory, recognition, and verbal fluency).
Results
Linguistic variables formed the densest network cluster, with type–token ratio, mean length of utterance, and syntactic complexity emerging as central nodes. Psychopathology variables were less cohesive, while TLI “Impoverishment”, coherence mean, and executive functioning bridged linguistic, cognitive, and psychopathological domains. Network comparison tests revealed no significant differences in linguistic–cognitive network structure across HC, MDD, BD, and SSD.
Conclusions
Linguistic networks show high structural consistency across healthy individuals and patients, whereas psychopathological symptom networks reflect transdiagnostic profiles. These findings support a dimensional and transdiagnostic framework, underscore shared language–cognition mechanisms, and highlight executive functioning as a key cross-domain connection, opening up new avenues for dimensional research into the pathophysiological and etiological mechanisms underlying language dysfunction.
Numerous studies show an association between military pressures and fiscal development, often based on cross-national correlations between wars and fiscal outcomes (e.g., tax ratios). However, investments in fiscal capacity may take time to yield higher tax revenues, obscuring the importance of factors that contributed to those investments. This article shifts attention from fiscal outcomes to the policymaking process. Using text-as-data techniques to analyse British parliamentary debates from 1803 to 1913, it offers new micro-level evidence of the relationship between military pressures and fiscal policymaking in the United Kingdom during the long 19th century. Our analyses show that military issues were associated with higher fiscal salience and lower contestation in tax debates. Qualitative analyses indicate that military issues were recurrently invoked to support the renewal of the personal income tax despite attempts to repeal it, confirming the close link between military and fiscal issues in shaping the modern British fiscal state.
What channels can an authoritarian state employ to steer social science research towards topics preferred by the regime? I study the Chinese coauthor network of civil society studies, examining 14,088 researchers and their peer-reviewed journal articles published between 1998 and 2018. Models with individual and time fixed-effects reveal that scholars at the center of the network closely follow the narratives of the state’s policy plans and could serve as effective state agents. However, those academics who connect different intellectual communities tend to pursue novel ideas deviating from the official narratives. Funding is an ineffective direct means for co-opting individual scholars, possibly because it is routed through institutions. Combining these findings, this study proposes a preliminary formulation of an authoritarian knowledge regime consisting of (1) the state’s official narrative, (2) institutionalized state sponsorship, (3) co-opted intellectuals centrally embedded in scholarly networks, and (4) intellectual brokers as sources of novel ideas.
We review our recent ConfliBERT language model (Hu et al. 2022 [ConfliBERT: A Pre-Trained Language Model for Political Conflict and Violence]) to process political and violence-related texts. When fine-tuned, results show that ConfliBERT has superior performance in accuracy, precision, and recall over other large language models (LLMs) like Google’s Gemma 2 (9B), Meta’s Llama 3.1 (8B), and Alibaba’s Qwen 2.5 (14B) within its relevant domains. It is also hundreds of times faster than these more generalist LLMs. These results are illustrated using texts from the BBC, re3d, and the Global Terrorism Database. We demonstrate that open, fine-tuned models can outperform the more general models in terms of accuracy, precision, and recall, and at a fraction of the cost.
Ultra-processed foods (UPF), defined using frameworks such as NOVA, are increasingly linked to adverse health outcomes, driving interest in ways to identify and monitor their consumption. Artificial intelligence (AI) offers potential, yet its application in classifying UPF remains underexamined. To address this gap, we conducted a scoping review mapping how AI has been used, focusing on techniques, input data, classification frameworks, accuracy and application. Studies were eligible if peer-reviewed, published in English (2015–2025), and applied AI approaches to assess or classify UPF using recognised or study-specific frameworks. A systematic search in May 2025 across PubMed, Scopus, Medline and CINAHL identified 954 unique records, with eight ultimately meeting the inclusion criteria; one additional study was added in October following an updated search after peer review. Records were independently screened and extracted by two reviewers. Extracted data covered AI methods, input types, frameworks, outputs, validation and context. Studies used diverse techniques, including random forest classifiers, large language models and rule-based systems, applied across various contexts. Four studies explored practical settings: two assessed consumption or purchasing behaviours, and two developed substitution tools for healthier options. All relied on NOVA or modified versions to categorise processing. Several studies reported predictive accuracy, with F1 scores from 0·86 to 0·98, while another showed alignment between clusters and NOVA categories. Findings highlight the potential of AI tools to improve dietary monitoring and the need for further development of real-time methods and validation to support public health.
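A rule-based classifier of the kind some reviewed studies combine with machine learning can be sketched as a marker-ingredient flag. The marker list below is illustrative only, not an official NOVA specification:

```python
# Hedged sketch of a rule-based, NOVA-style flag: a product is marked as
# ultra-processed (NOVA group 4) if any marker ingredient appears. The
# marker set is a hypothetical example, not a validated list.
UPF_MARKERS = {"maltodextrin", "aspartame", "flavouring", "emulsifier"}

def is_ultra_processed(ingredients):
    """Flag a product as NOVA group 4 if any marker ingredient appears."""
    return any(item.lower() in UPF_MARKERS for item in ingredients)

print(is_ultra_processed(["water", "sugar", "aspartame"]))  # → True
print(is_ultra_processed(["oats", "raisins"]))              # → False
```

Rule-based systems like this are transparent but brittle against ingredient-name variants, which is one motivation for the ML and LLM approaches the review maps.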
Social scientists have quickly adopted large language models (LLMs) for their ability to annotate documents without supervised training, an ability known as zero-shot classification. However, due to their computational demands, cost, and often proprietary nature, these models are frequently at odds with open science standards. This article introduces the Political Domain Enhanced BERT-based Algorithm for Textual Entailment (DEBATE) language models: Foundation models for zero-shot, few-shot, and supervised classification of political documents. As zero-shot classifiers, the models are designed to be used for common, well-defined tasks, such as topic and opinion classification. When used in this context, the DEBATE models are not only as good as state-of-the-art LLMs at zero-shot classification, but are orders of magnitude more efficient and completely open source. We further demonstrate that the models are effective few-shot learners. With a simple random sample of 10–25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models. Additionally, we release the PolNLI dataset used to train these models—a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
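Entailment-based zero-shot classification of the kind the DEBATE models perform can be sketched as follows: each candidate label is turned into a hypothesis, and the label whose hypothesis is most "entailed" by the document wins. The scorer below is a word-overlap stub standing in for a real NLI model, and the hypothesis template is a generic illustration, not DEBATE's actual template:

```python
# Sketch of NLI-style zero-shot classification with a stub entailment scorer.
def stub_entailment_score(premise, hypothesis):
    """Word-overlap stand-in for a real entailment model's score."""
    return len(set(premise.lower().split()) & set(hypothesis.lower().split()))

def zero_shot_classify(document, labels, scorer=stub_entailment_score):
    """Pick the label whose hypothesis is most entailed by the document."""
    hypotheses = {label: f"this text is about {label}" for label in labels}
    return max(labels, key=lambda label: scorer(document, hypotheses[label]))

doc = "The senate debated a new healthcare bill today"
print(zero_shot_classify(doc, ["healthcare", "immigration"]))  # → healthcare
```

Because the label set lives in the hypotheses rather than in a fixed output layer, the same model handles new classification tasks without retraining.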
This article proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, tensor latent Dirichlet allocation, that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3×–4× those of prior parallelized latent Dirichlet allocation methods), and that it scales linearly to text datasets with over a billion documents; and iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two real-world, large-scale new studies of interest to political scientists: we provide the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation and a detailed study of social media conversations about election fraud in the 2020 presidential election. Thus, this method provides social scientists with the ability to study very large corpora at scale and to answer important theoretically relevant questions about salient issues in near real-time.
The traditional case register involved assembling records of people with a given condition in order to support cohort studies to describe and investigate the course of their condition and other outcomes. This old design has been resurrected and revolutionised following the widespread implementation of fully electronic healthcare records over the past few decades, providing ‘big data’ resources that are both large and very detailed. These, in turn, are being further enhanced through linkages with complementary administrative data (both health and non-health) and through natural language processing generating structured meta-data from source text fields. This chapter provides an overview of this rapidly developing research infrastructure, considering and advising on some of the challenges faced by researchers planning studies using clinical data and by those considering future resource development.
The last decade has seen an exponential increase in the development and adoption of language technologies, from personal assistants such as Siri and Alexa, through automatic translation, to chatbots like ChatGPT. Yet questions remain about what we stand to lose or gain when we rely on them in our everyday lives. As a non-native English speaker living in an English-speaking country, Vered Shwartz has experienced both amusing and frustrating moments using language technologies: from relying on inaccurate automatic translation, to failing to activate personal assistants with her foreign accent. English is the world's foremost go-to language for communication, and mastering it past the point of literal translation requires acquiring not only vocabulary and grammar rules, but also figurative language, cultural references, and nonverbal communication. Will language technologies aid us in the quest to master foreign languages and better understand one another, or will they make language learning obsolete?
Researchers across disciplines increasingly use Generative Artificial Intelligence (GenAI) to label text and images or as pseudo-respondents in surveys. But of which populations are GenAI models most representative? We use an image classification task—assessing crowd-sourced street view images of urban neighborhoods in an American city—to compare assessments generated by GenAI models with those from a nationally representative survey and a locally representative survey of city residents. While GenAI responses, on average, correlate strongly with the perceptions of a nationally representative survey sample, the models poorly approximate the perceptions of those actually living in the city. Examining perceptions of neighborhood safety, wealth, and disorder reveals a clear bias in GenAI toward national averages over local perspectives. GenAI is also better at recovering relative distributions of ratings, rather than mimicking absolute human assessments. Our results provide evidence that GenAI performs particularly poorly in reflecting the opinions of hard-to-reach populations. Tailoring prompts to encourage alignment with subgroup perceptions generally does not improve accuracy and can lead to greater divergence from actual subgroup views. These results underscore the limitations of using GenAI to study or inform decisions in local communities but also highlight its potential for approximating “average” responses to certain types of questions. Finally, our study emphasizes the importance of carefully considering the identity and representativeness of human raters or labelers—a principle that applies broadly, whether GenAI tools are used or not.
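The national-versus-local comparison above amounts to correlating model-generated ratings with different survey baselines. A minimal sketch, with entirely made-up ratings (not the study's data):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating vectors."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

genai_ratings = [3.1, 4.0, 2.5, 3.8]   # hypothetical model safety ratings
national_means = [3.0, 4.1, 2.4, 3.9]  # hypothetical national survey means
local_means = [4.0, 2.0, 3.5, 2.8]     # hypothetical resident survey means

print(round(pearson(genai_ratings, national_means), 2))  # strong positive
print(round(pearson(genai_ratings, local_means), 2))     # negative
```

In this toy configuration the model tracks the national baseline closely while diverging from (here, inverting) the local one, mirroring the qualitative pattern the study reports.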
The emergence of large language models has significantly expanded the use of natural language processing (NLP), even as it has heightened exposure to adversarial threats. We present an overview of adversarial NLP with an emphasis on challenges, policy implications, emerging areas, and future directions. First, we review attack methods and evaluate the vulnerabilities of popular NLP models. Then, we review defense strategies that include adversarial training. We describe major policy implications, identify key trends, and suggest future directions, such as the use of Bayesian methods to improve the security and robustness of NLP systems.
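The character of the attacks surveyed above can be illustrated with a deliberately simple example: a one-character perturbation evades a naive keyword-based filter. Real adversarial NLP targets neural models with more sophisticated perturbations, but the failure mode (small input changes flipping the prediction) is analogous. The filter is a toy, not a real system:

```python
# Minimal illustration of an adversarial text perturbation: "FREE" → "FR3E"
# evades a naive keyword-based spam filter while staying readable to humans.
def naive_spam_filter(text):
    return "spam" if "free" in text.lower() else "ham"

original = "Claim your FREE prize now"
adversarial = original.replace("FREE", "FR3E")
print(naive_spam_filter(original), naive_spam_filter(adversarial))  # → spam ham
```

Adversarial training, mentioned among the defenses, counters this by augmenting training data with exactly such perturbed examples.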