Young people are experiencing worsening mental health and a growing reliance on online tools and services to address mental health difficulties. At the same time, next-generation large language models (LLMs) deployed through chatbot-style interfaces, which use deep learning to support conversation akin to interacting with a human, appear to mark an opportunity for mental health therapeutics when designed specifically for clinical intervention. However, emergent evidence suggests that the use of more generic LLM chatbots may pose risks of misinformation, bias, or over-reliance for some individuals when used outside of clinical contexts for mental health. This perspective paper examines the intersection of youth mental health and the rapid adoption of LLM chatbots. It first contextualises rising mental health challenges among young people alongside their increasing reliance on digital solutions. The paper then explores the potential benefits of LLM chatbot-style interfaces in clinical mental health interventions. Following this, we discuss the evidence surrounding adverse mental health outcomes from the use of generic LLMs to support mental health at the population level, describing complex system-level and human-level factors noted in the evidence. Finally, we outline considerations for public health and youth mental health discourse, purpose-built LLM platform design, and a supporting research agenda. While current evidence on benefits and risks from generic LLMs is emergent and not youth-specific, this perspective highlights a need for research focused on young people to ensure safe and effective use of widely available LLMs for mental health support.
Systematic reviews (SRs) are critical for evidence-based research but are time-consuming and labor-intensive. The rapid expansion of academic publications further challenges the performance and applicability of existing screening and classification methods. While large language models (LLMs) present new opportunities for automation, limited research has examined whether they can achieve classification performance comparable to human reviewers in large-scale, multi-class settings. To improve classification performance, we propose an LLM-based framework that leverages full-text key-insight extraction to enhance literature classification. We constructed a manually curated dataset of 900 articles from 17 published SRs to quantitatively evaluate the classification capabilities of LLMs. Empirical results showed that key-insight-based classification (KBC) significantly outperforms abstract-based classification (ABC). We also implemented a confidence-weighted voting (CWV) mechanism using multiple LLMs to improve robustness. The CWV method achieved the highest macro F1-score of 0.796, substantially exceeding KBC (0.732), ABC (0.676), and unsupervised K-means clustering (0.446). By employing zero-shot LLMs, our approach demonstrated the potential for adaptability across diverse domains and classification tasks without requiring fine-tuning, showing that a carefully designed pipeline can enable LLMs to achieve classification performance comparable to human reviewers. These results provide empirical evidence of LLMs' potential in supporting large-scale SRs and introduce a practical pathway for improving efficiency and reliability in evidence synthesis.
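The abstract does not give the exact aggregation rule behind CWV; a minimal sketch of one common form of confidence-weighted voting, where each model's predicted label is weighted by its self-reported confidence, could look like this (the function name, label set, and confidence scale are illustrative assumptions, not the authors' code):

```python
from collections import defaultdict

def confidence_weighted_vote(predictions):
    """Aggregate (label, confidence) pairs, one per model, and return
    the label with the highest total confidence."""
    scores = defaultdict(float)
    for label, confidence in predictions:
        scores[label] += confidence
    return max(scores, key=scores.get)

# Three hypothetical model outputs for one article:
print(confidence_weighted_vote([("include", 0.9), ("exclude", 0.6), ("include", 0.7)]))  # include
```

Under this rule a low-confidence dissenter is naturally outvoted, which is one way such a mechanism could add robustness over a simple majority vote.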
In the digital information age, artificial intelligence is increasingly being applied to national governance and judicial decision-making assistance. Existing studies lack case studies and empirical analyses of the effectiveness of large language models (LLMs) in aiding judicial decisions. To address this research gap, this study designs a comprehensive evaluation framework encompassing five core task dimensions: Task-oriented Information Extraction, Legal Article Citation, Event Extraction, Judicial Decision Generation, and Legal Opinion Generation. By using carefully crafted prompts to activate the legal reasoning capabilities of the models, we conducted extensive testing on 13 mainstream LLMs. The experimental results demonstrate that LLMs perform well in processing legal texts and providing preliminary legal opinions, but still exhibit shortcomings in complex legal reasoning and precise decision-making. On this basis, we applied a weakly supervised learning strategy to fine-tune the LLMs for targeted improvements. The results indicate that introducing a small amount of task-specific learning can significantly enhance the performance of LLMs in judicial tasks. This further underscores the critical role of data and domain-specific knowledge in applying AI technology to judicial tasks. Additionally, this study briefly discusses the boundaries of AI's involvement in judicial activities, aiming to provide theoretical foundations and practical guidance for the deep integration of AI technology with legal practice.
Understanding how political information is transmitted requires tools that can reliably and scalably capture complex signals in text. While existing studies highlight interest groups as strategic information providers, empirical analysis has been constrained by reliance on expert annotation. Using policy documents released by interest groups, this study shows that fine-tuned large language models (LLMs) outperform lightly trained workers, crowdworkers, and zero-shot LLMs in distinguishing two difficult-to-separate categories: informative signals that help improve political decision-making and associative signals that shape preferences but lack substantive relevance. We further demonstrate that the classifier generalizes out of distribution across two applications. Although the empirical setting is domain-specific, the approach offers a scalable method for expert-driven text coding applicable to other areas of political inquiry.
Objective:
To assess the feasibility of using large language models (LLMs) to develop research questions about changes to the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) food packages.
Design:
We conducted a controlled experiment using ChatGPT-4 and its plugin, MixerBox Scholarly, to generate research questions based on a section of the U.S. Department of Agriculture (USDA) summary of the final public comments on the WIC revision. Each week for three weeks, five questions were generated by each LLM under two conditions: with ('fed') or without relevant literature provided. The experiment yielded ninety questions, which were evaluated using the Feasibility, Innovation, Novelty, Ethics and Relevance criteria. t tests and multivariate regression examined differences by feeding status, artificial intelligence model, evaluator and criterion.
Setting:
The United States.
Participants:
Six WIC expert evaluators from academia, government, industry and non-profit sectors.
Results:
Five themes were identified: administrative barriers, nutrition outcomes, participant preferences, economics and other topics. Feeding and non-feeding groups had no significant differences (Coeff. = 0·03, P = 0·52). MixerBox-generated questions received significantly lower scores than ChatGPT (Coeff. = –0·11, P = 0·02). Ethics scores were significantly higher than feasibility scores (Coeff. = 0·65, P < 0·001). Significant differences were found between the evaluators (P < 0·001).
Conclusions:
The LLM applications can assist in developing research questions of acceptable quality related to the WIC food package revisions. Future research is needed to compare the development of research questions between LLMs and human researchers.
Housing affordability is one of the main requirements for sustainable development and society. However, the timely delivery of new homes is often constrained by the need to upgrade and expand essential infrastructure such as water and electricity networks. For water utilities, responses to growth typically involve intensive hydraulic analysis to assess water distribution system (WDS) capacity, identify upgrade needs and evaluate options for system extensions. This process becomes significantly complex and resource-intensive under high-growth conditions, where a greater volume of answers is required more quickly to address a wide range of uncertain future scenarios. This paper presents a concept for integrating generative artificial intelligence (Gen AI) with hydraulic models to form an AI agent that supports WDS design. Specific features of Gen AI used within the hydraulic agent are discussed. A real-life case study demonstrated that the AI agent can analyse land development requests, trigger hydraulic simulations and identify the augmentations needed, significantly reducing manual tasks. This offers a breakthrough strategy for water distribution system design and planning to enable sustainable water infrastructure development.
Informed consent is a cornerstone of ethical research, but the lack of widely accepted standards for the key information (KI) section in informed consent documents (ICDs) creates challenges in institutional review board (IRB) reviews and participant comprehension. This study explored the use of GPT-4o, a large language model (hereafter, AI), to generate standardized KI sections.
Methods:
An AI tool was developed to interpret and generate KI content from ICDs. The evaluation involved a multi-phased process where IRB subject matter experts, principal investigators (PIs), and IRB reviewers assessed the AI output for accuracy, differentiation between standard care and research, appropriate information prioritization, and structural coherence.
Results:
Iterative refinements improved the AI’s accuracy and clarity, with initial assessments highlighting factual errors that decreased over time. Many PIs found the AI-generated sections comparable to their own and expressed a high likelihood of using the tool for future drafts. Blinded evaluations by IRB reviewers highlighted the AI tool’s strengths in describing study benefits and maintaining readability. However, the findings underscore the need for further improvements, particularly in ensuring accurate risk descriptions, to enhance regulatory compliance and IRB reviewer confidence.
Conclusions:
The AI tool shows promise in enhancing the consistency and efficiency of KI section drafting in ICDs. However, it requires ongoing refinement and human oversight to fully comply with regulatory and institutional standards. Collaboration between AI and human experts is essential to maximize benefits while maintaining high ethical and accuracy standards in informed consent processes.
Artificial intelligence (AI) has achieved human-level performance in specialised tasks such as Go, image recognition and protein folding, raising the prospect of an AI singularity – where machines not only match, but surpass human reasoning. Here, we demonstrate a step towards this vision in the context of turbulence modelling. By treating a large language model (LLM), DeepSeek-R1, as an equal partner, we establish a closed-loop, iterative workflow in which the LLM proposes, refines and reasons about near-wall turbulence models under adverse pressure gradients (APGs), system rotation and surface roughness. Through multiple rounds of interaction involving long-chain reasoning and a priori and a posteriori evaluations, the LLM generates models that not only rediscover established strategies, but also synthesise new ones that outperform baseline wall models. Specifically, it recommends incorporating a material derivative to capture history effects in APG flows, modifying the law of the wall to account for system rotation and developing rough-wall models informed by surface statistics. In contrast to conventional data-driven turbulence modelling – often characterised by human-designed, black-box architectures – the models developed here are physically interpretable and grounded in clear reasoning.
Meta-research and evidence synthesis require considerable resources. Large language models (LLMs) have emerged as promising tools to assist in these processes, yet their performance varies across models, limiting their reliability. Taking advantage of the wide availability of small (<10 billion parameters) open-source LLMs, we implemented an agreement-based framework in which a decision is taken only if at least a given number of LLMs produce the same response; the decision is otherwise withheld. This approach was tested on 1020 abstracts of randomized controlled trials in rheumatology, using 2 classic literature review tasks: (1) classifying each intervention as drug or nondrug based on text interpretation and (2) extracting the total number of randomized patients, a task that sometimes required calculations. Re-examining abstracts where at least 4 LLMs disagreed with the human gold standard (dual review with adjudication) allowed us to construct an improved gold standard. Compared to the human gold standard and single large LLMs (>70 billion parameters), our framework demonstrated robust performance: several model combinations (e.g., 3 of 5 models, 4 of 6 models, or 5 of 7 models) achieved accuracies above 95%, exceeding the human gold standard on at least 85% of abstracts. Performance variability across individual models was not an issue, as low-performing models contributed fewer accepted decisions. This agreement-based framework offers a scalable solution that can replace human reviewers for most abstracts, reserving human expertise for more complex cases. Such frameworks could significantly reduce the manual burden in systematic reviews while maintaining high accuracy and reproducibility.
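The agreement rule described above (accept a decision only when at least a given number of LLMs produce the same response, otherwise withhold it for human review) can be sketched in a few lines; the function name and example labels are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def agreement_decision(responses, min_agree):
    """Return the majority response if at least `min_agree` models
    produced it; otherwise withhold the decision (return None)."""
    label, count = Counter(responses).most_common(1)[0]
    return label if count >= min_agree else None

# 4 of 5 models agree with threshold 4 -> decision accepted
print(agreement_decision(["drug", "drug", "nondrug", "drug", "drug"], 4))
# only 3 of 5 agree with threshold 4 -> withheld for human review
print(agreement_decision(["drug", "drug", "nondrug", "drug", "nondrug"], 4))
```

A scheme like this also explains why weak individual models need not hurt overall accuracy: a model that frequently disagrees simply contributes to fewer accepted decisions.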
The emergence of large language models (LLMs) provides an opportunity for AI to operate as a co-ideation partner during creative processes. However, designers currently lack a comprehensive methodology for engaging in co-ideation with LLMs, and there is no established framework that describes the process of co-ideation between a designer and ChatGPT. This research therefore aimed to explore how LLMs can act as co-designers and influence the creative ideation processes of industrial designers, and whether a designer's ideation performance could be improved by employing the proposed framework for co-ideation with a custom GPT. A survey was first conducted to detect how LLMs influenced the creative ideation processes of industrial designers and to understand the problems that designers face when using ChatGPT to ideate. Then, a framework based on mapping content to guide co-ideation between humans and a custom GPT (named Co-Ideator) was proposed. Finally, a design case study followed by a survey and an interview was conducted to evaluate the ideation performance of the custom GPT and framework compared with traditional ideation methods. The effect of the custom GPT on co-ideation was also compared with a condition in which no artificial intelligence (AI) was used. The findings indicated that when users employed co-ideation with the custom GPT, the novelty and quality of their ideation exceeded those achieved with traditional ideation methods.
Word processing during reading is known to be influenced by lexical features, especially word length, frequency, and predictability. This study examined the relative importance of these features in word processing during second language (L2) English reading. We used data from an eye-tracking corpus and applied a machine-learning approach to model word-level eye-tracking measures and identify key predictors. Predictors comprised several lexical features, including length, frequency, and predictability (e.g., surprisal). Additionally, sentence, passage, and reader characteristics were considered for comparison. The analysis found that word length was the most important variable across several eye-tracking measures. However, for certain measures, word frequency and predictability were more important than length, and in some cases, reader characteristics such as proficiency were more significant than lexical features. These findings highlight the complexity of word processing during reading, the shared processes between first language (L1) and L2 reading, and their potential to refine models of eye-movement control.
Large Language Models (LLMs) have advanced the extraction and generation of engineering design (ED) knowledge from textual data. However, assessing their accuracy in ED tasks remains challenging due to the lack of benchmark datasets specifically designed for ED applications. To address this, the study examines how theoretical concepts from Axiomatic Design Theory—such as Functional Requirements, Design Parameters, and their relationship—are expressed in natural language and develops a systematic approach for annotating ED concepts in text. It introduces a novel dataset of 6,000 patent sentences, annotated by domain experts. Annotation performance is assessed using inter-annotator agreement metrics, providing insights into the challenges of identifying ED concepts in text. The findings aim to support designers in better integrating design theories within LLMs for extracting ED knowledge.
Recent advancements in machine learning (ML) offer substantial potential for enhancing product development. However, adoption in companies remains limited due to challenges in framing domain-specific problems as ML tasks and selecting suitable ML algorithms, which require expertise that companies often lack. This study investigates the use of large language models (LLMs) as recommender systems for facilitating ML implementation. Using a dataset derived from peer-reviewed publications, the LLMs were evaluated on their ability to recommend ML algorithms for product development-related problems. The results indicate moderate success, with GPT-4o achieving the highest accuracy by recommending suitable ML algorithms in 61% of cases. Key limitations include inaccurate recommendations and challenges in identifying multiple sub-problems. Future research will explore prompt engineering to improve performance.
A design catalog is a repository of design problems and their solutions, enabling designers to explore and discover applicable solutions for their specific design challenges. Creating such catalogs has depended on human knowledge and implicit judgment, with no systematic approach established. This study aims to develop a systematic method to create a design catalog from patent documents. We utilize a large language model (LLM) to extract problem-solution pairs described in the documents, presenting them as general purpose-means pairs. Subsequently, we create a design catalog by classifying the problems using similarity-based clustering, enhanced by the LLM’s semantic text similarity capabilities. We demonstrate a case study of creating a design catalog for martial arts devices and generating new design concepts based on the catalog to verify the effectiveness of the proposed method.
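The similarity-based clustering step described above can be illustrated with a minimal greedy sketch, assuming some embedding function (for instance, an LLM embedding endpoint) supplies semantic vectors for the extracted problem statements; the threshold, names, and toy data here are assumptions for illustration, not the paper's method:

```python
import math

def cluster_by_similarity(items, embed, threshold=0.8):
    """Greedy similarity-based clustering: each problem statement joins
    the first cluster whose representative vector it resembles closely
    enough, otherwise it starts a new cluster."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    clusters = []  # list of (representative vector, member texts)
    for text in items:
        vec = embed(text)
        for rep, members in clusters:
            if cos(rep, vec) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]

# Toy embeddings standing in for LLM-derived semantic vectors:
toy = {"absorb impact": [1.0, 0.0], "cushion blow": [0.9, 0.1], "reduce weight": [0.0, 1.0]}
print(cluster_by_similarity(list(toy), toy.get))
```

With semantically similar phrasings ("absorb impact", "cushion blow") mapping to nearby vectors, they fall into one catalog entry while "reduce weight" starts another.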
The chapter examines the legal regulation and governance of ‘generative AI,’ ‘foundation AI,’ ‘large language models’ (LLMs), and the ‘general-purpose’ AI models of the AI Act. Attention is drawn to two potential sorcerer’s apprentices, namely, in the spirit of J. W. Goethe’s poem, people who were unable to control a situation they created. The focus is on developers and producers of such technologies, such as LLMs that bring about risks of discrimination and information hazards, malicious uses and environmental harms; furthermore, the analysis dwells on the normative attempt of EU legislators to govern misuses and overuses of LLMs with the AI Act. Scholars, private companies, and organisations have stressed the limits of such normative attempts. In addition to issues of competitiveness and legal certainty, bureaucratic burdens and standards development, the threat is over-frequent revision of the law to tackle advancements in technology. The chapter illustrates this threat with reference to the AI Act since its inception and recommends ways in which the law can address the challenges of technological innovation without requiring continuous amendment.
The utilization of creative design methodologies plays a pivotal role in nurturing innovation within the contemporary competitive market landscape. Although the Theory of Inventive Problem Solving (TRIZ) has been recognized as a potent methodology for engendering innovative concepts, its intricate nature and time-consuming learning and application processes pose significant challenges. Furthermore, TRIZ has faced criticism for its limitations in processing design problems and facilitating designers in knowledge acquisition. Conversely, Environment-Based Design (EBD), a question-driven design methodology, provides robust methods and approaches for formulating design problems and identifying design conflicts. Large Language Models (LLMs) have also demonstrated the ability to streamline the design process and enhance design productivity. This study proposes an iteration of TRIZ integrated with EBD and supported by an LLM. This LLM-based conceptual design model assists designers through the conceptual design process. It begins by using question-asking and answering methods from EBD to gather relevant information. It then follows the EBD methodology to formulate the information into an interaction-dependence network, leading to the identification of the functions and conflicts required by TRIZ. Lastly, TRIZ is used to generate inventive solutions. An evaluation is carried out to measure the effectiveness of the integrated approach. The results indicate that this approach successfully generates questions, processes designers’ responses, produces functional analysis elements, and generates ideas to resolve contradictions.
This article presents a novel conversational artificial intelligence (CAI)-enabled active ideation system as a creative idea generation tool to assist novice product designers in mitigating the initial latency and ideation bottlenecks that are commonly observed. It is a dynamic, interactive, and contextually responsive approach, actively involving a large language model (LLM) from the domain of natural language processing (NLP) in artificial intelligence (AI) to produce multiple statements of potential ideas for different design problems. Integrating such AI models with ideation creates what we refer to as an active ideation scenario, which helps foster continuous dialog-based interaction, context-sensitive conversation, and prolific idea generation. An empirical study was conducted with 30 novice product designers to generate multiple ideas for given problems using traditional methods and the new CAI-based interface. The ideas generated by both methods were qualitatively evaluated by a panel of experts. The findings demonstrated the relative superiority of the proposed tool for generating prolific, meaningful, novel, and diverse ideas. The interface was enhanced by incorporating a prompt-engineered structured dialog style for each ideation stage to make it uniform and more convenient for the product designers. A pilot study was conducted and the resulting responses of such a structured CAI interface were found to be more succinct and aligned toward the subsequent design stage. The article thus established the rich potential of using generative AI (Gen-AI) for the early ill-structured phase of the creative product design process.
Recent studies utilizing AI-driven speech-based Alzheimer’s disease (AD) detection have achieved remarkable success in detecting AD dementia through the analysis of audio and text data. However, detecting AD at the early stage of mild cognitive impairment (MCI) remains a challenging task, due to the lack of sufficient training data and imbalanced diagnostic labels. Motivated by recent advances in Generative AI (GAI) and Large Language Models (LLMs), we propose an LLM-based data generation framework that leverages prior knowledge encoded in LLMs to generate new data samples. Our framework introduces two novel data generation strategies, namely cross-lingual and counterfactual data generation, facilitating out-of-distribution learning over new data samples to reduce biases in MCI label prediction caused by the systematic underrepresentation of MCI subjects in the AD speech dataset. The results demonstrate that our proposed framework improves MCI detection sensitivity and F1-score by up to 38% and 31%, respectively. Furthermore, key speech markers in predicting MCI before and after LLM-based data generation have been identified, enhancing our understanding of how the novel data generation approach contributes to the reduction of MCI label prediction biases and shedding new light on speech-based MCI detection under low data resource constraints. Our proposed methodology offers a generalized data generation framework for improving downstream prediction tasks in cases where limited and/or imbalanced data present significant challenges to AI-driven health decision-making. Future studies can focus on incorporating more datasets and exploiting more acoustic features for speech-based MCI detection.
Biomedical entity normalization is critical to biomedical research because the richness of free-text clinical data, such as progress notes, can often be fully leveraged only after translating words and phrases into structured and coded representations suitable for analysis. Large Language Models (LLMs), in turn, have shown great potential and high performance in a variety of natural language processing (NLP) tasks, but their application for normalization remains understudied.
Methods
We applied both proprietary and open-source LLMs in combination with several rule-based normalization systems commonly used in biomedical research. We used a two-step LLM integration approach: (1) using an LLM to generate alternative phrasings of a source utterance, and (2) using an LLM to prune candidate UMLS concepts, with a variety of prompting methods. We measure results by $F_{\beta }$, with $\beta$ chosen to favor recall over precision, and by F1.
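For reference, the $F_{\beta }$ metric mentioned above can be computed from precision and recall as sketched below; the choice $\beta = 2$ is an illustrative assumption, since the abstract states only that recall is favored over precision:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than
    precision, and beta = 1 reduces to the ordinary F1 score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.5, 1.0, beta=2.0), 3))  # 0.833: recall dominates
print(round(f_beta(0.5, 1.0, beta=1.0), 3))  # 0.667: plain F1
```

The comparison at the end shows why a recall-favoring $\beta$ suits normalization pipelines: a candidate pruner that keeps all correct concepts at some cost in precision still scores well.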
Results
We evaluated a total of 5,523 concept terms and text contexts from a publicly available dataset of human-annotated biomedical abstracts. Incorporating GPT-3.5-turbo increased overall $F_{\beta }$ and F1 in normalization systems +16.5 and +16.2 (OpenAI embeddings), +9.5 and +7.3 (MetaMapLite), +13.9 and +10.9 (QuickUMLS), and +10.5 and +10.3 (BM25), while the open-source Vicuna model achieved +20.2 and +21.7 (OpenAI embeddings), +10.8 and +12.2 (MetaMapLite), +14.7 and +15 (QuickUMLS), and +15.6 and +18.7 (BM25).
Conclusions
Existing general-purpose LLMs, both proprietary and open-source, can be leveraged to greatly improve normalization performance using existing tools, with no fine-tuning.