Large Language Models (LLMs) like OpenAI’s ChatGPT or Google’s Gemini are the new sensation in artificial intelligence (AI) research. These systems exhibit impressive conversational abilities and have even managed to convince some people of their possible sentience. But do LLMs actually speak our language, or do they merely appear as if they do? Do they really reason and think, or are they simply good at superficially imitating these abilities? In this chapter, I argue that Wilfrid Sellars’s functionalist-pragmatist approach to language and concept learning might be especially useful in the context of answering the questions above. In particular, conceiving the process of learning and mastering language as analogous to the process of learning to play a game within a set of normative social practices can shed light on the kind of abilities LLMs possess and what we can expect them to do in the future, including becoming genuine members of our linguistic community rather than mere “stochastic parrots.”
How likely is it that literacy, as we have known it, will be preserved in the years ahead? Or perhaps the question has already shifted from whether the written medium will fade to how soon that disappearance might occur. Generative artificial intelligence and related technology can support the transition toward new forms of literacy that evolve alongside emerging media. Large language models, in particular, may help preserve some of the cognitive and communicative advantages associated with “traditional” book-based language. In this way, technology could shape future media landscapes, keeping the perks of being a bookworm while softening some of the downsides of newer formats.
Web-enabled large language models (LLMs) frequently answer queries without crediting the web pages they consume, creating an “attribution gap” in responsible artificial intelligence (AI) usage—defined as the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: (1) no search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; (2) no citation: Gemini provides no clickable citation source in 92% of answers; (3) high-volume, low-credit: Perplexity’s Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about three relevant websites uncited, whereas GPT-4o’s tiny uncited gap is best explained by its selective log disclosures rather than by better attribution. Citation efficiency—extra citations provided per additional relevant web page visited—varies widely across models, from 0.19 to 0.45 on identical queries, underscoring that retrieval design, not technical limits, shapes ecosystem impact. To advance auditing and monitoring of AI systems, we recommend a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
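To make the two headline metrics concrete, here is a minimal Python sketch assuming hypothetical per-query log records; note that the paper’s citation-efficiency measure is a marginal quantity estimated from the hurdle model, for which the simple aggregate ratio below is only a stand-in.

    from dataclasses import dataclass

    @dataclass
    class QueryLog:
        relevant_pages_read: int   # relevant URLs fetched for this query
        pages_cited: int           # clickable citations in the answer

    def attribution_gap(log: QueryLog) -> int:
        """Relevant URLs read but never cited (floored at zero)."""
        return max(log.relevant_pages_read - log.pages_cited, 0)

    def citation_efficiency(logs: list[QueryLog]) -> float:
        """Aggregate citations per relevant page visited (a crude proxy
        for the paper's marginal citations-per-extra-page measure)."""
        total_read = sum(l.relevant_pages_read for l in logs)
        total_cited = sum(l.pages_cited for l in logs)
        return total_cited / total_read if total_read else 0.0

    # Example: a Sonar-like profile (about 10 relevant pages read, 3-4 cited)
    logs = [QueryLog(10, 3), QueryLog(9, 4), QueryLog(11, 3)]
    print([attribution_gap(l) for l in logs])   # per-query uncited gap
    print(round(citation_efficiency(logs), 2))  # aggregate efficiency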
This paper evaluates the performance of baseline and domain-augmented ChatGPT models for literature-based knowledge support in flood susceptibility mapping (FSM) using machine learning approaches. To assess this, we designed five key questions related to FSM, with benchmark responses derived from our comprehensive review article (Pourzangbar et al., Journal of Flood Risk Management, 18, e70042), which analyzed 100 studies on ML applications in FSM. The same questions were posed (i) to standard ChatGPT-4 and ChatGPT-4o models without additional contextual material, and (ii) to a domain-augmented GPT-4 configuration (Chat-FSM) equipped with retrieval access to the 100 reviewed articles. The comparison highlights that GPT-based models can reasonably reproduce frequently reported machine learning models and conditioning factors from the reviewed literature, but show weaker consistency in feature selection methods, often suggesting less relevant techniques. Among the models, ChatGPT-4o demonstrated the weakest alignment with benchmark data, while Chat-FSM demonstrated the highest agreement with the benchmark dataset across most evaluated questions. In terms of application-level efficiency, GPT models required substantially less time and computational effort compared to manual literature synthesis under the defined experimental setup. While ChatGPT-based systems can support literature-informed exploration in FSM, human expertise remains essential for critical reasoning, methodological design, and application to novel or context-specific scenarios.
Traditional perceptual models are ill-equipped for the high-dimensional data, such as text embeddings, central to modern psychology and AI. We introduce the double machine learning lens model, a framework that utilizes machine learning to handle such data. We applied this model to analyze how a modern AI and human perceivers judge social class from 9,513 aspirational essays written by 11-year-olds in 1969. A systematic comparison of 45 analytical approaches revealed that regularized linear models using dimensionality-reduced language embeddings significantly outperformed traditional dictionary-based methods and more complex non-linear models. Our top model accurately predicted human ($R^{2}_{CV} = 0.61$) and AI ($R^{2}_{CV} = 0.56$) social class perceptions, capturing over 85% of the total accuracy. These results suggest that “unmodeled knowledge” in perception may be an artifact of insufficient measurement tools rather than an unmeasurable intuitive process. We find that both AI and humans use many of the same textual cues (e.g., grammar, occupations, and cultural activities), only a subset of which are valid. Both appear to amplify subtle, real-world patterns into powerful, yet potentially discriminatory heuristics, where a small difference in actual social class creates a large difference in perception.
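The winning recipe reported here (a regularized linear model on dimensionality-reduced embeddings, scored with cross-validated $R^2$) can be sketched in a few lines of Python; the embedding source, dimensions, and toy data below are assumptions for illustration, not the paper’s setup.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 768))   # stand-in essay embeddings
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=500)  # toy ratings

    # PCA for dimensionality reduction, then ridge regression with
    # internally cross-validated regularization strength.
    model = make_pipeline(PCA(n_components=50),
                          RidgeCV(alphas=np.logspace(-2, 3, 20)))
    r2_cv = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"cross-validated R^2: {r2_cv:.2f}")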
Interest in large language models (LLMs) as a tool for meta-analyses and systematic reviews (MA/SRs) is growing. We prospectively developed 515 unique prompts across predefined screening-related categories and tested them with open-access LLMs (Llama, Mistral) against four gold-standard MA/SRs from different medical fields published after the LLMs’ training cut-offs, using a Python-based pipeline. Heterogeneity between prompts was quantified, and the hypothetical workload/cost reduction with top-performing prompts was calculated. Across 12,360 pipeline runs, LLMs versus MA/SRs reached average recall/sensitivity = 83.6 ± 17.0%, precision = 18.5 ± 15.6%, specificity = 36.6 ± 23.7%, F1-score = 27.6 ± 17.2%, and accuracy = 61.1 ± 11.0%. F1-scores were significantly higher when prompts focused on methods (0.78 ± 0.40%), explicitly mentioned MA/SR screening (0.81 ± 0.37%), included the comparison MA/SR’s title (5.64 ± 0.37%) or selection criteria (8.05 ± 0.68%), and with more LLM parameters (70b = 4.48 ± 0.31%, 123b = 7.77 ± 0.31%), but lower when screening abstracts instead of titles (−3.67 ± 0.28%). In LLM-based preselection, top-performing F1-score prompts (recall/sensitivity = 72.2%, specificity = 66.1%, precision = 28.6%) would reduce screening demands by 34.5%–37.5%, saving 8.4–8.8 weeks of work and 17,592–18,552 in costs. Recall/sensitivity increased with less MA/SR information, contrasting with the F1-score results, which highlights a recall/sensitivity–precision/specificity trade-off. F1-score increased with detailed MA/SR information, while recall/sensitivity increased with shorter, zero-shot prompts. We provide the first prospectively assessed prompt-engineering framework for early-stage LLM-based paper screening across medical fields. The publicly available Python pipeline and full prompt list used here support further development of LLM-based evidence synthesis.
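The evaluation arithmetic behind these figures reduces to comparing LLM include/exclude decisions against the gold-standard screening list; a hedged Python sketch follows, with illustrative record IDs and counts that are not the paper’s data.

    def screening_metrics(llm_included: set[str], gold_included: set[str],
                          all_records: set[str]) -> dict[str, float]:
        tp = len(llm_included & gold_included)
        fp = len(llm_included - gold_included)
        fn = len(gold_included - llm_included)
        tn = len(all_records - llm_included - gold_included)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Workload reduction: share of records humans no longer screen
        workload_reduction = 1 - len(llm_included) / len(all_records)
        return {"recall": recall, "precision": precision,
                "specificity": specificity, "f1": f1,
                "workload_reduction": workload_reduction}

    records = {f"paper_{i}" for i in range(100)}
    gold = {f"paper_{i}" for i in range(10)}
    llm = {f"paper_{i}" for i in range(25)}  # over-inclusive preselection
    print(screening_metrics(llm, gold, records))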
This concluding chapter argues that language is a first-order driver of economic behaviour and outlines where the research should go next. It extends the LENS framework beyond one-shot decisions to strategic settings shaped by beliefs, and sketches the co-evolution of language and behaviour. Large language models are proposed as virtual laboratories, while a quantitative utility approach must accommodate multidimensional, non-linear emotions and norms, and expand to visual cues (VENS). The chapter highlights applications – from policy design to norm-sensitive AI – alongside serious risks of manipulation, surveillance, and bias. It closes with a call for transparent, ethically governed models that explain and responsibly influence decisions.
This chapter argues that language matters for economic decisions and that modern large language models (LLMs) can quantify this effect. After outlining the limits of lexicon-based tools, it examines BERT and MoralBERT, showing that generic sentiment scores struggle to predict human behaviour, while adding moral dimensions helps but the results remain imperfect. LLM-based chatbots (e.g., GPT-4) enable context-sensitive sentiment estimates that predict framing effects, particularly in Dictator Games. Building on this, the chapter formalises language-based utility functions that combine payoffs with sentiment or moral polarity and derives testable predictions, of the kind sketched below. Evidence across Dictator, Equity–Efficiency, and Bribery games supports the approach, while highlighting caveats and avenues for refinement.
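As a rough illustration of what such a language-based utility function could look like, consider the following linear sketch; the additive form and the weights are assumptions for exposition, not the chapter’s exact specification.

    % A minimal sketch of a language-based utility function, assuming a
    % linear combination; the chapter's exact functional form may differ.
    \[
      U_i(x, m) \;=\; \pi_i(x) \;+\; \lambda\, S(m) \;+\; \mu\, M(m)
    \]

where $\pi_i(x)$ is player $i$’s material payoff from allocation $x$, $S(m)$ is the sentiment polarity of the framing message $m$, $M(m)$ its moral polarity, and $\lambda, \mu$ are weights to be estimated from behaviour in, for example, Dictator Games.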
This article examines the transformative impact of large language models (LLMs) on online content moderation, revealing a critical gap between platforms’ rule-based policies and their AI-driven enforcement mechanisms. Using Facebook’s hate speech moderation policies and practices as a case study, we identify a paradox: while content policies are increasingly rule-oriented, AI-driven enforcement seems to operate in a standard-like manner. This disconnect creates transparency, consistency, and accountability challenges relating to the delineation of online freedom of expression that are not addressed in the literature and that require attention and mitigation. In this specific context, we introduce the concept of ‘rules by the millions’ to describe how AI systems actually operate by generating vast networks of micro-rules that evade traditional regulatory oversight. This phenomenon disrupts the conventional rules-versus-standards framework used in legal theory, raising urgent questions about the adequacy of current AI governance mechanisms. Indeed, the rapid adoption of LLMs in content moderation has outpaced the human capacity to monitor them, creating a pressing need for adaptive frameworks capable of managing the evolving capacities of AI.
This study investigates the use of large language models (LLMs) to classify question utterances within verbal design protocols according to Eris’ (2004) taxonomy. We evaluate the performance of two proprietary LLMs – OpenAI’s GPT-4.1 and Anthropic’s Claude Sonnet 4.5 – across experiments designed to assess classification accuracy, sensitivity to prompt configuration and in-context learning (ICL), and generalization across datasets and models. Using two human-coded datasets of differing size and quality, we measure alignment between LLM-generated labels and human judgments at both question category and subcategory levels. Results show that both LLMs achieved moderate to strong alignment rates at the category level (up to 85.7% for GPT-4.1 and 82.9% for Claude Sonnet 4.5), with substantially lower alignment at the more granular subcategory level. Performance differences across prompt configurations and ICL conditions were small, indicating robust generalization across datasets and transferability of prompt designs. While these results suggest that LLMs can effectively support scalable question classification, human judgment and oversight remain essential. Future research should explore the development and evaluation of alternative hybrid human–LLM workflows in protocol analysis, as well as the use of smaller or open-source models to address data privacy concerns.
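The category-level versus subcategory-level alignment rates reported here reduce to simple agreement proportions over (category, subcategory) label pairs; a minimal Python sketch, with made-up labels standing in for Eris’s taxonomy:

    # Each item is a (category, subcategory) pair; labels are illustrative.
    human = [("deep_reasoning", "why"), ("low_level", "clarification"),
             ("deep_reasoning", "how"), ("generative", "proposal")]
    llm   = [("deep_reasoning", "why"), ("low_level", "verification"),
             ("deep_reasoning", "how"), ("low_level", "clarification")]

    cat_align = sum(h[0] == l[0] for h, l in zip(human, llm)) / len(human)
    sub_align = sum(h == l for h, l in zip(human, llm)) / len(human)
    print(f"category alignment: {cat_align:.1%}, subcategory: {sub_align:.1%}")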
The development of artificial intelligence and machine learning is leading to a revolution in the way we think about economic decisions. The Economics of Language explores how the use of generative AI and large language models (LLMs) can transform the way we think about economic behaviour. It introduces the LENS framework (Linguistic content triggers Emotions and suggests Norms, which shape Strategy choice) and presents empirical evidence that LLMs can predict human behaviour in economic games more accurately than traditional outcome-based models. It draws on years of research to provide a step-by-step development of the theory, combining accessible examples with formal modelling. Offering a roadmap for future research at the intersection of economics, psychology, and AI, this book equips readers with tools to quantify the role of language in decision-making and redefines how we think about utility, rationality, and human choice.
Large Language Models (LLMs) have the potential to profoundly transform and enrich experimental economic research. We propose a new software framework, “alter_ego”, which makes it easy to design experiments between LLMs and to integrate LLMs into oTree-based experiments with human subjects. Our toolkit is freely available at github.com/mrpg/ego. To illustrate, we run differently framed prisoner’s dilemmas with interacting machines as well as with human-machine interaction. Framing effects in machine-only treatments are strong and similar to those expected from previous human-only experiments, yet less pronounced and qualitatively different if machines interact with human participants.
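The underlying game in these experiments is the standard prisoner’s dilemma; a generic payoff table in Python (illustrative numbers, independent of the alter_ego toolkit, whose own API is not reproduced here):

    # Standard prisoner's dilemma payoffs satisfying T > R > P > S;
    # framing manipulations change only how the game is described in words.
    PAYOFFS = {  # (my_move, their_move) -> my payoff
        ("C", "C"): 3,  # mutual cooperation (R)
        ("C", "D"): 0,  # sucker's payoff (S)
        ("D", "C"): 5,  # temptation (T)
        ("D", "D"): 1,  # mutual defection (P)
    }

    def payoff(me: str, other: str) -> int:
        return PAYOFFS[(me, other)]

    print(payoff("C", "D"))  # 0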
Chapter 5 addresses yet another aspect of word meanings. Back in the mid-twentieth century, the linguist J. R. Firth (1957, p. 11) stated that “you shall know a word by the company it keeps.” More recently, this idea has been supported by distributional semantic models (DSMs), which come from computational linguistics and demonstrate that a word’s meaning can in fact be derived partly from its statistical co-occurrence patterns with other words. For instance, part of the meaning of scissors can be derived from its tendency to be used together with certain other words like sharp, pointy, cut, snip, paper, hair, etc. DSMs are surprisingly good at predicting people’s performance on many (although not all) conceptual tasks, and they are now so sophisticated that they constitute the engines of many chatbots and AI systems. What’s more, by combining DSMs with brain mapping methods, a rapidly growing line of research has been accumulating evidence that the distributionally based properties of word meanings are not only captured by purely verbal representations in the core language network but also enable a “quick and dirty” shortcut to comprehension.
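The core DSM idea can be shown in a toy Python sketch: build co-occurrence vectors from a tiny corpus and compare words by cosine similarity. Real DSMs use vastly larger corpora, weighting schemes, and dimensionality reduction; this only illustrates the principle, and the corpus is made up.

    import numpy as np
    from itertools import combinations

    corpus = [
        "sharp scissors cut paper",
        "scissors snip hair",
        "sharp knife cut bread",
        "soft pillow on bed",
    ]
    vocab = sorted({w for s in corpus for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in corpus:
        for a, b in combinations(s.split(), 2):  # whole sentence as window
            counts[idx[a], idx[b]] += 1
            counts[idx[b], idx[a]] += 1

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    # Words that keep similar company get similar vectors:
    print(cosine(counts[idx["scissors"]], counts[idx["knife"]]))   # high-ish
    print(cosine(counts[idx["scissors"]], counts[idx["pillow"]]))  # near zero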
Chapter 1 introduces basic terminology. Terms such as artificial intelligence, data, algorithm, machine learning, neural networks, deep learning, large language models, generative AI and symbolic AI are presented to develop a sense of what AI is, how it has evolved, and what it does. This chapter also introduces some of the major conceptual disagreements in the field. These disagreements are driven by different ideas about how best to develop AI, as well as by philosophical differences over what intelligence means and whether machines can develop human-like intelligence.
Emphasizing how and why machine learning algorithms work, this introductory textbook bridges the gap between the theoretical foundations of machine learning and its practical algorithmic and code-level implementation. Over 85 thorough worked examples, in both Matlab and Python, demonstrate how algorithms are implemented and applied whilst illustrating the end result. Over 75 end-of-chapter problems empower students to develop their own code to implement these algorithms, equipping them with hands-on experience. Matlab coding examples demonstrate how a mathematical idea is converted from equations to code, and provide a jumping-off point for students, supported by in-depth coverage of essential mathematics including multivariable calculus, linear algebra, probability and statistics, numerical methods, and optimization. Accompanied online by instructor lecture slides, downloadable Python code and additional appendices, this is an excellent introduction to machine learning for senior undergraduate and graduate students in Engineering and Computer Science.
Creative thinking is a crucial step in the design ideation process, where analogical reasoning plays a vital role in expanding the design concept space. The emergence of Generative AI has brought a significant revolution in co-creative systems, with a growing number of studies on Design-by-Analogy support tools. However, there is a lack of studies investigating the creative performance of Large Language Model (LLM)-generated analogical content and benchmarking of language models in creative tasks such as design ideation. Through this study, we aim to (i) investigate the effect of creativity heuristics by leveraging LLMs to generate analogical stimuli for novice designers in ideation tasks and (ii) evaluate and benchmark language models across analogical creative tasks. We developed a support tool based on the proposed conceptual framework and validated it by conducting controlled ideation experiments with 24 undergraduate design students. Groups assisted with the support tool generated higher-rated ideas, thus validating the proposed framework and the effectiveness of analogical reasoning for augmenting creative output with LLMs. Benchmarking of the models revealed significant differences in the creative performance of analogies across various language models, suggesting that future studies should focus on evaluating language models across creative, subjective tasks.
Efforts to curb online hate speech depend on our ability to reliably detect it at scale. Previous studies have highlighted the strong zero-shot classification performance of large language models (LLMs), offering a potential tool to efficiently identify harmful content. Yet for complex and ambivalent tasks like hate speech detection, pre-trained LLMs can be insufficient and carry systemic biases. Domain-specific models fine-tuned for the given task and empirical context could help address these issues, but, as we demonstrate, the quality of data used for fine-tuning decisively matters. In this study, we fine-tuned GPT-4o-mini using a unique corpus of online comments annotated by diverse groups of coders with varying annotation quality: research assistants, activists, two kinds of crowd workers, and citizen scientists. We find that only annotations from those groups of annotators that are better than zero-shot GPT-4o-mini in recognizing hate speech improve the classification performance of the fine-tuned LLM. Specifically, fine-tuning using the highest-quality annotator group – trained research assistants – boosts classification performance by increasing the model’s precision without notably sacrificing the good recall of zero-shot GPT-4o-mini. In contrast, lower-quality annotations do not improve and may even decrease the ability to identify hate speech. By examining tasks reliant on human judgment and context, we offer insights that go beyond hate speech detection.
As people increasingly interact with large language models (LLMs), a critical question emerges: do humans process language differently when communicating with an LLM versus another human? While there is good evidence that people adapt comprehension based on their expectations toward their interlocutor in human–human interaction, human–computer interaction research suggests the adaptation to machines is often suspended until expectation violation occurs. We conducted two event-related potential experiments examining Chinese sentence comprehension, measuring neural responses to semantic and syntactic anomalies attributed to an LLM or a human. Experiment 1 revealed reduced N400 but larger P600 responses to semantic anomalies in LLM-attributed text than in human-attributed text, suggesting participants anticipated semantic errors yet required increased composition/integration efforts. Experiment 2 showed enhanced P600 responses to LLM-attributed than human-attributed syntactic anomalies, reflecting greater reanalysis or integration difficulty in the former than in the latter. Notably, neural responses to LLM-attributed semantic anomalies (but not syntactic anomalies) were further modulated by participants’ beliefs about humanlike knowledge in LLMs, with a larger N400 and a smaller P600 in participants who more strongly believed that LLMs possess humanlike knowledge. These findings provide the first neurocognitive evidence that people develop mental models of LLM capabilities and adapt neural processing accordingly, offering theoretical insights aligned with multidisciplinary frameworks and practical implications for designing effective human–AI communication systems.
The use of Artificial Intelligence (AI) in Health Technology Assessment (HTA) activities presents an opportunity to enhance the efficiency, accuracy, and speed of HTA processes worldwide. However, the adoption of AI tools in HTA comes with diverse challenges and concerns that must be carefully managed to ensure their responsible, ethical, and effective deployment. The 2025 Health Technology Assessment international Global Policy Forum (GPF) informed GPF members about the integration of AI into HTA activities, with a particular focus on the use of Generative AI (GenAI). With the overarching goal of illuminating and inspiring tangible outputs and actionable recommendations, the event brought together a diverse range of interest holders to explore the opportunities and challenges of AI in HTA. This article summarizes the key discussions and themes that informed the GPF outcomes, including trust, human agency, and risk-based approaches, culminating in a proposed set of priority next steps for the HTA community regarding the integration of GenAI. It also highlights the current state of digital transformation within HTA organizations and the life sciences industry, providing insight into where the field stands and where it is heading.
Artificial Intelligence is an area of law where legal frameworks are still in their early stages. The chapter discusses some of the core HCI-related concerns with AI, including deepfakes, bias and discrimination, and concepts within AI and intellectual property, including AI infringement and AI protection.