
The attribution crisis in LLM search results: Estimating ecosystem exploitation

Published online by Cambridge University Press:  28 April 2026

Ilan Strauss*
Affiliation:
Institute for Innovation and Public Purpose (The Bartlett School of Architecture), University College London, UK; DSI/NRF South African Research Chair in Industrial Development (SARChI-ID), University of Johannesburg, South Africa; AI Disclosures Project (Code for Science and Society), USA
Jangho Yang
Affiliation:
University of Waterloo, Canada
Tim O’Reilly
Affiliation:
AI Disclosures Project (Code for Science and Society), USA; O’Reilly Media, Inc., USA
Sruly Rosenblat
Affiliation:
AI Disclosures Project (Code for Science and Society), USA
Isobel Moure
Affiliation:
AI Disclosures Project (Code for Science and Society), USA
Corresponding author: Ilan Strauss; Email: ilanstrauss@gmail.com

Abstract

Web-enabled large language models (LLMs) frequently answer queries without crediting the web pages they consume, creating an “attribution gap” in responsible artificial intelligence (AI) usage—defined as the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: (1) no search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; (2) no citation: Gemini provides no clickable citation source in 92% of answers; (3) high-volume, low-credit: Perplexity’s Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about three relevant websites uncited, whereas GPT-4o’s tiny uncited gap is best explained by its selective log disclosures rather than by better attribution. Citation efficiency—extra citations provided per additional relevant web page visited—varies widely across models, from 0.19 to 0.45 on identical queries, underscoring that retrieval design, not technical limits, shapes ecosystem impact. To advance auditing and monitoring of AI systems, we recommend a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
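The abstract's two headline metrics can be computed directly from per-query search-trace logs: the attribution gap is the count of relevant URLs read but never cited, and citation efficiency is the slope of citations against relevant pages visited. A minimal sketch in Python, assuming a hypothetical log format with `visited` and `cited` URL sets per query (the field names and data are illustrative, not the paper's actual schema):

```python
# Illustrative attribution-gap and citation-efficiency calculations.
# The log schema (per-query "visited"/"cited" URL sets) is hypothetical.

def attribution_gap(visited: set, cited: set) -> int:
    """Relevant URLs read by the model but never cited in its answer."""
    return len(visited - cited)

def citation_efficiency(logs: list) -> float:
    """Ordinary least-squares slope of citations on pages visited:
    extra citations provided per additional relevant page visited."""
    xs = [len(q["visited"]) for q in logs]
    ys = [len(q["cited"]) for q in logs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Toy logs: three queries with made-up URLs.
logs = [
    {"visited": {"a", "b", "c", "d"}, "cited": {"a"}},
    {"visited": {"a", "b"}, "cited": {"a"}},
    {"visited": {"c", "d", "e", "f", "g", "h"}, "cited": {"c", "d", "e"}},
]
gaps = [attribution_gap(q["visited"], q["cited"]) for q in logs]
print(gaps)                          # uncited relevant pages per query
print(citation_efficiency(logs))     # citations gained per extra page visited
```

The paper's estimates are model-based (a negative binomial hurdle model rather than a raw OLS slope), but this raw calculation conveys what the two quantities measure.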

Information

Type
Research Article
Creative Commons licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Table 1. Attribution statistics by model family

Table 2. Expected total attribution gap by model family

Figure 1. Expected attribution gaps, predicted (by model family). Note: Predicted values for the number of citations missing relative to web pages consumed, based on negative binomial hurdle model regression coefficients. Bars show the model’s expected citation gap (websites visited in the logs minus websites cited in the output), estimated at the median conversation length and median website visits, with no interaction effects included. 95% confidence intervals are calculated with the emmeans package in R.

Figure 2. Focal model: citation difference per extra URL visited. Note: Extra citations gained for each additional URL the focal model opens (differences between models). This holds match-up effects constant, isolating the technology effect. The regression coefficient $\beta_{1m}$ is shown, predicting differences in citations between model pairs for a given query. See the preceding equation.

Supplementary material: File

Strauss et al. supplementary material (File, 105.6 KB)