
Evaluating science: A comparison of human and AI reviewers

Published online by Cambridge University Press: 21 November 2024

Anna Shcherbiak
Affiliation:
Institute for Cognition and Behavior, WU Vienna University of Economics and Business, Vienna, Austria
Hooman Habibnia*
Affiliation:
Institute for Cognition and Behavior, WU Vienna University of Economics and Business, Vienna, Austria
Robert Böhm
Affiliation:
Faculty of Psychology, University of Vienna, Vienna, Austria; Department of Psychology and Copenhagen Center for Social Data Science (SODAS), University of Copenhagen, Copenhagen, Denmark
Susann Fiedler
Affiliation:
Institute for Cognition and Behavior, WU Vienna University of Economics and Business, Vienna, Austria
*Corresponding author: Hooman Habibnia; Email: hooman.habibnia@wu.ac.at

Abstract

Scientists have started to explore whether novel artificial intelligence (AI) tools based on large language models, such as GPT-4, could support the scientific peer review process. We sought to understand (i) whether AI reviewers, compared to human reviewers, are able to distinguish fabricated, AI-generated conference abstracts from human-written abstracts reporting on actual research, and (ii) how the quality assessments of the reported research by AI and human reviewers correspond to each other. We conducted a large-scale field experiment at a medium-sized scientific conference, relying on 305 human-written and 20 AI-written abstracts that were reviewed either by AI or by 217 human reviewers. The results show that human reviewers and GPTZero were better at discerning authorship (AI vs. human) than GPT-4. Regarding quality assessments, agreement was rather low within both human–human and human–AI reviewer pairs, but AI reviewers were more closely aligned with human reviewers in classifying the very best abstracts. This suggests that AI could serve as a prescreening tool for scientific abstracts. The results are discussed with regard to the future development and use of AI tools in the scientific peer review process.
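
As a concrete, hypothetical illustration of how agreement within a pair of reviewers can be quantified, the following minimal Python sketch computes a pairwise correlation between two reviewers' quality ratings. The simulated data, rating scale, and variable names are illustrative assumptions, not the study's materials or analysis code.

# Hypothetical sketch: agreement between two reviewers' quality ratings.
# The simulated ratings below are NOT the study's data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
reviewer_1 = rng.integers(1, 11, size=50).astype(float)               # ratings on an assumed 1-10 scale
reviewer_2 = np.clip(reviewer_1 + rng.normal(0, 3.0, size=50), 1, 10)  # noisy second reviewer

# Pearson correlation as one simple index of pairwise agreement
r, p = pearsonr(reviewer_1, reviewer_2)
print(f"Pairwise agreement (Pearson r): {r:.2f}, p = {p:.3f}")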

Information

Type
Empirical Article
Creative Commons licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Society for Judgment and Decision Making and European Association for Decision Making

Figure 1 Perceived likelihood that the research was generated by AI (panel A) and perceived research quality (panel B) for abstracts written by human or AI authors, as evaluated by human or AI reviewers. Bar plots show mean scores in each group; error bars show 95% confidence intervals. Dots represent individual observations, with dot size indicating the number of observations.


Table 1 Regression models predicting the perceived likelihood that an abstract was AI-generated (Models 1 and 2) and the perceived quality of an abstract (Models 3 and 4), using human and ChatGPT ratings


Table 2 Regression models predicting the perceived likelihood that an abstract was AI-generated, using human and GPTZero ratings


Figure 2 Bland–Altman plots of agreement in research quality ratings between AI and the mean rating of human and AI reviewers (A), between AI and the mean rating of several human reviewers (B), and between two randomly paired human reviewers (C).
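
For readers unfamiliar with the technique, a Bland–Altman plot assesses agreement between two raters by plotting, for each rated item, the pair's mean against the pair's difference, together with the mean difference (bias) and 95% limits of agreement. The following minimal Python sketch shows how such a plot can be constructed; the simulated ratings and variable names are illustrative assumptions, not the authors' data or code.

# Minimal Bland–Altman sketch (illustrative; not the authors' analysis code).
# Assumes two arrays of paired quality ratings, e.g., AI vs. mean human ratings.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ratings_a = rng.uniform(1, 10, size=100)               # hypothetical AI reviewer ratings
ratings_b = ratings_a + rng.normal(0, 1.5, size=100)   # hypothetical mean human ratings

means = (ratings_a + ratings_b) / 2   # x-axis: mean of each rating pair
diffs = ratings_a - ratings_b         # y-axis: difference within each pair
bias = diffs.mean()                   # mean difference (systematic bias)
loa = 1.96 * diffs.std(ddof=1)        # half-width of the 95% limits of agreement

plt.scatter(means, diffs, alpha=0.5)
plt.axhline(bias, color="black", label="mean difference")
plt.axhline(bias + loa, color="gray", linestyle="--", label="95% limits of agreement")
plt.axhline(bias - loa, color="gray", linestyle="--")
plt.xlabel("Mean of paired ratings")
plt.ylabel("Difference between ratings (A - B)")
plt.legend()
plt.show()

Points falling largely within the dashed limits, with a bias near zero, would indicate good agreement; wide limits, as in low human–human or human–AI agreement, indicate that the two raters often diverge on individual abstracts.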