
How to do human evaluation: A brief introduction to user studies in NLP

Published online by Cambridge University Press: 06 February 2023

Hendrik Schuff*
Affiliation:
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany; Bosch Center for Artificial Intelligence, Renningen, Germany
Lindsey Vanderlyn
Affiliation:
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany
Heike Adel
Affiliation:
Bosch Center for Artificial Intelligence, Renningen, Germany
Ngoc Thang Vu
Affiliation:
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany
*Corresponding author. E-mail: Hendrik.Schuff@de.bosch.com

Abstract

Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard for researchers without prior experience in human evaluation to get started. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and (iii) provide starting points to select questionnaires, experimental designs, and evaluation methods tailored to specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code to bridge the gap between theoretical guidelines and practical applications.

Information

Type
Survey Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Figure 1. Relying on automatic evaluation alone (e.g., via accuracy, F1, or BLEU scores) can be misleading, as good performance with respect to scores does not imply good performance with respect to human evaluation.

Figure 2. Normalized frequencies of “human evaluation” and “Likert” (as in the Likert scale questionnaire type) in the ACL Anthology from 2005 to 2020, showing the growing attention to human evaluation.

Figure 3. A subset of Likert items from the trust in automation scale by Körber (2018).

Figure 4. A flow chart to help find an appropriate test to analyze collected responses. Starting from the middle, the chart shows tests suited to analyzing experiments with two levels of an independent variable (e.g., systems A and B) on the left and tests suited to analyzing experiments with more than two levels (e.g., systems A, B, and C) on the right. A paired test needs to be used if, for example, a within-subject design is used, and the level of measurement determines whether a parametric test can be used. For example, yes/no ratings are nominal/dichotomous by definition and cannot be analyzed using a t-test. *The pairwise differences have to be on an ordinal scale; see Colquhoun (1971) for more details.
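
To make the flow chart concrete, the following minimal Python sketch (not the paper's accompanying code; the ratings and variable names are invented for illustration) applies its logic to a two-level, within-subject comparison with ordinal Likert data, where the chart points to the non-parametric Wilcoxon signed-rank test rather than a paired t-test:

# Minimal sketch (not the paper's accompanying code): applying the
# Figure 4 flow chart to a two-level, within-subject comparison.
# The ratings below are hypothetical.
from scipy import stats

# Hypothetical 7-point Likert ratings: the same eight participants
# rated system A and system B, so the samples are paired.
ratings_a = [5, 6, 4, 7, 5, 6, 5, 4]
ratings_b = [3, 5, 2, 5, 4, 4, 3, 2]

# Likert responses are ordinal rather than interval, so the chart leads
# to a non-parametric paired test (Wilcoxon signed-rank) instead of a
# paired t-test, which assumes interval-level measurements.
statistic, p_value = stats.wilcoxon(ratings_a, ratings_b)
print(f"Wilcoxon signed-rank: W={statistic}, p={p_value:.4f}")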
