
An experimental study measuring human annotator categorization agreement on commonsense sentences

Subject: Computer Science

Published online by Cambridge University Press:  18 June 2021

Henrique Santos*
Affiliation:
Tetherless World Constellation, Rensselaer Polytechnic Institute, NY, 12180, United States
Mayank Kejriwal
Affiliation:
Information Sciences Institute, University of Southern California, CA, 90292, United States
Alice M. Mulvehill
Affiliation:
Tetherless World Constellation, Rensselaer Polytechnic Institute, NY, 12180, United States
Gretchen Forbush
Affiliation:
Tetherless World Constellation, Rensselaer Polytechnic Institute, NY, 12180, United States
Deborah L. McGuinness
Affiliation:
Tetherless World Constellation, Rensselaer Polytechnic Institute, NY, 12180, United States
* Corresponding author. E-mail: oliveh@rpi.edu

Abstract

Developing agents capable of commonsense reasoning is an important goal in Artificial Intelligence (AI) research. Because commonsense is broadly defined, a computational theory that can formally categorize the various kinds of commonsense knowledge is critical for enabling fundamental research in this area. In a recent book, Gordon and Hobbs described such a categorization, argued to be reasonably complete. However, the theory’s reliability has not been independently evaluated through human annotator judgments. This paper describes such an experimental study, whereby annotations were elicited across a subset of eight foundational categories proposed in the original Gordon-Hobbs theory. We avoid bias by eliciting annotations on 200 sentences from a commonsense benchmark dataset independently developed by an external organization. The results show that, while humans agree on relatively concrete categories like time and space, they disagree on more abstract concepts. The implications of these findings are briefly discussed.

Information

Type: Research Article
Result type: Novel result
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Table 1. Balanced accuracy scores and p-value levels for each annotator pair for the Physical Entities (P.E.), Classes and Instances (C.I.), and Sets categories. A, G, H, M and R designate the five annotators.


Table 2. Balanced accuracy scores and p-value levels for each annotator pair for the World States (W.S.), and Values and Quantities (V.Q.) categories. A, G, H, M and R designate the five annotators.


Table 3. Balanced accuracy scores and p-value levels for each annotator pair for the Time (Ti.), Space (Sp.), and Events (Ev.) categories. A, G, H, M and R designate the five annotators.
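Tables 1-3 report pairwise balanced accuracy scores with p-value levels for each annotator pair. For reference only, the snippet below is a minimal sketch, assuming binary per-category labels and a simple permutation test, of how such a pairwise score and p-value could be computed; it is not the authors' analysis code, and the variable names and the permutation procedure are illustrative assumptions.

```python
# Illustrative sketch only: pairwise balanced accuracy and a permutation
# p-value for two annotators' binary category labels. Not the authors'
# actual analysis code; names and procedure are assumptions.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def pairwise_agreement(labels_a, labels_b, n_permutations=10_000, seed=0):
    """Balanced accuracy of annotator B's labels against annotator A's,
    with a one-sided permutation p-value against chance-level agreement."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    observed = balanced_accuracy_score(labels_a, labels_b)

    rng = np.random.default_rng(seed)
    null_scores = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(labels_b)  # break any real alignment
        null_scores[i] = balanced_accuracy_score(labels_a, shuffled)

    # Probability of chance agreement at least as high as the observed score
    p_value = (np.sum(null_scores >= observed) + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical labels: whether each of 200 sentences involves a given category
a = np.random.default_rng(1).integers(0, 2, size=200)
b = np.where(np.random.default_rng(2).random(200) < 0.8, a, 1 - a)
score, p = pairwise_agreement(a, b)
print(f"balanced accuracy = {score:.3f}, p = {p:.4f}")
```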

Supplementary material

Santos et al. supplementary material (PDF, 35.5 KB)
Reviewing editor: Adín Ramírez Rivera, UNICAMP, Institute of Computing, Av. Albert Einstein 1251, Campinas, São Paulo, Brazil, 13083-872
This article has been accepted because it is deemed to be scientifically sound, has the correct controls, has appropriate methodology, and is statistically valid; it was sent for additional statistical evaluation and met the required revisions.

Review 1: An experimental study measuring human annotator categorization agreement on commonsense sentences

Conflict of interest statement

Reviewer declares none

Comments

Comments to the Author: The paper presents empirical results showing agreement of human annotations for 8 of the 48 representational areas of commonsense concepts proposed by Gordon and Hobbs. This line of work is important for the field of artificial intelligence because manually created categories may fit what people treat as “common”, but sometimes the proposed categorization is not as universal as its creators assume. However, there are several issues with the research and its presentation:

[evaluation depth] Commonsense evaluation is problematic (see, e.g., Clinciu et al.’s recent paper “It’s Common Sense, isn’t it? Demystifying Human Evaluations in Commonsense-enhanced NLG systems”); references to similar experiments may be missing.

[evaluation method] There is no explanation of why the classic Kendall’s τ, Kolmogorov-Smirnov D, or Cohen’s kappa (free-marginal kappa?) could not be used (an illustrative sketch of these statistics appears after these comments).

[annotators info] Only five annotators with almost identical backgrounds were used (no information about sex or gender is given; there are probable problems with representativeness).

[data origin] The CycIC test set has only 3,000 sentences; where were the 200 questions taken from? No reference is given, which is a probable problem for reproducibility.

[data choice] Why could CycIC or the widely used ConceptNet categories not be used instead, or compared against?

[probable overstatement] The authors have chosen 8 areas of common sense and claim that their work can be helpful for evaluating the remaining 40, but, as Gordon and Hobbs note, “the difference between these areas is in the degrees of interdependency that theories in these two groups require - these first eight representational areas can be developed in isolation from each other, whereas the latter forty cannot”.
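In connection with the [evaluation method] comment above, the snippet below is a minimal sketch of how the suggested alternatives, Cohen’s kappa and a free-marginal (Randolph) kappa, could be computed for a pair of annotators; the data and the helper function are hypothetical and are not part of the paper under review.

```python
# Illustrative sketch of the agreement statistics suggested in the
# [evaluation method] comment; data and helper names are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def free_marginal_kappa(labels_a, labels_b, n_categories=2):
    """Randolph's free-marginal kappa for two raters:
    (P_o - 1/k) / (1 - 1/k), where P_o is the observed agreement rate."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    p_observed = np.mean(labels_a == labels_b)
    p_expected = 1.0 / n_categories
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical binary labels from two annotators over 200 sentences
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
b = np.where(rng.random(200) < 0.85, a, 1 - a)

print("Cohen's kappa:       ", round(cohen_kappa_score(a, b), 3))
print("Free-marginal kappa: ", round(free_marginal_kappa(a, b), 3))
```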

Presentation

Overall score 3.6 out of 5
Is the article written in clear and proper English? (30%)
5 out of 5
Is the data presented in the most useful manner? (40%)
3 out of 5
Does the paper cite relevant and related articles appropriately? (30%)
3 out of 5

Context

Overall score 3.5 out of 5
Does the title suitably represent the article? (25%)
4 out of 5
Does the abstract correctly embody the content of the article? (25%)
4 out of 5
Does the introduction give appropriate context? (25%)
3 out of 5
Is the objective of the experiment clearly defined? (25%)
3 out of 5

Analysis

Overall score 3.4 out of 5
Does the discussion adequately interpret the results presented? (40%)
3 out of 5
Is the conclusion consistent with the results and discussion? (40%)
4 out of 5
Are the limitations of the experiment as well as the contributions of the experiment clearly outlined? (20%)
3 out of 5

Review 2: An experimental study measuring human annotator categorization agreement on commonsense sentences

Conflict of interest statement

Reviewer declares none

Comments

Comments to the Author: The manuscript is centered on a very interesting and timely topic, which is also quite relevant to EXPR themes. The organization of the paper is good and the proposed method is quite novel. The length of the manuscript is about right and the presentation is good.

The manuscript, however, does not link well with the relevant literature on commonsense computing; for example, it should engage with the latest trends in transformer models for commonsense validation. Also, recent works on the ensemble application of symbolic and subsymbolic AI for commonsense reasoning are missing.

Finally, add some examples of those 200 sentences for better readability and understanding of the paper. In fact, some EXPR readers may not be aware of the importance of commonsense. To this end, I also suggest including some applications of commonsense computing, e.g., dialogue systems with commonsense and fuzzy commonsense reasoning for multimodal sentiment analysis.

Presentation

Overall score 4 out of 5
Is the article written in clear and proper English? (30%)
4 out of 5
Is the data presented in the most useful manner? (40%)
4 out of 5
Does the paper cite relevant and related articles appropriately? (30%)
4 out of 5

Context

Overall score 4 out of 5
Does the title suitably represent the article? (25%)
4 out of 5
Does the abstract correctly embody the content of the article? (25%)
4 out of 5
Does the introduction give appropriate context? (25%)
4 out of 5
Is the objective of the experiment clearly defined? (25%)
4 out of 5

Analysis

Overall score 4 out of 5
Does the discussion adequately interpret the results presented? (40%)
4 out of 5
Is the conclusion consistent with the results and discussion? (40%)
4 out of 5
Are the limitations of the experiment as well as the contributions of the experiment clearly outlined? (20%)
4 out of 5