
Building an LLM-Powered Content Moderation Bot in the Classroom

Published online by Cambridge University Press: 23 March 2026

Shelby Grossman
Affiliation:
Arizona State University, USA
Anthony Mensah
Affiliation:
Stanford University, USA
Alex Stamos
Affiliation:
Stanford University, USA
Jeffrey Hancock
Affiliation:
Stanford University, USA

Abstract

As online platforms seek to improve content-moderation strategies, large language models (LLMs) offer a potential tool. This study examines opportunities and limitations of LLM-powered moderation through a unique lens: student projects for a Stanford University course titled Trust and Safety. In this course, students developed Discord bots that used LLMs to moderate specific types of harmful content. Interviews with 16 of the students suggest that these models demonstrate high accuracy, often exceeding students’ expectations. Notably, in cases of disagreement between the student and the model, closer analysis frequently validated the model’s judgments. However, students also observed limitations: LLMs proved unhelpfully sensitive to prompt phrasing and exhibited many of the contextual-interpretation challenges common to human moderators and traditional machine-learning classifiers.
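To make the setup concrete, the following is a minimal sketch of an LLM-powered Discord moderation bot of the kind the students built. The library choices (discord.py and the OpenAI chat-completions API), the model name, the harm category, and the prompt wording are illustrative assumptions, not details taken from the course projects.

```python
import discord
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt: classify a single message for one harm category.
MODERATION_PROMPT = (
    "You are a content moderator for a Discord server. "
    "Decide whether the following message contains harassment. "
    "Answer with exactly one word: FLAG or ALLOW.\n\nMessage: {text}"
)

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
bot = discord.Client(intents=intents)


def classify(text: str) -> str:
    """Ask the LLM whether a message should be flagged."""
    # Synchronous call for simplicity; a production bot would avoid
    # blocking the event loop.
    response = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": MODERATION_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().upper()


@bot.event
async def on_message(message: discord.Message) -> None:
    if message.author.bot:
        return  # ignore the bot's own messages
    if classify(message.content) == "FLAG":
        # Route to a human reviewer rather than deleting outright.
        await message.reply("This message was flagged for moderator review.")


bot.run("YOUR_DISCORD_BOT_TOKEN")
```

In this sketch the bot flags content for human review rather than removing it automatically, which keeps a person in the loop for the borderline cases where, as the interviews note, model and student judgments can diverge.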

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of American Political Science Association

Table 1. GPT-4o Content-Moderator Performance Metrics