
From dictionaries to LLMs – an evaluation of sentiment analysis techniques for German language data

Published online by Cambridge University Press:  03 July 2025

Jannis Klähn*
Affiliation:
Leipzig University, Computational Humanities, Leipzig, Germany
Saxon Academy of Sciences and Humanities, Leipzig, Germany
Janos Borst-Graetz
Affiliation:
Leipzig University, Computational Humanities, Leipzig, Germany
Manuel Burghardt
Affiliation:
Leipzig University, Computational Humanities, Leipzig, Germany
Corresponding author: Jannis Klähn; Email: jannis.klaehn@uni-leipzig.de

Abstract

In this study, we perform a comprehensive evaluation of sentiment classification for German language data using three different approaches: (1) dictionary-based methods, (2) fine-tuned transformer models such as BERT and XLM-T and (3) various large language models (LLMs) with zero-shot capabilities, including natural language inference models, Siamese models and dialog-based models. The evaluation considers a variety of German language datasets, including contemporary social media texts, product reviews and humanities datasets. Our results confirm that dictionary-based methods, while computationally efficient and interpretable, fall short in classification accuracy. Fine-tuned models offer strong performance, but require significant training data and computational resources. LLMs with zero-shot capabilities, particularly dialog-based models, demonstrate competitive performance, often rivaling fine-tuned models, while eliminating the need for task-specific training. However, challenges remain regarding non-determinism, prompt sensitivity and the high resource requirements of large LLMs. The results suggest that for sentiment analysis in the computational humanities, where non-English and historical language data are common, LLM-based zero-shot classification is a viable alternative to fine-tuned models and dictionaries. Nevertheless, model selection remains highly context-dependent, requiring careful consideration of trade-offs between accuracy, resource efficiency and transparency.
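The abstract contrasts three families of approaches; as a minimal illustration of the first, a dictionary-based classifier scores a text by summing the polarity values of matched words and classifying by the sign of the total. The tiny German lexicon below is a hypothetical stand-in for real resources such as SentiWS, not an actual dictionary from the study:

```python
# Sketch of a dictionary-based sentiment classifier: sum word polarities,
# then classify by the sign of the total score.
# The lexicon is a toy stand-in, not a real sentiment resource.
POLARITY = {
    "gut": 1.0,        # "good"
    "großartig": 1.5,  # "great"
    "schlecht": -1.0,  # "bad"
    "furchtbar": -1.5, # "terrible"
}

def classify(text: str) -> str:
    """Return 'positive', 'negative' or 'neutral' for a German text."""
    tokens = (t.strip(".,!?") for t in text.lower().split())
    score = sum(POLARITY.get(t, 0.0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

classify("Das Essen war großartig, der Service gut.")  # positive
classify("Ein furchtbar schlechter Film.")             # negative
```

Note that the inflected form "schlechter" in the second example is not matched by the surface-form lookup, which illustrates why such methods are interpretable and cheap but, as the abstract notes, fall short in classification accuracy without lemmatization or broader lexical coverage.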

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Aggregated list of all evaluated LLMs, including their names as they appear on Hugging Face or their version names


Table 2. Statistics of all datasets used, including sentiment label distribution, average text length and temporal coverage


Table 3. Evaluation of all models and methods in this study, based on micro F1 scores


Table 4. Aggregation of Table 3: micro F1 scores averaged across all datasets within each domain


Table 5. Model runtime and items processed per second on the Amazon dataset


Table 6. Total number of incorrect outputs across all 280,418 data records, by model
