
Capturing causal claims: A fine-tuned text mining model for extracting causal sentences from social science papers

Published online by Cambridge University Press: 10 March 2025

Rasoul Norouzi*
Affiliation:
Methodology and Statistics, Tilburg University, Tilburg, Netherlands
Bennett Kleinberg
Affiliation:
Methodology and Statistics, Tilburg University, Tilburg, Netherlands; Department of Security and Crime Science, University College London, London, UK
Jeroen K. Vermunt
Affiliation:
Methodology and Statistics, Tilburg University, Tilburg, Netherlands
Caspar J. van Lissa
Affiliation:
Methodology and Statistics, Tilburg University, Tilburg, Netherlands
*Corresponding author: Rasoul Norouzi; Email: r.norouzinikjeh@tilburguniversity.edu

Abstract

Understanding causality is crucial for social scientific research to develop strong theories and inform practice. However, explicit discussion of causality is often lacking in social science literature due to ambiguous causal language. This paper introduces a text mining model fine-tuned to extract causal sentences from full-text social science papers. A dataset of 529 causal and 529 non-causal sentences manually annotated from the Cooperation Databank (CoDa) was curated to train and evaluate the model. Several pre-trained language models (BERT, SciBERT, RoBERTa, LLAMA, and Mistral) were fine-tuned on this dataset and general-purpose causality datasets. Model performance was evaluated on held-out social science and general-purpose test sets. Results showed that fine-tuning transformer models on the social science dataset significantly improved causal sentence extraction, even with limited data, compared to the models fine-tuned only on the general-purpose data. Results indicate the importance of domain-specific fine-tuning and data for accurately capturing causal language in academic writing. This automated causal sentence extraction method enables comprehensive, large-scale analysis of causal claims across the social sciences. By systematically cataloging existing causal statements, this work lays the foundation for further research to uncover the mechanisms underlying social phenomena, inform theory development, and strengthen the methodological rigor of the field.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Table 1 Overview of the five open-source datasets that were combined to form the general-purpose dataset


Table 2 Examples of sentences and their corresponding labels from the final curated social science dataset


Table 3 Detailed performance metrics of language models on general-purpose and social science test sets, showing precision and F1 scores for causal and non-causal classes
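
The per-class metrics named in the Table 3 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `per_class_metrics`, the label strings, and the toy labels are assumptions made for illustration, and the numbers bear no relation to the reported results. Recall is computed as well because F1 is the harmonic mean of precision and recall.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels=("causal", "non-causal")):
    """Return {label: (precision, recall, f1)} for each class label."""
    # Count how often each (true label, predicted label) pair occurs.
    counts = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = counts[(label, label)]
        # False positives: predicted as `label` but truly another class.
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        # False negatives: truly `label` but predicted as another class.
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy, made-up labels (hypothetical; not taken from the paper's data).
y_true = ["causal", "causal", "causal", "non-causal", "non-causal", "non-causal"]
y_pred = ["causal", "causal", "non-causal", "non-causal", "non-causal", "causal"]

scores = per_class_metrics(y_true, y_pred)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the scores separately for the causal and non-causal classes, as the caption describes, shows whether a model favors one class even when the test set is balanced.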