An unsupervised information-theoretic approach to identifying formulaic clusters in textual data

Published online by Cambridge University Press:  19 September 2025

Gideon Yoffe*
Affiliation: Department of Statistics and Data Science, The Hebrew University of Jerusalem, Jerusalem, Israel
Yair Segev
Affiliation: Faculty of Theology, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
Barak Sober
Affiliation: Department of Statistics and Data Science, The Hebrew University of Jerusalem, Jerusalem, Israel
* Corresponding author: Gideon Yoffe; Email: gideon.yoffe@mail.huji.ac.il

Abstract

Texts, whether literary or historical, exhibit structural and stylistic patterns shaped by their purpose, authorship and cultural context. Formulaic texts, which are characterized by repetition and constrained expression, tend to differ in their information content (in Shannon's sense) from more dynamic compositions. Identifying such patterns in historical documents, particularly multi-author texts like the Hebrew Bible, provides insights into their origins, purpose and transmission. This study aims to identify formulaic clusters, that is, sections exhibiting systematic repetition and structural constraints, by analyzing recurring phrases, syntactic structures and stylistic markers. However, distinguishing formulaic from non-formulaic elements in an unsupervised manner presents a computational challenge, especially in high-dimensional and sample-poor datasets, where patterns must be inferred without predefined labels.

To address this, we develop an information-theoretic algorithm leveraging weighted self-information distributions to detect structured patterns in text. Our approach directly models variations in sample-wise self-information to identify formulaicity. By extending classical discrete self-information measures with a continuous formulation based on differential self-information in multivariate Gaussian distributions, our method remains applicable across different types of textual representations, including neural embeddings under Gaussian priors.
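To make the two notions of self-information mentioned above concrete, the following is a minimal sketch and not the authors' algorithm: it computes the discrete self-information of a binary feature vector and the differential self-information of a real-valued vector. The function names, the independence assumption across features, and the diagonal (rather than full) Gaussian covariance are simplifying assumptions introduced here for illustration; the weighting scheme referred to in the abstract is not reproduced.

```python
import math

def estimate_probs(samples):
    """Per-feature activation frequencies of a list of binary vectors."""
    n = len(samples)
    d = len(samples[0])
    return [sum(s[j] for s in samples) / n for j in range(d)]

def bernoulli_self_information(sample, probs, eps=1e-12):
    """Discrete self-information -log2 P(x), in bits.

    Assumption (illustrative): features are independent Bernoulli variables.
    Probabilities are clamped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for x, p in zip(sample, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log2(p if x else 1.0 - p)
    return total

def gaussian_self_information(sample, means, variances):
    """Differential self-information -ln f(x), in nats.

    Assumption (illustrative): a Gaussian with diagonal covariance,
    so the density factorizes over coordinates.
    """
    total = 0.0
    for x, mu, var in zip(sample, means, variances):
        total += 0.5 * math.log(2.0 * math.pi * var) + (x - mu) ** 2 / (2.0 * var)
    return total

# Toy corpus: three repeated (formula-like) rows and one atypical row.
corpus = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 1, 1]]
probs = estimate_probs(corpus)
common = bernoulli_self_information([1, 1, 0], probs)
rare = bernoulli_self_information([0, 1, 1], probs)
```

In this toy corpus the repeated row carries less self-information than the atypical one, which is the intuition behind using sample-wise self-information as a signal of formulaicity.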

Applied to hypothesized authorial divisions in the Hebrew Bible, our approach successfully isolates stylistic layers, providing a quantitative framework for textual stratification. This method enhances our ability to analyze compositional patterns, offering deeper insights into the literary and cultural evolution of texts shaped by complex authorship and editorial processes.

Type: Research Article
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices: Open data
Copyright: © The Author(s), 2025. Published by Cambridge University Press

Figure 1. Classification results for the benchmarking experiment on categorical one-hot encoded data described in the “Clustering Benchmarking on Categorical Data” section. The test datasets included 100 samples of (equally sized) formulaic and non-formulaic classes, in 200 (top panel), 50 (middle panel) and 20 dimensions (bottom panel), with varying base feature-activation probability ($p$), formulaic feature-activation probability ($p_{\text {form}}$) and fraction of formulaic dimensions in the formulaic class. The colored areas represent one-standard-deviation intervals derived from 100 simulations.
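A benchmark dataset of the kind summarized in this caption could be generated along the following lines. This is a hedged sketch only: the function name `make_benchmark`, the choice of the first dimensions as the formulaic ones, and the use of independent Bernoulli draws are assumptions made here for illustration; the actual procedure is the one given in the “Clustering Benchmarking on Categorical Data” section.

```python
import random

def make_benchmark(n_per_class, d, p, p_form, frac_form, seed=0):
    """Draw two equally sized classes of binary feature vectors.

    Non-formulaic class: every feature is active with probability p.
    Formulaic class: a fraction frac_form of the dimensions is active
    with probability p_form, and the remaining ones with probability p.
    Assumption (illustrative): the formulaic dimensions are the first k.
    """
    rng = random.Random(seed)
    k = round(frac_form * d)  # number of formulaic dimensions

    def draw(feature_probs):
        return [1 if rng.random() < q else 0 for q in feature_probs]

    base_probs = [p] * d
    form_probs = [p_form] * k + [p] * (d - k)

    X = [draw(base_probs) for _ in range(n_per_class)]
    X += [draw(form_probs) for _ in range(n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class  # 1 marks the formulaic class
    return X, y

# Example: 50 samples per class in 20 dimensions, half of them formulaic.
X, y = make_benchmark(n_per_class=50, d=20, p=0.1, p_form=0.9, frac_form=0.5)
```

Fixing the seed keeps individual runs reproducible; repeating the draw with different seeds would give the kind of simulation ensemble from which standard-deviation bands are computed.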

Figure 2. Classification results of the experiment described in the “Clustering Benchmarking on Multivariate Gaussian Data” section, for varying sample sizes and numbers of dimensions of multivariate Gaussian classes of varying entropy. Upper panel: Varying sample sizes for $d = 50$. Bottom panel: Varying sample sizes for $d = 10$. The colored areas represent one-standard-deviation intervals derived from 100 simulations.

Figure 3. Clustering results for the book of Genesis across different parameter combinations, evaluated against expert annotations distinguishing between the main textual body and genealogical lists traditionally attributed to P. Results are shown for our cross-information-based clustering method (left) and k-means (right). Top panel: The 20 feature combinations that yield the highest MCC scores, indicating the strongest agreement with expert annotations. Bottom panel: Distribution of MCC scores across all parameter combinations, sorted into discrete performance intervals.
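For readers unfamiliar with the score used in these panels, the Matthews correlation coefficient (MCC) for a binary partition can be computed as sketched below. The formula is the standard one; the helper `clustering_mcc`, which takes the better of the two possible cluster-to-label assignments because cluster indices are arbitrary, is an assumption introduced here and not necessarily how the scores in the figure were obtained.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def clustering_mcc(y_true, clusters):
    """MCC of a two-cluster assignment against expert labels.

    Assumption (illustrative): cluster indices carry no meaning, so both
    possible mappings of clusters to labels are scored and the better kept.
    """
    flipped = [1 - c for c in clusters]
    return max(mcc(y_true, clusters), mcc(y_true, flipped))

# Hypothetical labels: a clustering that matches the annotation up to relabeling.
expert = [1, 1, 0, 0]
found = [0, 0, 1, 1]
score = clustering_mcc(expert, found)
```

An MCC of 1 indicates perfect agreement with the expert annotation, 0 indicates agreement no better than chance, and -1 indicates complete disagreement.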

Figure 4. Clustering results for the P/non-P partition in the book of Exodus, similar to Figure 3.

Figure 5. Clustering results for the P/H partition in the book of Leviticus, similar to Figure 3.

Figure 6. Distinctive n-grams extracted from the formulaic cluster for two parameter combinations, capturing 30% of the variance (see Section 2.8 in Yoffe et al. (2023)). Left panel: Clustering of morphologically represented Leviticus using $\ell = 28$, $n = 4$ and $f = \text {all}$, achieving an MCC score of 94%. Right panel: Same as the left panel but with $\ell = 20$, $n = 2$ and $f = 300$, achieving an MCC score of 93%. The n-grams discussed in the “Formulaic Structure and Parameter Sensitivity in the P/H Partition of Leviticus” section as examples of H- and P-associated features are outlined in red. The insets display the self-information distributions of both clusters, with blue and orange representing the non-formulaic and formulaic clusters, respectively.

Figure A1. Dependence of the resolving power of self-information on sample size and dimensionality. (a) Linear dependence of variance on dimension. (b) Linear dependence of variance on $1/n$. (c) Linear dependence of $\Delta I$ on dimension.

Figure C1. Clustering performance (MCC) as a function of key dataset parameters. Orange represents formulaic clustering; blue represents feature-distribution-based clustering. Upper panel: MCC vs. fraction of formulaic dimensions $d_{\text {form}}$, with fixed $p_{\text {form}} = 0.5$ and varying $p_{\text {feature}} \in \{0.1, 0.3, 0.5\}$. Middle panel: MCC vs. baseline feature activation probability $p_{\text {feature}}$, with fixed $d_{\text {form}} = 0.1$ and varying $p_{\text {form}} \in \{0.8, 0.5, 0.1\}$. Lower panel: MCC vs. formulaic activation probability $p_{\text {form}}$, with fixed $d_{\text {form}} = 0.3$ and varying $p_{\text {feature}} \in \{0.1, 0.3, 0.5\}$.

Figure D1. Clustering results for the hypothesized partitions of the books of Genesis, Exodus and Leviticus (left to right, respectively), using GMM clustering, similar to Figure 3.
