
Unsupervised extraction of local and global keywords from a single text

Published online by Cambridge University Press:  05 December 2024

Lida Aleksanyan
Affiliation:
Alikhanyan National Laboratory, Yerevan, Armenia
Armen Allahverdyan*
Affiliation:
Alikhanyan National Laboratory, Yerevan, Armenia
*Corresponding author: Armen Allahverdyan; Email: armen.allahverdyan@gmail.com

Abstract

We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. Our method has three advantages over existing unsupervised methods (such as YAKE). First, it is significantly more effective at extracting keywords from long texts in terms of precision and recall. Second, it allows inference of two types of keywords: local and global. Third, it extracts basic topics from texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with previous knowledge of texts from our database of classical literary works. The agreement between annotators is moderate to substantial. Our results are supported via human-independent arguments based on the average length of extracted content words and on the average number of nouns in extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. For Anna Karenina by L. Tolstoy (Tolstoy 2013), we show the space frequency $\tau [w]=1/C_1[w]$ and $1/C_2[w]$ versus word rank for all distinct words $w$ of the text; cf. Eqs. (1, 6). We also show two additional quantities: $1/C_{2\,\textrm{perm}}[w]$, that is, $1/C_2[w]$ evaluated after a random permutation of the words in the text, and $f[w]/(1-f[w])$, where $f[w]$ is the frequency of $w$; see Eqs. (5, 3). Distinct words are ranked by $f[w]$, that is, the most frequent word has rank 1, etc. It is seen that $C_{2\,\textrm{perm}}[w]\lt C_{2}[w]$ holds for frequent words, whereas both $C_{2\,\textrm{perm}}[w]\lt C_{2}[w]$ and $C_{2\,\textrm{perm}}[w]\gt C_{2}[w]$ occur among less frequent words. Not shown in the figure: a random permutation of the words in the text leaves $\tau [w]$ unaltered for frequent words, while $\tau [w]$ generically increases for less frequent words (clusterization); cf. Eq. (5).
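
The frequency ranking and the curve $f[w]/(1-f[w])$ used in Figure 1 can be illustrated with a minimal sketch. It assumes that $f[w]$ is the count of $w$ divided by the total number of words (Eq. (3) is not reproduced here), and the function name `frequency_table` and the toy sentence are hypothetical. The quantities $C_1[w]$ and $C_2[w]$ are not computed, since their definitions (Eqs. (1, 6)) are not part of this section.

```python
import random
from collections import Counter


def frequency_table(words):
    """Return (word, rank, f, f/(1-f)) rows, most frequent word first.

    Assumption: f[w] = (count of w) / (total number of words).
    """
    counts = Counter(words)
    total = len(words)
    # Sort by decreasing count; ties are broken alphabetically so that
    # the ranking does not depend on the order of words in the text.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for rank, (word, n) in enumerate(ordered, start=1):
        f = n / total
        odds = f / (1 - f) if f < 1 else float("inf")
        rows.append((word, rank, f, odds))
    return rows


# Toy text (not taken from the analyzed books).
words = "the cat saw the dog and the dog saw the cat".split()
table = frequency_table(words)

# A random permutation of the words changes their positions only,
# so the frequency-based ranking is left unchanged.
shuffled = words[:]
random.Random(0).shuffle(shuffled)
permuted_table = frequency_table(shuffled)
```

Because the frequency depends only on word counts, `table` and `permuted_table` coincide; any difference between the original and permuted text must therefore come from position-dependent quantities such as $C_2[w]$.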

Figure 2. For Animal Farm (AF) by G. Orwell, we show the same quantities as for Anna Karenina (AK) in Figure 1, with the same notation. AK is 11.6 times longer than AF; see Table 1. Some differences between these texts are as follows. The inequality $C_{2\,\textrm{perm}}(w)\lt C_2(w)$ holds for fewer frequent words in AF than in AK. The domains $C_{2\,\textrm{perm}}(w)\lt C_2(w)$ and $C_{2\,\textrm{perm}}(w)\gt C_2(w)$ are well separated in AK, but less so in AF. For AF, relation (5) can be violated for some infrequent words.

Table 1. Analyzed long texts: Anna Karenina, War and Peace, part I, and War and Peace, part II by L. Tolstoy; Master and Margarita by M. Bulgakov; Twelve Chairs by I. Ilf and E. Petrov; The Glass Bead Game by H. Hesse; Crime and Punishment by F. Dostoevsky. Shorter texts: The Heart of Dog by M. Bulgakov; Animal Farm by G. Orwell; Alchemist by P. Coelho. Next to each text, we indicate the number of words in it, stop words included. For long texts, we extracted the same number of $\approx 300$ potential keywords from each text via each method: our method (implemented via Eqs. (8, 9, 10)), LUHN and YAKE. The numbers below are percentages, that is, $15.6= 15.6\%$. For each text, the first percentage shows the precision (Prec.), that is, the fraction of extracted words that were identified as keywords by human annotators. The second percentage shows the recall (Rec.): the fraction of ground-truth keywords that the method was able to extract; see (25). The third percentage shows the F1 score; see (26). For short texts, we extracted $\sim 100$ words via each method. Our method was implemented via Eq. (13); only the precision is shown (the values for recall are not shown). For longer texts, our method provides sizable advantages compared with LUHN and YAKE. For shorter texts, the three methods are comparable.
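
The three scores reported in Table 1 can be sketched as follows. This is a minimal illustration, assuming that Eq. (26) is the standard F1 score, that is, the harmonic mean of precision and recall (Eqs. (25, 26) are not reproduced in this section). The function name `precision_recall_f1` and the toy word lists are hypothetical and are not taken from the paper's data.

```python
def precision_recall_f1(extracted, ground_truth):
    """Precision, recall and F1 for a set of extracted candidate words.

    precision = |extracted AND ground_truth| / |extracted|
    recall    = |extracted AND ground_truth| / |ground_truth|
    F1        = harmonic mean of the two (assumed standard definition)
    """
    extracted = set(extracted)
    ground_truth = set(ground_truth)
    hits = len(extracted & ground_truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    # With no hits both scores vanish, and F1 is set to zero by convention.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example: 4 extracted words, 5 annotator-approved keywords, 3 in common.
extracted = ["anna", "vronsky", "said", "levin"]
ground_truth = ["anna", "vronsky", "levin", "kitty", "marriage"]
precision, recall, f1 = precision_recall_f1(extracted, ground_truth)
```

In this toy case the precision is 3/4 and the recall is 3/5. Since every method extracts the same number of candidates per text, differences in precision between methods translate directly into differences in the number of correctly extracted keywords.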

Table 2. The values of Cohen’s kappa (14) for the agreement in keyword extraction tasks between two annotators for three different texts; see section 5.2. The keyword extraction employed the method discussed in Table 1 and section 3.2. Results for global and local keywords are shown separately. It is seen that the agreement is better for global keywords. A possible explanation is that the annotators do not focus on text details.
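
For orientation, Cohen's kappa for two annotators can be computed as in the sketch below. It uses the standard definition $\kappa=(p_o-p_e)/(1-p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance; we assume that Eq. (14), which is not reproduced in this section, has this form. The function name `cohen_kappa` and the label sequences are hypothetical (1 means "keyword", 0 means "not a keyword").

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if expected == 1:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for 10 candidate words (made up for illustration).
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 7 of 10 items, while the chance agreement is 0.5, so kappa is about 0.4. Kappa corrects the raw agreement for chance, which matters when one label dominates.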

Table 3. Comparison of three different keyword extraction methods for English, Russian, and French versions of Anna Karenina. Percentages for keywords indicate the precision (cf. Table 1), while ‘nouns’ means the percentage of nouns in candidate words that were not identified as keywords. For all cases, our method fares better than LUHN and YAKE.

Table 4. Words of Anna Karenina extracted via our method. For global keywords, the strong and weak cases mean (resp.) that the words $w$ were chosen according to $A(w)\leq \frac{1}{5}$ and $\frac{1}{5}\leq A(w)\leq \frac{1}{3}$; cf. Eqs. (8, 10). Local keywords were chosen according to $A(w)\geq 5$; see Eq. (9). Within each column, the words are arranged according to their frequency; see Eq. (3). Keyword classes are denoted by upper indices; see details in the text. The last group $^{(10)}$ denotes words that were identified as keywords but did not belong to any of the above groups. Words without an upper index were not identified as keywords.
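
The threshold rules in the caption of Table 4 can be summarized by a short sketch. It takes the values $A(w)$ as given, since the definition of $A(w)$ (Eq. (8)) is not reproduced in this section. The function name `classify_by_A`, the placeholder words, and their values are hypothetical. The caption assigns the boundary value $A(w)=\frac{1}{5}$ to both the strong and the weak case; the sketch puts it into the strong case, which is our assumption.

```python
def classify_by_A(a_values):
    """Split candidate words into the three classes of Table 4.

    a_values: mapping word -> A(w), computed elsewhere (assumed given).
    """
    groups = {"global_strong": [], "global_weak": [], "local": []}
    for word, a in a_values.items():
        if a <= 1 / 5:
            groups["global_strong"].append(word)   # A(w) <= 1/5
        elif a <= 1 / 3:
            groups["global_weak"].append(word)     # 1/5 < A(w) <= 1/3
        elif a >= 5:
            groups["local"].append(word)           # A(w) >= 5
        # Words with 1/3 < A(w) < 5 are not selected as candidates.
    return groups


# Placeholder words with made-up A(w) values, for illustration only.
a_values = {"w1": 0.1, "w2": 0.2, "w3": 0.25, "w4": 0.9, "w5": 7.0}
groups = classify_by_A(a_values)
```

With these placeholder values, `w1` and `w2` fall into the strong global class, `w3` into the weak global class, `w5` into the local class, and `w4` is left unselected.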

Table 5. Topical groups extracted from a short text. Heart of Dog by M. Bulgakov is a well-known satirical novella that depicts post-revolutionary Moscow (first half of the 1920s) under social changes, the emergence of new elites of Stalin’s era, and science-driven eugenic ideas of the intelligentsia. Ultimately, the novella is about the life of a homeless dog, Sharik (a standard name for an unpedigreed dog in Russia), picked up for medical and social experiments. The majority of the keywords below were not extracted at all via LUHN and/or YAKE.

Table 6. First column: 36 words from Anna Karenina that have the highest YAKE score (Campos et al. 2018, 2020). Keywords are indicated by the number of their group; see Table 4. Among these 36 words, there are 25 non-keywords. The keywords belong mostly to group $^{(1)}$. Second column: 36 words of Anna Karenina extracted by looking at the distribution of words over chapters, that is, at the largest values of Eq. (19). Only 2 words out of 36 are not keywords. Several keyword groups are represented.