On the literary landscapes of vector embeddings

Published online by Cambridge University Press:  07 October 2025

Daniel Rockmore*
Affiliation:
Mathematics, Dartmouth College, USA; Santa Fe Institute, USA
Jiayi Chen
Affiliation:
Mathematics, Dartmouth College, USA
Mohammad Javad Latifi Jebelli
Affiliation:
Mathematics, Dartmouth College, USA
Allen Riddell
Affiliation:
School of Informatics, Computing, and Engineering, Indiana University, USA
Harrison Stropkay
Affiliation:
Asymmetric Operations Sector, Applied Physics Laboratory, Laurel, USA
*
Corresponding author: Daniel Rockmore; Email: daniel.n.rockmore@dartmouth.edu

Abstract

From the early use of TF-IDF to the high-dimensional outputs of deep learning, vector space embeddings of text, at scales ranging from token to document, are at the heart of all machine analysis and generation of text. In this article, we present the first large-scale comparison of a sampling of such techniques on a range of classification tasks over a large corpus of current literature drawn from the well-known Books3 data set. Specifically, we compare TF-IDF, Doc2vec and several Transformer-based embeddings on a variety of text-specific tasks. Using industry-standard BISAC codes as a proxy for genre, we compare the embeddings in their ability to preserve information about genre. We further compare these embeddings in their ability to encode inter- and intra-book similarity. All of these comparisons take place at the book "chunk" (1,024 tokens) level. We find Transformer-based ("neural") embeddings to perform best, in the sense of their ability to respect genre and authorship, although almost all embedding techniques produce sensible constructions of a "literary landscape" as embodied by the Books3 corpus. These experiments suggest the possibility of using deep learning embeddings not only for advances in generative AI, but also as a potential tool for book discovery and as an aid to various forms of more traditional comparative textual analysis.
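The core comparison described above can be illustrated in miniature: embed text chunks as vectors, then classify an unlabeled chunk by genre using its nearest labeled neighbor. The sketch below is not the authors' pipeline (which uses 1,024-token chunks, Doc2vec and Transformer embeddings, and real BISAC labels); it is a minimal, standard-library illustration using TF-IDF vectors, cosine similarity, 1-nearest-neighbor classification, a toy corpus, and made-up genre labels.

```python
# Minimal sketch of embedding-based genre classification: TF-IDF vectors
# plus 1-nearest-neighbor by cosine similarity. Toy data; stdlib only.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Embed each tokenized document as a sparse dict of term -> tf-idf weight."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency.
        vecs.append({t: c / len(doc) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "chunks" standing in for 1,024-token book chunks, with toy genre
# labels echoing BISAC codes (CKB = Cookbooks, POL = Politics).
chunks = [
    "whisk the eggs and fold in flour sugar butter".split(),
    "simmer the broth then season and serve warm".split(),
    "the senator argued the policy vote would fail".split(),
    "parliament debated the election policy reform bill".split(),
]
genres = ["CKB", "CKB", "POL", "POL"]

# An unlabeled query chunk: embed it alongside the corpus, then assign
# it the genre of its nearest (most cosine-similar) labeled chunk.
query = "season the broth and whisk the eggs".split()
vecs = tfidf_vectors(chunks + [query])
q = vecs[-1]
best = max(range(len(chunks)), key=lambda i: cosine(q, vecs[i]))
print(genres[best])  # -> CKB
```

The same nearest-neighbor scheme extends directly to KNN with larger n and to denser neural embeddings; only the vectorizer changes.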

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press
Figure 1. Distribution of books in Books3 by genre (as encoded by first BISAC code – see the “BISAC” section of the Appendix for a key to the abbreviations).

Figure 2. Histogram of book coverage rate with the 100-chunk sampling scheme.

Table 1. Inter-genre distance vs. intra-genre distance

Figure 3. UMAP embedding of a sampling of BERT-embedded text chunks from BISAC-labeled CKB (Cookbooks), FIC (Fiction) and POL (Politics).

Table 2. Inter-book vs. intra-book distance by genre

Figure 4. t-SNE embedding of a sampling of fiction (BISAC code FIC) BERT-embedded tokenized chunks from a handful of books. Note the tight intra-book clustering.

Table 3. Classification accuracy by chunk

Table 4. Classification accuracy by book

Figure 5. Confusion matrix – E5 embedding & KNN ($n=200$).

Figure 6. Averaged genre prediction entropy in descending order.

Figure 7. Averaged genre prediction entropy in descending order.

Figure 8. KNN ($n=200$) predictions of the book “Junk Mail” using E5 embedding.

Figure 9. Local dimensionality estimation per genre.

Table A.1. BISAC subject headings for top 20 genres

Figure A.1. Indexing of the entries in our Books3 “sandbox,” composed of 100 books from each of the 20 best-represented genres.

Figure A.2. TFIDF-5k.

Table A.2. Classification accuracy by chunk

Figure A.3. Fiction subgenre confusion matrix – E5 embedding & KNN ($n=200$).

Table A.3. Chunk counts by genre and book section

Table A.4. KNN classification accuracy ($n=100$) across book sections and models

Table A.5. Logistic regression classification accuracy ($C=1$) across book sections and models

Table A.6. Top 5 entropy per genre

Figure A.4. Mean accuracy of KNN classifiers per genre.

Figure A.5. Mean accuracy of logistic regression classifiers per genre.
