On the literary landscapes of vector embeddings

Published online by Cambridge University Press:  07 October 2025

Daniel Rockmore*
Affiliation:
Mathematics, Dartmouth College, USA; Santa Fe Institute, USA
Jiayi Chen
Affiliation:
Mathematics, Dartmouth College, USA
Mohammad Javad Latifi Jebelli
Affiliation:
Mathematics, Dartmouth College, USA
Allen Riddell
Affiliation:
School of Informatics, Computing, and Engineering, Indiana University, USA
Harrison Stropkay
Affiliation:
Asymmetric Operations Sector, Applied Physics Laboratory, Laurel, USA
*
Corresponding author: Daniel Rockmore; Email: daniel.n.rockmore@dartmouth.edu

Abstract

From the early use of TF-IDF to the high-dimensional outputs of deep learning, vector space embeddings of text, at scales ranging from token to document, are at the heart of all machine analysis and generation of text. In this article, we present the first large-scale comparison of a sampling of such techniques on a range of classification tasks over a large corpus of current literature drawn from the well-known Books3 data set. Specifically, we compare TF-IDF, Doc2vec and several Transformer-based embeddings on a variety of text-specific tasks. Using industry-standard BISAC codes as a proxy for genre, we compare the embeddings in their ability to preserve information about genre. We further compare these embeddings in their ability to encode inter- and intra-book similarity. All of these comparisons take place at the book "chunk" (1,024 tokens) level. We find Transformer-based ("neural") embeddings to perform best, in the sense of their ability to respect genre and authorship, although almost all embedding techniques produce sensible constructions of a "literary landscape" as embodied by the Books3 corpus. These experiments suggest the possibility of using deep learning embeddings not only for advances in generative AI, but also as a potential tool for book discovery and as an aid to various forms of more traditional comparative textual analysis.
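The core comparison described above can be illustrated in miniature: embed text chunks as vectors, then classify an unlabeled chunk by genre using its nearest labeled neighbor. The sketch below is not the authors' pipeline (which uses 1,024-token chunks, Doc2vec and Transformer embeddings, and real BISAC labels); it is a minimal, standard-library illustration using TF-IDF vectors, cosine similarity, 1-nearest-neighbor classification, a toy corpus, and made-up genre labels.

```python
# Minimal sketch of embedding-based genre classification: TF-IDF vectors
# plus 1-nearest-neighbor by cosine similarity. Toy data; stdlib only.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Embed each tokenized document as a sparse dict of term -> tf-idf weight."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency.
        vecs.append({t: c / len(doc) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "chunks" standing in for 1,024-token book chunks, with toy genre
# labels echoing BISAC codes (CKB = Cookbooks, POL = Politics).
chunks = [
    "whisk the eggs and fold in flour sugar butter".split(),
    "simmer the broth then season and serve warm".split(),
    "the senator argued the policy vote would fail".split(),
    "parliament debated the election policy reform bill".split(),
]
genres = ["CKB", "CKB", "POL", "POL"]

# An unlabeled query chunk: embed it alongside the corpus, then assign
# it the genre of its nearest (most cosine-similar) labeled chunk.
query = "season the broth and whisk the eggs".split()
vecs = tfidf_vectors(chunks + [query])
q = vecs[-1]
best = max(range(len(chunks)), key=lambda i: cosine(q, vecs[i]))
print(genres[best])  # -> CKB
```

The same nearest-neighbor scheme extends directly to KNN with larger n and to denser neural embeddings; only the vectorizer changes.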

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press
Figure 1. Distribution of books in Books3 by genre (as encoded by first BISAC code – see the “BISAC” section of the Appendix for a key to the abbreviations).

Figure 2. Histogram of book coverage rate with the 100-chunk sampling scheme.

Table 1. Inter-genre distance vs. intra-genre distance

Figure 3. UMAP embedding of a sampling of BERT-embedded text chunks from BISAC-labeled CKB (Cookbooks), FIC (Fiction) and POL (Politics).

Table 2. Inter-book vs. intra-book distance by genre

Figure 4. t-SNE embedding of a sampling of fiction (BISAC code FIC) BERT-embedded tokenized chunks from a handful of books. Note the tight intra-book clustering.

Table 3. Classification accuracy by chunk

Table 4. Classification accuracy by book

Figure 5. Confusion matrix – E5 embedding & KNN ($n=200$).

Figure 6. Averaged genre prediction entropy in descending order.

Figure 7. Averaged genre prediction entropy in descending order.

Figure 8. KNN ($n=200$) predictions of the book “Junk Mail” using E5 embedding.

Figure 9. Local dimensionality estimation per genre.

Table A.1. BISAC subject headings for top 20 genres

Figure A.1. Indexing of the entries in our Books3 “sandbox,” composed of 100 books from each of the 20 best-represented genres.

Figure A.2. TFIDF-5k.

Table A.2. Classification accuracy by chunk

Figure A.3. Fiction subgenre confusion matrix – E5 embedding & KNN ($n=200$).

Table A.3. Chunk counts by genre and book section

Table A.4. KNN classification accuracy ($n=100$) across book sections and models

Table A.5. Logistic regression classification accuracy ($C=1$) across book sections and models

Table A.6. Top 5 entropy per genre

Figure A.4. Mean accuracy of KNN classifiers per genre.

Figure A.5. Mean accuracy of logistic regression classifiers per genre.
