A comparison of latent semantic analysis and correspondence analysis of document-term matrices

Published online by Cambridge University Press:  18 May 2023

Qianqian Qi*
Affiliation:
Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, the Netherlands
David J. Hessen
Affiliation:
Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, the Netherlands
Tejaswini Deoskar
Affiliation:
Department of Languages, Literature, and Communication, Faculty of Humanities, Utrecht University, Utrecht, the Netherlands
Peter G. M. van der Heijden
Affiliation:
Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, the Netherlands; Southampton Statistical Sciences Research Institute, University of Southampton, Highfield, Southampton, UK
*Corresponding author: Qianqian Qi; Email: q.qi@uu.nl

Abstract

Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties compared to LSA. For instance, the effects of margins (that is, the sums of row elements and column elements), which arise from differing document lengths and term frequencies, are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders.
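
As a rough illustration of the margin argument, the following NumPy sketch contrasts the singular values of a raw document-term matrix (LSA on raw counts) with those of the matrix of standardized residuals used in CA. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the toy matrices are hypothetical (the toy data are not Table 1), and ${\boldsymbol{P}}$, ${\boldsymbol{E}}$, ${\boldsymbol{D}}_r$, and ${\boldsymbol{D}}_c$ from the Table 6 caption are assumed to have their usual CA meanings (proportions, expected proportions under independence, and diagonal matrices of row and column masses).

```python
import numpy as np


def lsa_singular_values(F):
    """Singular values of the raw document-term matrix (LSA on raw counts)."""
    return np.linalg.svd(np.asarray(F, dtype=float), compute_uv=False)


def ca_standardized_residuals(F):
    """Standardized residuals D_r^{-1/2} (P - E) D_c^{-1/2}.

    Assumes the usual CA definitions: P = F / (sum of all entries),
    r and c are the row and column sums of P, and E = r c^T.
    All row and column sums of F must be positive.
    """
    F = np.asarray(F, dtype=float)
    P = F / F.sum()            # matrix of proportions
    r = P.sum(axis=1)          # row masses
    c = P.sum(axis=0)          # column masses
    E = np.outer(r, c)         # expected proportions under independence
    # Dividing elementwise by sqrt(r_i * c_j) equals pre- and
    # post-multiplying by D_r^{-1/2} and D_c^{-1/2}.
    return (P - E) / np.sqrt(np.outer(r, c))


def ca_singular_values(F):
    """Singular values of the standardized residuals (CA)."""
    return np.linalg.svd(ca_standardized_residuals(F), compute_uv=False)


# Hypothetical toy matrix whose entries are fully determined by its margins:
# every document has the same term profile, only document lengths differ.
F_margins_only = np.outer([1.0, 2.0, 3.0], [1.0, 4.0, 5.0])

# Hypothetical toy matrix with some document-term association.
F_toy = np.array([[2.0, 1.0, 0.0],
                  [4.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0]])

if __name__ == "__main__":
    print("LSA, margins only:", lsa_singular_values(F_margins_only))
    print("CA,  margins only:", ca_singular_values(F_margins_only))
    print("LSA, toy:", lsa_singular_values(F_toy))
    print("CA,  toy:", ca_singular_values(F_toy))
```

For the margins-only matrix, the CA singular values are zero up to rounding, because ${\boldsymbol{P}}={\boldsymbol{E}}$ there, whereas the raw-count SVD still yields a large first singular value that reflects document length and term frequency alone.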

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. A document-term matrix ${\boldsymbol{F}}$: size 6$\times$6

Table 2. The singular values, the squares of singular values, and the proportion of explained total sum of squared singular values (PSSSV) for each dimension of LSA of ${\boldsymbol{F}}$, of ${\boldsymbol{F}}^{L1}$, of ${\boldsymbol{F}}^{L2}$, and of ${\boldsymbol{F}}^{\text{TF-IDF}}$

Figure 1. A two-dimensional plot of documents and terms (a) for raw matrix ${\boldsymbol{F}}$; (b) for row-normalized data ${\boldsymbol{F}}^{L1}$; (c) for row-normalized data ${\boldsymbol{F}}^{L2}$; and (d) for matrix ${\boldsymbol{F}}^{\text{TF-IDF}}$.

Table 3. Row profiles of ${\boldsymbol{F}}$

Table 4. A row-normalized document-term matrix ${\boldsymbol{F}}^{L2}$

Table 5. A document-term matrix ${\boldsymbol{F}}^{\text{TF-IDF}}$

Table 6. The matrix ${\boldsymbol{D}}_r^{-\frac{1}{2}}({\boldsymbol{P}}-{\boldsymbol{E}}){\boldsymbol{D}}_{c}^{-\frac{1}{2}}$ of standardized residuals

Table 7. The singular values, the inertia, and the proportions of explained total inertia for each dimension of CA

Figure 2. CA of the data in Table 1: (a) symmetric map and (b) asymmetric map.

Figure 3. The first two dimensions for each document of the BBCNews dataset by (a) LSA-RAW; (b) LSA-NROWL1; (c) LSA-NROWL2; (d) LSA-TFIDF; and (e) CA.

Figure 4. The first two dimensions for each document of the BBCSport dataset by (a) LSA-RAW; (b) LSA-NROWL1; (c) LSA-NROWL2; (d) LSA-TFIDF; and (e) CA.

Figure 5. The first two dimensions for each document of the 20 Newsgroups dataset by (a) LSA-RAW; (b) LSA-NROWL1; (c) LSA-NROWL2; (d) LSA-TFIDF; and (e) CA.

Table 8. The minimum optimal dimensionality $k$ and the accuracy at $k$ for LSA-RAW, LSA-NROWL1, LSA-NROWL2, LSA-TFIDF, and CA, and the accuracy (Acc) for RAW, using different distance measures with the BBCNews, BBCSport, and 20 Newsgroups datasets

Figure 6. Accuracy as a function of dimension for CA, LSA-RAW, LSA-NROWL1, LSA-NROWL2, LSA-TFIDF, and RAW.

Figure 7. The first two dimensions for each document of author Datheen and author Marnix, and the Wilhelmus (in red) by (a) LSA-RAW; (b) LSA-NROWL1; (c) LSA-NROWL2; (d) LSA-TFIDF; and (e) CA.

Figure 8. The first two dimensions for each document of author Datheen and another author, and the Wilhelmus (in red) using CA: (a) Heere; (b) Haecht; (c) Fruytiers; and (d) Coornhert.

Figure 9. The first two dimensions for each document of six authors, and the Wilhelmus (in red) by (a) LSA-RAW; (b) LSA-NROWL1; (c) LSA-NROWL2; (d) LSA-TFIDF; and (e) CA.

Table 9. The minimum optimal dimensionality $k$ and the accuracy at $k$ for LSA-RAW, LSA-NROWL1, LSA-NROWL2, LSA-TFIDF, and CA, and the accuracy for RAW, using different distance measures with the Wilhelmus dataset

Figure 10. Accuracy versus the number of dimensions (centroid method) for CA, RAW, LSA-RAW, LSA-NROWL1, LSA-NROWL2, and LSA-TFIDF with the Wilhelmus dataset.