A Guide for Sparse PCA: Model Comparison and Applications

Published online by Cambridge University Press: 01 January 2025

Rosember Guerra-Urzola*
Affiliation:
Tilburg University
Katrijn Van Deun
Affiliation:
Tilburg University
Juan C. Vera
Affiliation:
Tilburg University
Klaas Sijtsma
Affiliation:
Tilburg University
* Correspondence should be made to Rosember Guerra-Urzola, Department of Methodology and Statistics, Tilburg University, Prof. Cobbenhagenlaan 225, Simon Building, Room S 820, 5037 DB Tilburg, The Netherlands. Email: R.I.GuerraUrzola@tilburguniversity.edu

Abstract

Principal component analysis (PCA) is a popular tool for exploring and summarizing multivariate data, especially data consisting of many variables. PCA, however, is often difficult to interpret, because each component is a linear combination of all the variables. To address this issue, numerous methods have been proposed to sparsify the coefficients in the components, including rotation-thresholding methods and, more recently, PCA methods subject to sparsity-inducing penalties or constraints. Here, we offer guidelines on how to choose among the different sparse PCA methods. The current literature lacks clear guidance on the properties and performance of the different sparse PCA methods, and often relies on the misconception that the equivalence of the formulations for ordinary PCA also holds for sparse PCA. To guide potential users of sparse PCA methods, we first discuss several popular sparse PCA methods in terms of where the sparseness is imposed (on the loadings or on the weights), the assumed model, and the optimization criterion used to impose sparseness. Second, using an extensive simulation study, we assess each of these methods by means of performance measures such as squared relative error, misidentification rate, and percentage of explained variance for several data-generating models and conditions for the population model. Finally, two examples using empirical data are considered.
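The distinction between weights and loadings mentioned in the abstract can be illustrated with a small numerical sketch. It is not one of the methods compared in the article: it uses plain thresholding of the PCA weights on synthetic data, and all function names (`pca_weights`, `threshold_weights`, `ls_loadings`, `pev`) are hypothetical. It assumes that weights `W` define the component scores `T = XW`, that loadings `P` reconstruct the data as `X ≈ T Pᵀ`, and that the proportion of explained variance is computed as `1 - ||X - X W Pᵀ||² / ||X||²`, which is a common definition and may differ in detail from the one used in the article.

```python
import numpy as np

def pca_weights(X, k):
    """Center X column-wise and return it with the first k right singular vectors (p x k)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc, Vt[:k].T

def threshold_weights(W, keep):
    """Keep the `keep` largest-magnitude entries in each column, zero the rest, rescale to unit norm."""
    W_sparse = np.zeros_like(W)
    for j in range(W.shape[1]):
        idx = np.argsort(np.abs(W[:, j]))[-keep:]
        W_sparse[idx, j] = W[idx, j]
        W_sparse[:, j] /= np.linalg.norm(W_sparse[:, j])
    return W_sparse

def ls_loadings(Xc, W):
    """Loadings P that minimize ||Xc - Xc W P^T||_F for fixed weights W (least squares)."""
    T = Xc @ W                                  # component scores
    P_t, *_ = np.linalg.lstsq(T, Xc, rcond=None)
    return P_t.T

def pev(Xc, W, P):
    """Assumed definition: 1 - ||Xc - Xc W P^T||^2 / ||Xc||^2."""
    resid = Xc - Xc @ W @ P.T
    return 1.0 - np.sum(resid ** 2) / np.sum(Xc ** 2)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

Xc, W = pca_weights(X, k=2)
P = ls_loadings(Xc, W)            # ordinary PCA: loadings coincide with weights
W_s = threshold_weights(W, keep=3)
P_s = ls_loadings(Xc, W_s)        # sparse weights: loadings are generally dense

print("ordinary PCA, weights equal loadings:", np.allclose(W, P))
print("thresholded, weights equal loadings: ", np.allclose(W_s, P_s))
print("PEV ordinary vs. thresholded:", pev(Xc, W, P), pev(Xc, W_s, P_s))
```

In this sketch the weights and loadings coincide for ordinary PCA, but once the weights are made sparse the least-squares loadings no longer match them. This is one way to see why sparseness imposed on the weights and sparseness imposed on the loadings lead to different models.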

Information

Type
Application Reviews and Case Studies
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2021 The Author(s)

Table 1. Summary of methods for sparse PCA.

Table 2. Simulation design factors and their levels.

Table 3. Simulation description summary.

Figure 1. Matching sparsity: boxplots of the performance measures in conditions with 80% of the variance in the data accounted for by the model and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).

Figure 2. Double sparsity: boxplots of the performance measures in conditions with 80% of the variance in the data accounted for by the model and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).

Figure 3. Mismatching sparsity: boxplots of the performance measures in conditions with 80% of the variance in the data accounted for by the model and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods. The top row summarizes the squared relative error (SRE-LW) for the loadings (at the left of the dashed line) and weights (at the right of the dashed line), the second row the SRE-S for the component scores, the third row (PEV) the proportion of variance in the data explained by the estimated model, and the bottom row the misidentification rate (MR).

Figure 4. Misidentification rate (MR): boxplots of the MR in conditions with 80% of the variance in the data accounted for by the model, a proportion of sparsity of 0.8, and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods.

Figure 5. Percentage of explained variance (PEV): boxplots of the PEV in conditions with 80% of the variance in the data accounted for by the model, a proportion of sparsity of 0.8, and two components. Within each panel, a dashed line divides the boxplots for sparse loadings methods (at the left side of the dashed line) from those for sparse weights methods.

Figure 6. Index of sparseness (IS) and percentage of explained variance (PEV) against the proportion of sparsity (PS).

Figure 7. Biplot: the dots in each subplot represent the component scores, the arrows the component loadings.

Table 4. Sparse loading and weights composition by trait (OCEAN).

Figure 8. Index of sparseness and percentage of explained variance against the proportion of sparsity when applying GPower to the gene expression data set.

Figure 9. Scatter plot of component scores.

Supplementary material: File

Guerra-Urzola et al. supplementary material

The online version contains supplementary material available at https://doi.org/10.1007/s11336-021-09773-2.