Inspired by the problem of inferring gene networks associated with the host response to infectious diseases, a new framework for discriminative factor models is developed. Bayesian shrinkage priors are employed to impose (near) sparsity on the factor loadings, while non-parametric techniques are used to infer the number of factors needed to represent the data. Two discriminative Bayesian loss functions are investigated, namely the logistic log-loss and the max-margin hinge loss. Efficient mean-field variational Bayesian inference and Gibbs sampling are implemented. To address large-scale datasets, an online version of variational Bayes is also developed. Experimental results on two real-world microarray-based gene expression datasets show that the proposed framework achieves comparatively superior classification performance, with model interpretation delivered via pathway association analysis.
Background
From a statistical-modeling perspective, gene expression analysis can be roughly divided into two phases: exploration and prediction. In the former, the practitioner attempts to gain a general understanding of a dataset by modeling its variability in an interpretable way, such that the inferred model can serve as a feature extractor and a hypothesis-generating mechanism for the underlying biological processes. Factor models are among the most widely employed techniques for exploratory gene expression analysis [1, 2], with principal component analysis being a popular special case [3]. Predictive modeling, on the other hand, is concerned with finding a relationship between gene expression and phenotypes that generalizes to unseen samples. Examples of predictive models include classification methods such as logistic regression and support vector machines [4, 5].
Factor models infer a latent covariance structure among the genes or biomarkers, with the data modeled as generated from a noisy low-rank matrix factorization, expressed in terms of a loadings matrix and a factor scores matrix. Different specifications for these matrices give rise to special cases of factor models, such as principal component analysis [6], nonnegative matrix factorization [7], independent component analysis [8], and sparse factor models [1]. Factor models employing a sparse factor loadings matrix are of significant interest in gene expression analysis, as the nonzero elements in the loadings matrix may be interpreted as correlated gene networks [1, 2, 9].
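To make the noisy low-rank factorization concrete, the following minimal sketch simulates data from a sparse factor model. It is an illustration only and not the framework proposed here: the function name, the dimensions, the Gaussian draws, and the hard zero mask are all assumptions introduced for this example, and exact zeros stand in for the (near) sparsity that shrinkage priors would induce.

```python
import numpy as np

def simulate_sparse_factor_data(n_genes=50, n_samples=20, n_factors=3,
                                sparsity=0.8, noise_std=0.1, seed=0):
    """Draw data = loadings @ scores + noise with a sparse loadings matrix.

    All names and default values are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Loadings matrix (genes x factors): each entry is zeroed with
    # probability `sparsity`, so each factor touches only a few genes.
    loadings = rng.normal(size=(n_genes, n_factors))
    keep = rng.random(size=(n_genes, n_factors)) >= sparsity
    loadings = loadings * keep
    # Factor scores matrix (factors x samples).
    scores = rng.normal(size=(n_factors, n_samples))
    # Additive Gaussian noise on top of the low-rank signal.
    noise = noise_std * rng.normal(size=(n_genes, n_samples))
    data = loadings @ scores + noise
    return data, loadings, scores

data, loadings, scores = simulate_sparse_factor_data()
# Genes with a nonzero loading on factor 0: one candidate "gene network".
network_0 = np.flatnonzero(loadings[:, 0])
```

In this sketch, the noise-free part of `data` has rank at most `n_factors`, and the nonzero pattern in each column of `loadings` identifies the group of genes that vary together through that factor, which is the sense in which sparse loadings can be read as correlated gene networks.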