Introduction and Motivation
With the proliferation of high-throughput data generated by platforms such as microarrays, mass spectrometers, and sequencing machines, scientists are now commonly confronted with measurements of thousands of molecules. The term molecule here can refer to anything from an mRNA strand to an SNP (single nucleotide polymorphism) to a peptide to an exon. When analysis is done across these molecules on a genome-wide basis, the results become difficult for analysts to interpret. In many instances, investigators are provided with a ranked list of molecules that are then prioritized for further study. Although this is a reasonable approach, it can become unwieldy as the number of molecules under consideration grows. Additionally, meta-analyses of high-throughput gene expression studies have found that sets of molecules, such as metabolic pathways and signaling cascades, are dysregulated more consistently across studies than are the specific molecules detected within these sets (e.g., Fan et al., 2006).
Predicated on the assumption that individual molecules act in concert in various biological processes, multiple databases have been constructed to classify molecules into sets of biological interest, and many are publicly available. Databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa and Goto, 2000; Kanehisa et al., 2006, 2008), GO (Gene Ontology) (Consortium, 2000, 2004), and Biocarta give functional annotation for genes (GO) and relationships to enzymes and metabolites (KEGG and Biocarta).