This chapter describes genome sequence data and explains the relevance of the statistics, computation and algebra that we have discussed in Chapters 1–3 to understanding the function of genomes and their evolution. It sets the stage for the studies in biological sequence analysis in some of the later chapters.
Given that quantitative methods play an increasingly important role in many different aspects of biology, the question arises: why the emphasis on genome sequences? The most significant answer is that genomes are fundamental objects that carry instructions for the self-assembly of living organisms. Ultimately, our understanding of human biology will be based on an understanding of the organization and function of our genome. Another reason to focus on genomes is the abundance of high-fidelity data. Current finished genome sequences have less than one error in 10,000 bases. Statistical methods can therefore be directly applied to modeling the random evolution of genomes and to making inferences about the structure and organization of functional elements; there is no need to worry about extracting signal from noisy data. Furthermore, it is possible to validate findings with laboratory experiments.
The rate of accumulation of genome sequence data has been extraordinary, far outpacing Moore's law for the increasing density of transistors on circuit chips. This is due to breakthroughs in sequencing technologies and radical advances in automation. Since the first completion of the genome of a free-living organism in 1995 (Haemophilus influenzae [Fleischmann et al., 1995]), biologists have completely sequenced over 200 microbial genomes, and dozens of complete invertebrate and vertebrate genomes.
This volume on matrix algebra and its companion volume on statistics are the first two volumes of the Econometric Exercises Series. The two books contain exercises in matrix algebra, probability, and statistics, relating to course material that students are expected to know while enrolled in an (advanced) undergraduate or a postgraduate course in econometrics.
When we started writing this volume, our aim was to provide a collection of interesting exercises with complete and rigorous solutions. In fact, we wrote the book that we — as students — would have liked to have had. Our intention was not to write a textbook, but to supply material that could be used together with a textbook. But as the volume developed we discovered that we had in fact written a textbook, albeit one organized in a completely different manner. Thus, we do provide and prove theorems in this volume, because continually referring to other texts seemed undesirable. The volume can thus be used either as a self-contained course in matrix algebra or as a supplementary text.
We have attempted to develop new ideas slowly and carefully. The important ideas are introduced algebraically and sometimes geometrically, but also through examples. It is our experience that most students find it easier to assimilate the material through examples rather than by the theoretical development only.
Some of the statistical models introduced in Chapter 1 have the feature that, aside from the observed data, there is hidden information that cannot be determined from an observation. In this chapter we consider graphical models with hidden variables, such as the hidden Markov model and the hidden tree model. A natural problem in such models is to determine, given a particular observation, what is the most likely hidden data (which is called the explanation) for that observation. This problem is called MAP inference (Remark 4.13). Any fixed values of the parameters determine a way to assign an explanation to each possible observation. A map obtained in this way is called an inference function.
Examples of inference functions include gene-finding functions, which were discussed in [Pachter and Sturmfels, 2005, Section 5]. These inference functions of a hidden Markov model are used to identify gene structures in DNA sequences (see Section 4.4). An observation in such a model is a sequence over the alphabet Σ′ = {A, C, G, T}.
After a short introduction to inference functions, we present the main result of this chapter in Section 9.2. We call it the Few Inference Functions Theorem, and it states that in any graphical model the number of inference functions grows polynomially if the number of parameters is fixed. This theorem shows that most functions from the set of observations to possible values of the hidden data cannot be inference functions for any choice of the model parameters.
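To make MAP inference concrete, here is a minimal sketch of the Viterbi algorithm in a toy two-state hidden Markov model over the DNA alphabet. The state names and all parameter values are illustrative assumptions, not taken from the text; the algorithm returns the explanation (most likely hidden path) for an observed sequence.

```python
# Sketch of MAP inference (Viterbi) in a toy two-state HMM over DNA.
# States and parameter values are hypothetical, chosen for illustration.
import math

states = ("gene", "intergenic")
trans = {("gene", "gene"): 0.9, ("gene", "intergenic"): 0.1,
         ("intergenic", "gene"): 0.1, ("intergenic", "intergenic"): 0.9}
emit = {"gene":       {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
init = {"gene": 0.5, "intergenic": 0.5}

def viterbi(obs):
    """Return the most likely hidden path (the explanation) for obs."""
    # v[s] = best log-probability of any hidden path ending in state s
    v = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []                       # back-pointers, one dict per position
    for ch in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best = max(prev, key=lambda r: prev[r] + math.log(trans[(r, s)]))
            ptr[s] = best
            v[s] = prev[best] + math.log(trans[(best, s)]) + math.log(emit[s][ch])
        back.append(ptr)
    # trace the optimal path backwards through the pointers
    path = [max(v, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("GGGCGCAATT"))
```

Fixing the parameters, as above, determines one inference function: the map sending each observed sequence to its explanation.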
We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algebraic theory of statistical models (as developed in Chapters 1 and 3). Our basic tool is Singular Value Decomposition (SVD) from numerical linear algebra.
Starting with an alignment of n DNA sequences, we show that SVD allows us to quickly decide whether a split of the taxa occurs in their phylogenetic tree, assuming only that evolution follows a tree Markov model. Using this fact, we have developed an algorithm to construct a phylogenetic tree by computing only O(n²) SVDs.
We have implemented this algorithm using the SVDLIBC library (available at http://tedlab.mit.edu/~dr/SVDLIBC/) and have done extensive testing with simulated and real data. The algorithm is fast in practice on trees with 20–30 taxa.
We begin by describing the general Markov model and then show how to flatten the joint probability distribution along a partition of the leaves in the tree. We give rank conditions for the resulting matrix; most notably, we give a set of new rank conditions that are satisfied by non-splits in the tree. Armed with these rank conditions, we present the tree-building algorithm, using SVD to calculate how close a matrix is to a certain rank. Finally, we give experimental results on the behavior of the algorithm with both simulated and real-life (ENCODE) data.
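The core numerical step above can be sketched as follows, using synthetic data rather than a real alignment. For a 4-state model, a flattening along a true split has low rank, so by the Eckart–Young theorem the distance from the flattening to the nearest rank-4 matrix is the norm of its singular values beyond the fourth. The matrix sizes and noise level here are illustrative assumptions.

```python
# Sketch of the SVD-based rank test: measure how close a flattening
# matrix is to rank 4 (the bound satisfied by a true split under a
# 4-state Markov model). The data below are synthetic, for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A flattening for a split A|B arranges the joint distribution as a
# matrix with rows indexed by states of A and columns by states of B.
# Simulate one that is exactly rank 4 (a true split) plus slight noise.
low_rank = rng.random((16, 4)) @ rng.random((4, 16))
flattening = low_rank / low_rank.sum() + rng.normal(0, 1e-4, (16, 16))

def distance_from_rank(M, r):
    """Frobenius distance from M to the nearest rank-r matrix
    (Eckart-Young): the norm of the singular values beyond the r-th."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt((s[r:] ** 2).sum()))

print(distance_from_rank(flattening, 4))            # small for a true split
print(distance_from_rank(rng.random((16, 16)), 4))  # large for a non-split
```

An algorithm along these lines would compute this distance for each candidate split and prefer splits whose flattenings are nearly rank 4.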
The general Markov model
We assume that evolution follows a tree Markov model, as introduced in Section 1.4, with evolution acting independently at different sites of the genome.
Graphical models are powerful statistical tools that have been applied to a wide variety of problems in computational biology: sequence alignment, ancestral genome reconstruction, etc. A graphical model consists of a graph whose vertices have associated random variables representing biological objects, such as entries in a DNA sequence, and whose edges have associated parameters that model transition or dependence relations between the random variables at the nodes. In many cases we will know the contents of only a subset of the model vertices, the observed random variables, and nothing about the contents of the remaining ones, the hidden random variables. A common example is a phylogenetic tree on a set of current species with given DNA sequences, but with no information about the DNA of their extinct ancestors. The task of finding the most likely set of values of the hidden random variables (also known as the explanation) given the set of observed random variables and the model parameters, is known as inference in graphical models.
Clearly, inference drawn about the hidden data is highly dependent on the topology and parameters (transition probabilities) of the graphical model. The topology of the model will be determined by the biological process being modeled, while the assumptions one can make about the nature of evolution, site mutation and other biological phenomena, allow us to restrict the space of possible transition probabilities to certain parameterized families. This raises several questions.
Using homologous sequences from eight vertebrates, we present a concrete example of the estimation of mutation rates in the models of evolution introduced in Chapter 4. We detail the process of data selection from a multiple alignment of the ENCODE regions, and compare rate estimates for each of the models in the Felsenstein hierarchy of Figure 4.7. We also address a standing problem in vertebrate evolution, namely the resolution of the phylogeny of the Eutherian orders, and discuss several challenges of molecular sequence analysis in inferring the phylogeny of this subclass. In particular, we consider the question of the position of the rodents relative to the primates, carnivores and artiodactyls; we affectionately dub this question the rodent problem.
Estimating mutation rates
Given an alignment of sequence homologs from various taxa, and an evolutionary model from Section 4.5, we are naturally led to ask the question, “what tree (with what branch lengths) and what values of the parameters in the rate matrix for that model are suggested by the alignment?” One answer to this question, the so-called maximum-likelihood solution, is, “the tree and rate parameters which maximize the probability that the given alignment would be generated by the given model.” (See also Sections 1.3 and 3.3.)
There are a number of available software packages which attempt to find, to varying degrees, this maximum-likelihood solution. For example, for a few of the most restrictive models in the Felsenstein hierarchy, the package PHYLIP [Felsenstein, 2004] will very efficiently search the tree space for the maximum-likelihood tree and rate parameters.
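For the single most restrictive model in the hierarchy, the Jukes–Cantor model, the maximum-likelihood distance between two aligned sequences even has a closed form, which gives a feel for what these packages compute. A minimal sketch (the example sequences are made up for illustration):

```python
# Closed-form maximum-likelihood distance under the Jukes-Cantor model:
# d = -(3/4) ln(1 - (4/3) p), where p is the fraction of differing sites.
# Example sequences are hypothetical.
import math

def jukes_cantor_distance(seq1, seq2):
    """ML estimate of expected substitutions per site under Jukes-Cantor."""
    assert len(seq1) == len(seq2), "sequences must be aligned"
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:                  # saturation: the estimate diverges
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTTC"))
```

For richer models in the hierarchy no such closed form exists, which is why the branch lengths and rate parameters must be found by numerical likelihood maximization.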
This chapter is the most abstract of the book. You may skip it at first reading, and jump directly to Chapter 4. But make sure you return to it later. Matrix theory can be viewed from an algebraic viewpoint or from a geometric viewpoint — both are equally important. The theory of vector spaces is essential in understanding the geometric viewpoint.
Associated with every vector space is a set of scalars, used to define scalar multiplication on the space. In the most abstract setting these scalars are required only to be elements of an algebraic field. We shall, however, always take the scalars to be the set of complex numbers (complex vector space) or, as an important special case, the set of real numbers (real vector space).
A vector space (or linear space) V is a nonempty set of elements (called vectors) together with two operations and a set of axioms. The first operation is addition, which associates with any two vectors x, y ∈ V a vector x + y ∈ V (the sum of x and y). The second operation is scalar multiplication, which associates with any vector x ∈ V and any real (or complex) scalar α, a vector αx ∈ V. It is the scalars (rather than the vectors) that determine whether the space is real or complex.
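The two operations can be illustrated concretely in the complex vector space C², using Python's built-in complex numbers (the particular vectors and scalar below are our own example values):

```python
# The two vector-space operations in C^2, with a spot check of one of
# the axioms (distributivity of scalar multiplication over addition).
# Example vectors and scalar are chosen arbitrarily for illustration.
def add(x, y):
    """Vector addition: componentwise sum."""
    return tuple(a + b for a, b in zip(x, y))

def scale(alpha, x):
    """Scalar multiplication: multiply every component by alpha."""
    return tuple(alpha * a for a in x)

x = (1 + 2j, 3 - 1j)
y = (0 + 1j, 2 + 0j)
alpha = 2 - 1j                      # a complex scalar: the space is complex

# Both results again lie in C^2 (closure), and alpha*(x+y) = alpha*x + alpha*y.
assert scale(alpha, add(x, y)) == add(scale(alpha, x), scale(alpha, y))
```

Restricting the scalar α to real values would make the same componentwise operations those of a real vector space, illustrating that it is the scalars that fix the type of the space.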