In this chapter we present a probabilistic approach to the homology mapping problem. This is the problem of identifying regions among genomic sequences that diverged from the same region in a common ancestor. We explore this question as a combinatorial optimization problem, seeking the best assignment of labels to the nodes in a Markov random field. The general problem is formulated using toric models, for which it is unfortunately intractable to find an exact solution. However, for a relevant subclass of models, we find a (non-integer) linear programming formulation that gives us the exact integer solution in polynomial time in the size of the problem. It is encouraging that for a useful subclass of toric models, maximum a posteriori inference is tractable.
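To make the linear programming idea concrete, here is a minimal Python sketch of MAP inference on a small chain-structured Markov random field via the standard local-polytope LP relaxation. This is a generic illustration, not necessarily the formulation developed in the chapter; the chain, the binary label set, and all potential values are made-up numbers. On tree-structured models this relaxation is known to be tight, so the LP optimum is integral, which is the phenomenon the chapter exploits.

```python
# MAP inference on a toy 3-node chain MRF via the local-polytope LP.
# All potentials are illustrative made-up numbers.
import numpy as np
from scipy.optimize import linprog

nodes = [0, 1, 2]                 # a 3-node chain
edges = [(0, 1), (1, 2)]
L = 2                             # binary labels

theta_node = np.array([[0.0, 1.0],      # node log-potentials (made up)
                       [0.5, 0.0],
                       [0.0, 0.3]])
theta_edge = {e: np.array([[0.4, 0.0],  # edge log-potentials (made up)
                           [0.0, 0.4]]) for e in edges}

# Variable layout: node indicators first, then edge indicators.
n_idx = {(i, x): i * L + x for i in nodes for x in range(L)}
off = len(nodes) * L
e_idx = {(e, x, y): off + k * L * L + x * L + y
         for k, e in enumerate(edges) for x in range(L) for y in range(L)}
nvars = off + len(edges) * L * L

c = np.zeros(nvars)               # linprog minimizes, so negate
for (i, x), j in n_idx.items():
    c[j] = -theta_node[i, x]
for (e, x, y), j in e_idx.items():
    c[j] = -theta_edge[e][x, y]

rows, b = [], []
for i in nodes:                   # each node distribution sums to 1
    r = np.zeros(nvars)
    for x in range(L):
        r[n_idx[i, x]] = 1.0
    rows.append(r); b.append(1.0)
for e in edges:                   # edge marginals match node marginals
    i, j = e
    for x in range(L):
        r = np.zeros(nvars)
        for y in range(L):
            r[e_idx[e, x, y]] = 1.0
        r[n_idx[i, x]] = -1.0
        rows.append(r); b.append(0.0)
    for y in range(L):
        r = np.zeros(nvars)
        for x in range(L):
            r[e_idx[e, x, y]] = 1.0
        r[n_idx[j, y]] = -1.0
        rows.append(r); b.append(0.0)

res = linprog(c, A_eq=np.array(rows), b_eq=b, bounds=(0, 1))
labels = [int(np.argmax([res.x[n_idx[i, x]] for x in range(L)])) for i in nodes]
print("MAP labelling:", labels)   # integral because the chain is a tree
```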
Genome mapping
Evolutionary divergence gives rise to different present-day genomes that are related by shared ancestry. Evolutionary events occur at varying rates, but also at different scales of genomic regions. Local mutation events (for instance, the point mutations, insertions and deletions discussed in Section 4.5) occur at the level of one or several base-pairs. Large-scale mutations can occur at the level of single or multiple genes, chromosomes, or even an entire genome. Some of these mutation mechanisms, such as rearrangement and duplication, were briefly introduced in Section 4.1. As a result, regions in two different genomes can be tied to a single region in the ancestral genome, linked by a series of mutational events.
As discussed in Chapter 1, the EM algorithm is an iterative procedure used to obtain maximum likelihood estimates (MLEs) for the parameters of statistical models which are induced by a hidden variable construct, such as the hidden Markov model (HMM). The tree structure underlying the HMM allows us to organize the required computations efficiently, which leads to an efficient implementation of the EM algorithm for HMMs known as the Baum–Welch algorithm. For several examples of two-state HMMs with binary output, we plot the likelihood function and relate the paths taken by the EM algorithm to the gradient of the likelihood function.
The EM algorithm for hidden Markov models
The hidden Markov model is obtained from the fully observed Markov model by marginalization; see Sections 1.4.2 and 1.4.3. We will use the same notation as there, so σ = σ₁σ₂ … σₙ ∈ Σⁿ is a sequence of states and τ = τ₁τ₂ … τₙ ∈ (Σ′)ⁿ a sequence of output variables. We assume that we observe N sequences, τ¹, τ², …, τᴺ ∈ (Σ′)ⁿ, each of length n, but that the corresponding state sequences, σ¹, σ², …, σᴺ ∈ Σⁿ, are not observed (hidden).
In Section 1.4.2 it is assumed that there is a uniform distribution on the first state in each sequence, i.e., Prob(σ₁ = r) = 1/l for each r ∈ Σ, where l = |Σ|.
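To make the construction concrete, the following is a minimal Python sketch of the Baum–Welch algorithm in exactly this setting: a two-state HMM with binary output, N observed sequences of length n, and a uniform distribution on the first hidden state. The toy observation sequences, the random initialization, and the iteration count are illustrative choices, not taken from the text.

```python
# Baum-Welch (EM) for a two-state HMM with binary output, using the
# standard scaled forward-backward recursions.
import numpy as np

def forward_backward(tau, T, E):
    """Scaled forward-backward pass for one observed sequence tau."""
    n, l = len(tau), T.shape[0]
    alpha, beta, scale = np.zeros((n, l)), np.zeros((n, l)), np.zeros(n)
    alpha[0] = (1.0 / l) * E[:, tau[0]]          # uniform initial state
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * E[:, tau[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (E[:, tau[t + 1]] * beta[t + 1]) / scale[t + 1]
    return alpha, beta, scale

def baum_welch(seqs, l=2, m=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(l), size=l)        # transition matrix
    E = rng.dirichlet(np.ones(m), size=l)        # emission matrix
    for _ in range(iters):
        T_num = np.zeros((l, l)); E_num = np.zeros((l, m))
        for tau in seqs:
            alpha, beta, scale = forward_backward(tau, T, E)
            gamma = alpha * beta                 # posterior state probs
            gamma /= gamma.sum(axis=1, keepdims=True)
            for t in range(len(tau) - 1):        # expected transitions
                xi = (alpha[t][:, None] * T * E[:, tau[t + 1]]
                      * beta[t + 1]) / scale[t + 1]
                T_num += xi
            for t, o in enumerate(tau):          # expected emissions
                E_num[:, o] += gamma[t]
        T = T_num / T_num.sum(axis=1, keepdims=True)   # M-step
        E = E_num / E_num.sum(axis=1, keepdims=True)
    return T, E

seqs = [np.array([0, 1, 1, 0, 1, 1, 1, 0]),      # toy binary outputs
        np.array([1, 1, 0, 1, 1, 0, 1, 1])]
T, E = baum_welch(seqs)
print("transitions:\n", T, "\nemissions:\n", E)
```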
The contributions in this part of the book were all written by students, postdocs and visitors who in some way were involved in the graduate course Algebraic Statistics for Computational Biology that we taught in the mathematics department at UC Berkeley during the fall of 2004. The eighteen chapters offer a more in-depth study of some of the themes which were introduced in Part I. Most of the chapters contain original research that has not been published elsewhere. Highlights among new research results include:
New results about polytope propagation and parametric inference (Chapters 5, 6 and 8).
An example of a biologically correct alignment which is not the optimal alignment for any choice of parameters in the pair HMM (Chapter 7).
Theorem 9.3, which states that the number of inference functions of a graphical model grows polynomially for a fixed number of parameters.
Theorem 10.5, which states that, for alphabets with four or more letters, every toric Viterbi sequence is a Viterbi sequence.
Explicit calculations of phylogenetic invariants for the strand symmetric model which interpolate between the general reversible model and group based models (Chapter 16).
Tree reconstruction based on singular value decomposition (Chapter 19).
The other chapters also include new mathematical results or methodological advances in computational biology. Chapter 15 introduces a standardized framework for working with small trees. Even results on the smallest non-trivial tree (with three leaves) are interesting, and are discussed in Chapter 18. Similarly, Chapter 14 presents a unified algebraic statistical view of mutagenetic tree models.
Polytope propagation associated with hidden Markov models or arbitrary tree models, as introduced in [Pachter and Sturmfels, 2004a], can be carried to a further level of abstraction. Generalizing polytope propagation allows for a clearer view of the algorithmic complexity issues involved.
We begin with the simple observation that a graphical model associated with a model graph G, which may or may not be a tree, defines another directed graph Γ(G), which can roughly be seen as a product of G with the state space of the model (considered as a graph with an isolated node for each state). Polytope propagation actually takes place on this product graph Γ(G). The nodes represent the propagated polytopes, while each arc carries a vector representing multiplication by a monomial in the parameters of the model.
The purpose of this chapter is to collect some information about what happens if Γ(G) is replaced by an arbitrary directed acyclic graph and to explain how this more general form of polytope propagation is implemented in polymake.
Polytopes from directed acyclic graphs
Let Γ = (V, A) be a finite directed graph with node set V and arc set A. Also, let α : A → ℝᵈ be some function. We assume that Γ does not have any directed cycles. As such, Γ has at least one source and one sink, where a source is a node of in-degree zero and a sink is a node of out-degree zero.
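Here is a minimal Python sketch of this general form of polytope propagation (the implementation referred to in the chapter is in polymake; this toy stand-in only illustrates the recursion). Polytopes are represented by their vertex sets, each source starts from the origin, and at every other node we take the convex hull of the union of the polytopes arriving along its incoming arcs, each translated by the arc's vector. The DAG and the arc vectors α below are made-up two-dimensional examples.

```python
# Polytope propagation on a toy DAG: poly(v) = conv( union over arcs
# (u, v) of poly(u) translated by alpha(u, v) ), sources start at 0.
import numpy as np
from scipy.spatial import ConvexHull

def hull_vertices(points):
    """Prune a 2-d point set to the vertices of its convex hull."""
    pts = np.unique(np.asarray(points, dtype=float), axis=0)
    if len(pts) <= 2:                      # degenerate hulls stay as-is
        return pts
    try:
        return pts[ConvexHull(pts).vertices]
    except Exception:                      # collinear point sets
        return pts

def propagate(nodes, arcs, alpha):
    """nodes: list in topological order; arcs: list of (u, v) pairs;
    alpha: dict mapping each arc to its vector (here in R^2)."""
    preds = {v: [u for (u, w) in arcs if w == v] for v in nodes}
    poly = {}
    for v in nodes:
        if not preds[v]:                   # source: the origin polytope
            poly[v] = np.zeros((1, 2))
        else:                              # hull of translated polytopes
            pts = np.vstack([poly[u] + np.asarray(alpha[(u, v)])
                             for u in preds[v]])
            poly[v] = hull_vertices(pts)
    return poly

nodes = ["s", "a", "b", "t"]               # made-up DAG, source s, sink t
arcs = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"), ("s", "t")]
alpha = {("s", "a"): (1, 0), ("s", "b"): (0, 1),
         ("a", "t"): (0, 2), ("b", "t"): (2, 0), ("s", "t"): (1, 1)}
poly = propagate(nodes, arcs, alpha)
print(poly["t"])                           # vertices of the sink polytope
```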
This appendix collects mathematical tools that are needed in the main text. In addition, it gives a brief description of some essential background topics. It is assumed that the reader knows elementary calculus. The topics are grouped in four sections. First, we consider some useful methods of indirect proofs. Second, we introduce elementary results for complex numbers and polynomials. The third topic concerns series expansions. Finally, some further calculus is presented.
Some methods of indirect proof
Perhaps the most fundamental of all mathematical tools is the construction of a proof. When a direct proof is hard to obtain, there are indirect methods that can often help. In this section, we will denote a statement by p (such as “I like this book”), and another by q (such as “matrix algebra is interesting”). The negation of p will be denoted by ¬p. The statement “p and q” is denoted by p ∧ q, and the statement “p or q (or both)” is denoted by p ∨ q. The statements ¬(p ∨ q) and ¬p ∧ ¬q are equivalent: the negation transforms p, q into ¬p, ¬q and ∨ into ∧. This is the equivalent of De Morgan's law for sets, where p and q would be sets, ¬p the complement of p, p ∨ q the union of the sets, and p ∧ q their intersection.
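Since the equivalence is a statement about all truth assignments, it can also be checked exhaustively; the following small Python snippet does exactly that.

```python
# A truth-table check of De Morgan's law for statements:
# not (p or q) is equivalent to (not p) and (not q).
from itertools import product

for p, q in product([True, False], repeat=2):
    assert (not (p or q)) == ((not p) and (not q))
print("De Morgan's law holds for all four truth assignments.")
```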
Part I of this book is devoted to outlining the basic principles of algebraic statistics and their relationship to computational biology. Although some of the ideas are complex, and their relationships intricate, the underlying philosophy of our approach to biological sequence analysis is summarized in the cartoon on the cover of the book. The fictional character is DiaNA, who appears throughout the book, and is the statistical surrogate for our biological intuition. In the cartoon, DiaNA is walking randomly on a graph and is tossing tetrahedral dice that can land on one of the letters A, C, G or T. A key feature of the tosses is that the outcome depends on the direction of her route. We, the observers, record the letters that appear on the successive throws, but are unable to see the path that DiaNA takes on her graph. Our goal is to guess DiaNA's path from the die roll outcomes. That is, we wish to make an inference about missing data from certain observations.
In this book, the observed data are DNA sequences. A standard problem of computational biology is to infer an optimal alignment for two given DNA sequences. We shall see that this problem is precisely our example of guessing DiaNA's path. In Chapter 4 we give an introduction to the relevant biological concepts, and we argue that our example is not just a toy problem but is fundamental for designing efficient algorithms for analyzing real biological data.
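Computationally, guessing DiaNA's path is maximum a posteriori inference in a hidden Markov model, which the Viterbi algorithm solves by dynamic programming. The sketch below is purely illustrative: the two hypothetical dice, their transition and emission probabilities, and the observed letter string are all made-up numbers, not taken from the book.

```python
# Viterbi decoding: the most probable hidden path given observed letters.
import numpy as np

states = ["fair", "biased"]                # hypothetical dice
letters = "ACGT"
init = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1],                # trans[i][j] = P(j | i)
                [0.2, 0.8]])
emit = np.log([[0.25, 0.25, 0.25, 0.25],   # fair tetrahedral die
               [0.10, 0.40, 0.40, 0.10]])  # biased toward C and G

obs = [letters.index(c) for c in "ACGGCTAC"]
n, l = len(obs), len(states)
V = np.full((n, l), -np.inf)               # best log-prob ending in state
ptr = np.zeros((n, l), dtype=int)          # backpointers
V[0] = init + emit[:, obs[0]]
for t in range(1, n):
    for j in range(l):
        scores = V[t - 1] + trans[:, j]
        ptr[t, j] = np.argmax(scores)
        V[t, j] = scores[ptr[t, j]] + emit[j, obs[t]]

path = [int(np.argmax(V[-1]))]             # trace back the best path
for t in range(n - 1, 0, -1):
    path.append(ptr[t, path[-1]])
print([states[s] for s in reversed(path)])
```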
Direct reconstruction of phylogenetic trees by maximum likelihood methods is computationally prohibitive for trees with many taxa; however, by computing all trees for subsets of taxa of size m, we can infer the entire tree. In particular, if m = 2, the traditional distance-based methods such as neighbor-joining [Saitou and Nei, 1987] and UPGMA [Sneath and Sokal, 1973] are applicable. Under distance-based methods, 2-leaf subtrees are completely determined by the total length between each pair of leaves. We extend this idea to m leaves by developing the notion of m-dissimilarity [Pachter and Speyer, 2004]. By building trees on subsets of size m of the taxa and finding the total length, we can obtain an m-dissimilarity map. We will explain the generalized neighbor-joining (GNJ) algorithm [Levy et al., 2005] for obtaining a phylogenetic tree with edge lengths from an m-dissimilarity map.
This algorithm is consistent: given an m-dissimilarity map DT that comes from a tree T, GNJ returns the correct tree. However, in the case of data that is “noisy”, e.g., when the observed dissimilarity map does not lie in the space of trees, the accuracy of GNJ depends on the reliability of the subtree lengths. Numerical methods may run into trouble when models are of high degree (Section 1.3); exact methods for computing subtrees, therefore, could only serve to improve the accuracy of GNJ. One family of such methods consists of algorithms for finding critical points of the ML equations as discussed in Chapter 15 and in [Hoşten et al., 2005].
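For concreteness, here is a minimal Python sketch of classical neighbor-joining, the m = 2 case mentioned above (not the generalized GNJ algorithm itself): given a dissimilarity matrix, repeatedly join the pair minimizing the Q-criterion until the tree is resolved. The 4-taxon distance matrix is made-up but additive, so the method recovers the underlying tree exactly, illustrating the consistency property.

```python
# Classical neighbor-joining on a toy additive distance matrix.
import numpy as np

def neighbor_joining(D, names):
    D = np.asarray(D, dtype=float)
    nodes = list(names)
    edges = []                                   # (child, parent, length)
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        Q = (n - 2) * D - r[:, None] - r[None, :]   # Q-criterion
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        li = 0.5 * D[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = D[i, j] - li
        new = f"({nodes[i]},{nodes[j]})"
        edges += [(nodes[i], new, li), (nodes[j], new, lj)]
        d_new = 0.5 * (D[i] + D[j] - D[i, j])    # distances to new node
        keep = [k for k in range(n) if k not in (i, j)]
        D = np.vstack([np.hstack([D[np.ix_(keep, keep)],
                                  d_new[keep][:, None]]),
                       np.hstack([d_new[keep], [0.0]])])
        nodes = [nodes[k] for k in keep] + [new]
    edges.append((nodes[0], nodes[1], D[0, 1]))  # join the last pair
    return edges

D = [[0, 3, 7, 8],                               # additive toy distances
     [3, 0, 6, 7],
     [7, 6, 0, 3],
     [8, 7, 3, 0]]
for child, parent, length in neighbor_joining(D, list("ABCD")):
    print(f"{child} -- {parent}: {length:.2f}")
```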
The past two decades have seen econometrics grow into a vast discipline. Many different branches of the subject now happily coexist with one another. These branches interweave econometric theory and empirical applications, and bring econometric method to bear on a myriad of economic issues. Against this background, a guided treatment of the modern subject of econometrics in a series of volumes of worked econometric exercises seemed a natural and rather challenging idea.
The present Series, Econometric Exercises, was conceived in 1995 with this challenge in mind. Now, almost a decade later, it has become an exciting reality with the publication of the first installment of a series of volumes of worked econometric exercises. How can these volumes work as a tool of learning that adds value to the many existing textbooks of econometrics? What readers do we have in mind as benefiting from this Series? What format best suits the objective of helping these readers learn, practice, and teach econometrics? These questions we now address, starting with our overall goals for the Series.
Econometric Exercises is published as an organized set of volumes. Each volume in the Series provides a coherent sequence of exercises in a specific field or subfield of econometrics. Solved exercises are assembled together in a structured and logical pedagogical framework that seeks to develop the subject matter of the field from its foundations through to its empirical applications and advanced reaches.