Direct reconstruction of phylogenetic trees by maximum likelihood methods is computationally prohibitive for trees with many taxa; however, by computing all trees for subsets of taxa of size m, we can infer the entire tree. In particular, if m = 2, the traditional distance-based methods such as neighbor-joining [Saitou and Nei, 1987] and UPGMA [Sneath and Sokal, 1973] are applicable. Under distance-based methods, 2-leaf subtrees are completely determined by the total length between each pair of leaves. We extend this idea to m leaves by developing the notion of m-dissimilarity [Pachter and Speyer, 2004]. By building trees on subsets of size m of the taxa and finding the total length, we can obtain an m-dissimilarity map. We will explain the generalized neighbor-joining (GNJ) algorithm [Levy et al., 2005] for obtaining a phylogenetic tree with edge lengths from an m-dissimilarity map.
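As a concrete illustration of the m = 3 case: for any three leaves i, j, k of a tree, the subtree spanning them has total length (d(i,j) + d(i,k) + d(j,k))/2, since each pairwise path traverses two of the three branches meeting at the interior Steiner point. The following Python sketch (our own illustration, not the GNJ implementation of [Levy et al., 2005]) builds a 3-dissimilarity map from a tree metric using this identity.

    from itertools import combinations

    def three_dissimilarity(d, taxa):
        # Sketch: for leaves i, j, k of a tree, the spanning subtree has
        # total length (d[i,j] + d[i,k] + d[j,k]) / 2, because each
        # pairwise path covers two of the three branches meeting at the
        # Steiner point.
        D = {}
        for i, j, k in combinations(taxa, 3):
            D[(i, j, k)] = (d[(i, j)] + d[(i, k)] + d[(j, k)]) / 2.0
        return D

    # Example: a star tree on leaves a, b, c with branch lengths 1, 2, 3.
    d = {("a", "b"): 3.0, ("a", "c"): 4.0, ("b", "c"): 5.0}
    d.update({(j, i): v for (i, j), v in d.items()})
    print(three_dissimilarity(d, ["a", "b", "c"]))  # {('a', 'b', 'c'): 6.0}

GNJ takes such an m-dissimilarity map as its input.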
This algorithm is consistent: given an m-dissimilarity map DT that comes from a tree T, GNJ returns the correct tree. However, in the case of data that is “noisy”, e.g., when the observed dissimilarity map does not lie in the space of trees, the accuracy of GNJ depends on the reliability of the subtree lengths. Numerical methods may run into trouble when models are of high degree (Section 1.3); exact methods for computing subtrees, therefore, could only serve to improve the accuracy of GNJ. One family of such methods consists of algorithms for finding critical points of the ML equations as discussed in Chapter 15 and in [Hoşten et al., 2005].
This chapter describes genome sequence data and explains the relevance of the statistics, computation and algebra that we have discussed in Chapters 1–3 to understanding the function of genomes and their evolution. It sets the stage for the studies in biological sequence analysis in some of the later chapters.
Given that quantitative methods play an increasingly important role in many different aspects of biology, the question arises: why the emphasis on genome sequences? The most significant answer is that genomes are fundamental objects that carry instructions for the self-assembly of living organisms. Ultimately, our understanding of human biology will be based on an understanding of the organization and function of our genome. Another reason to focus on genomes is the abundance of high-fidelity data. Current finished genome sequences have less than one error in 10,000 bases. Statistical methods can therefore be directly applied to modeling the random evolution of genomes and to making inferences about the structure and organization of functional elements; there is no need to worry about extracting signal from noisy data. Furthermore, it is possible to validate findings with laboratory experiments.
The rate of accumulation of genome sequence data has been extraordinary, far outpacing Moore's law for the increasing density of transistors on circuit chips. This is due to breakthroughs in sequencing technologies and radical advances in automation. Since the first completion of the genome of a free-living organism in 1995 (Haemophilus influenzae [Fleischmann et al., 1995]), biologists have completely sequenced over 200 microbial genomes and dozens of invertebrate and vertebrate genomes.
Some of the statistical models introduced in Chapter 1 have the feature that, aside from the observed data, there is hidden information that cannot be determined from an observation. In this chapter we consider graphical models with hidden variables, such as the hidden Markov model and the hidden tree model. A natural problem in such models is to determine, given a particular observation, what is the most likely hidden data (which is called the explanation) for that observation. This problem is called MAP inference (Remark 4.13). Any fixed values of the parameters determine a way to assign an explanation to each possible observation. A map obtained in this way is called an inference function.
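For the hidden Markov model, computing the explanation is the job of the classical Viterbi algorithm. A minimal sketch with numpy follows (our own illustrative code; init, trans and emit are numpy arrays of probabilities, and observations are assumed to be encoded as integer indices into the emission table):

    import numpy as np

    def viterbi(obs, init, trans, emit):
        # MAP inference in an HMM: return the most likely hidden path
        # (the "explanation") for the observed sequence obs.
        #   init[s]     initial probability of hidden state s
        #   trans[s, t] probability of moving from state s to state t
        #   emit[s, o]  probability of emitting symbol o in state s
        T = len(obs)
        # Work in log space to avoid underflow on long sequences.
        logp = np.log(init) + np.log(emit[:, obs[0]])
        back = np.zeros((T, len(init)), dtype=int)
        for t in range(1, T):
            scores = logp[:, None] + np.log(trans) + np.log(emit[:, obs[t]])[None, :]
            back[t] = scores.argmax(axis=0)
            logp = scores.max(axis=0)
        # Trace the optimal hidden path backwards from the best end state.
        path = [int(logp.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

Each fixed choice of init, trans and emit turns this procedure into one inference function, assigning an explanation to every possible observation.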
Examples of inference functions include the gene-finding functions discussed in [Pachter and Sturmfels, 2005, Section 5]. These inference functions of a hidden Markov model are used to identify gene structures in DNA sequences (see Section 4.4). An observation in such a model is a sequence over the alphabet Σ′ = {A, C, G, T}.
After a short introduction to inference functions, we present the main result of this chapter in Section 9.2. We call it the Few Inference Functions Theorem, and it states that in any graphical model the number of inference functions grows polynomially if the number of parameters is fixed. This theorem shows that most functions from the set of observations to possible values of the hidden data cannot be inference functions for any choice of the model parameters.
We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algebraic theory of statistical models (as developed in Chapters 1 and 3). Our basic tool is Singular Value Decomposition (SVD) from numerical linear algebra.
Starting with an alignment of n DNA sequences, we show that SVD allows us to quickly decide whether a split of the taxa occurs in their phylogenetic tree, assuming only that evolution follows a tree Markov model. Using this fact, we have developed an algorithm to construct a phylogenetic tree by computing only O(n²) SVDs.
We have implemented this algorithm using the SVDLIBC library (available at http://tedlab.mit.edu/~dr/SVDLIBC/) and have done extensive testing with simulated and real data. The algorithm is fast in practice on trees with 20–30 taxa.
We begin by describing the general Markov model and then show how to flatten the joint probability distribution along a partition of the leaves in the tree. We give rank conditions for the resulting matrix; most notably, we give a set of new rank conditions that are satisfied by non-splits in the tree. Armed with these rank conditions, we present the tree-building algorithm, using SVD to calculate how close a matrix is to a certain rank. Finally, we give experimental results on the behavior of the algorithm with both simulated and real-life (ENCODE) data.
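To make the rank test concrete: by the Eckart–Young theorem, the Frobenius distance from a matrix to the nearest rank-k matrix is the square root of the sum of its squared singular values beyond the k-th. The following numpy sketch scores a candidate split this way (our own illustration; the implementation described above uses SVDLIBC, and k = 4 reflects the four DNA states of the general Markov model):

    import numpy as np

    def distance_to_rank(M, k):
        # Eckart-Young: the Frobenius distance from M to the nearest
        # rank-k matrix is sqrt(sigma_{k+1}^2 + sigma_{k+2}^2 + ...).
        sigma = np.linalg.svd(M, compute_uv=False)
        return float(np.sqrt(np.sum(sigma[k:] ** 2)))

    def split_score(P, left, k=4):
        # P is the empirical joint distribution over the n leaves
        # (labeled 0, ..., n-1), an array of shape (4,) * n estimated
        # from alignment columns; left is a list of leaf indices.
        # Flattening P along the split (left | right) gives a
        # 4^|left| x 4^|right| matrix; a small distance to rank k is
        # evidence that the split occurs in the tree.
        n = P.ndim
        right = [i for i in range(n) if i not in left]
        M = np.transpose(P, axes=list(left) + right).reshape(
            4 ** len(left), 4 ** len(right))
        return distance_to_rank(M, k)

Scoring candidate splits with such a statistic is what allows the whole tree to be recovered with only O(n²) SVD computations.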
The general Markov model
We assume that evolution follows a tree Markov model, as introduced in Section 1.4, with evolution acting independently at different sites of the genome.
Graphical models are powerful statistical tools that have been applied to a wide variety of problems in computational biology: sequence alignment, ancestral genome reconstruction, etc. A graphical model consists of a graph whose vertices have associated random variables representing biological objects, such as entries in a DNA sequence, and whose edges have associated parameters that model transition or dependence relations between the random variables at the nodes. In many cases we will know the contents of only a subset of the model vertices, the observed random variables, and nothing about the contents of the remaining ones, the hidden random variables. A common example is a phylogenetic tree on a set of current species with given DNA sequences, but with no information about the DNA of their extinct ancestors. The task of finding the most likely set of values of the hidden random variables (also known as the explanation) given the set of observed random variables and the model parameters, is known as inference in graphical models.
Clearly, inference drawn about the hidden data is highly dependent on the topology and parameters (transition probabilities) of the graphical model. The topology of the model will be determined by the biological process being modeled, while the assumptions one can make about the nature of evolution, site mutation and other biological phenomena, allow us to restrict the space of possible transition probabilities to certain parameterized families. This raises several questions.
Using homologous sequences from eight vertebrates, we present a concrete example of the estimation of mutation rates in the models of evolution introduced in Chapter 4. We detail the process of data selection from a multiple alignment of the ENCODE regions, and compare rate estimates for each of the models in the Felsenstein hierarchy of Figure 4.7. We also address a standing problem in vertebrate evolution, namely the resolution of the phylogeny of the Eutherian orders, and discuss several challenges of molecular sequence analysis in inferring the phylogeny of this subclass. In particular, we consider the question of the position of the rodents relative to the primates, carnivores and artiodactyls; we affectionately dub this question the rodent problem.
Estimating mutation rates
Given an alignment of sequence homologs from various taxa, and an evolutionary model from Section 4.5, we are naturally led to ask the question, “what tree (with what branch lengths) and what values of the parameters in the rate matrix for that model are suggested by the alignment?” One answer to this question, the so-called maximum-likelihood solution, is, “the tree and rate parameters which maximize the probability that the given alignment would be generated by the given model.” (See also Sections 1.3 and 3.3.)
There are a number of available software packages which attempt to find, to varying degrees, this maximum-likelihood solution. For example, for a few of the most restrictive models in the Felsenstein hierarchy, the package PHYLIP [Felsenstein, 2004] will very efficiently search the tree space for the maximum-likelihood tree and rate parameters.
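To give the flavor of such a computation: for the Jukes–Cantor model, the simplest model in the hierarchy, the maximum-likelihood branch length between two aligned sequences has a closed form. The sketch below applies this standard textbook formula (our own illustration, not code from PHYLIP):

    import math

    def jukes_cantor_ml_distance(seq1, seq2):
        # ML branch length (expected substitutions per site) between two
        # aligned gap-free sequences under the Jukes-Cantor model: if p
        # is the fraction of mismatched sites, the likelihood is
        # maximized at d = -(3/4) * log(1 - (4/3) * p).
        assert len(seq1) == len(seq2)
        p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
        if p >= 0.75:  # saturation: the ML distance is infinite
            return math.inf
        return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

    print(jukes_cantor_ml_distance("ACGTACGTAC", "ACGTACGAAC"))  # ~0.107

For richer models in the hierarchy there is no such closed form, and packages like PHYLIP maximize the likelihood numerically over branch lengths and rate parameters.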
Countdown is the name of a game in which one is given a list of source numbers and a target number, with the aim of building an arithmetic expression out of the source numbers to get as close to the target as possible. Starting with a relational specification we derive a number of functional programs for solving Countdown. These programs are obtained by exploiting the properties of the folds and unfolds of various data types, a style of programming Gibbons has aptly called origami programming. Countdown is attractive as a case study in origami programming, both as an illustration of how different algorithms can emerge from a single specification and as an exercise in the space and time trade-offs that have to be taken into account in comparing functional programs.
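To fix ideas before any derivation, a naive brute-force solver is easy to state (a Python sketch of our own, not one of the origami programs derived in the paper): every source number may be used at most once, and every intermediate result must be a positive natural number.

    from itertools import permutations

    OPS = [("+", lambda x, y: x + y), ("-", lambda x, y: x - y),
           ("*", lambda x, y: x * y), ("/", lambda x, y: x // y)]

    def valid(op, x, y):
        # Intermediate results must stay positive naturals.
        if op == "-": return x > y
        if op == "/": return y != 0 and x % y == 0
        return True

    def exprs(ns):
        # Yield (value, string) for every expression built from exactly
        # the numbers in ns, in that order.
        if len(ns) == 1:
            yield ns[0], str(ns[0])
            return
        for i in range(1, len(ns)):
            for lv, ls in exprs(ns[:i]):
                for rv, rs in exprs(ns[i:]):
                    for op, f in OPS:
                        if valid(op, lv, rv):
                            yield f(lv, rv), "(" + ls + op + rs + ")"

    def countdown(sources, target):
        # Return (value, expression) closest to the target over all
        # orderings of all nonempty subsets of the sources.
        best = None
        for k in range(1, len(sources) + 1):
            for perm in permutations(sources, k):
                for v, s in exprs(list(perm)):
                    if best is None or abs(v - target) < abs(best[0] - target):
                        best = (v, s)
        return best

    print(countdown([1, 3, 7, 10], 47))  # finds value 47, e.g. '(((1+3)*10)+7)'

This exhaustive search enumerates many duplicate expressions; the programs derived in the paper improve on exactly this kind of inefficiency.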
Functional reasoning (FR) enables people to derive and explain the function of artifacts in a goal-oriented manner. FR has been studied and employed in various disciplines, including philosophy, biology, sociology, and engineering design, and enhanced by techniques borrowed from computer science and artificial intelligence. The outcomes of FR research have been applied to engineering design, planning, explanation, and learning. A typical FR system in engineering design incorporates representational mechanisms for the function concept, description mechanisms for state, structure, or behavior, and reasoning mechanisms to derive and explain functions. As for representation, philosophers have long argued over whether the function of an artifact is a genuine property of it. As for explanation and reasoning, they have produced theories of functional ascription by an external viewer as part of an explanation. To build an FR-based system, the theory on which the system is built and its underlying assumptions must be explicitly identified. This point is not always clear in the engineering of FR-based systems. Understanding the underlying assumptions, logical formulation, and limitations of FR theories will help developers assess their systems correctly. The purpose of this paper is to review various FR theories together with their underlying assumptions and limitations; this review then serves as a benchmark for comparing various FR techniques.
Inspiration is useful for exploration and discovery of new solution spaces. Systems in the natural and artificial worlds and their functionality are rich sources of inspiration for idea generation. However, unlike in the artificial domain, where existing systems are often used for inspiration, those from the natural domain are rarely used in a systematic way for this purpose. Analogy has long been regarded as a powerful means of inspiring novel idea generation. One aim of the work reported here is to initiate similar work in the area of systematic biomimetics for product development, so that inspiration from both the natural and artificial worlds can be used systematically to help develop novel, analogical ideas for solving design problems. A generic model for representing causality in natural and artificial systems has been developed, and used to structure information in a database of systems from both domains. These are implemented in a piece of software for automated analogical search of relevant ideas from the databases to solve a given problem. Preliminary experiments to validate the software indicate substantial potential for the approach.
This paper is an informal description of some recent insights about what a device function is, how it arises in response to needs, and how function arises from the structure of a device and the functions of its components. These results formalize and clarify a set of contending intuitions about function that researchers have had. The paper relates the approaches, results, and goals of this stream of research, called functional representation (FR), with the functional modeling (FM) stream in engineering. Despite the occurrence of the term function in the two streams, often the results and techniques in the two streams appear not to have much to do with each other. I argue that, in fact, the two streams are performing research that is mutually complementary. FR research provides the basic layer for device ontology in a formal framework that helps to clarify the meanings of terms such as function and structure, and also to support representation of device knowledge for automated reasoning. FM research provides another layer in device ontology, by attempting to identify behavior primitives that are applicable to subsets of devices, with the hope that functions can be described in those domains with an economy of terms. This can lead to useful catalogs of functions and devices in specific areas of engineering. With increased attention to formalization, the work in FM can provide domain-specific terms for FR research in knowledge representation and automated reasoning.
In engineering design, the end goal is the creation of an artifact, product, system, or process that fulfills some functional requirements at some desired level of performance. As such, knowledge of functionality is essential in a wide variety of tasks in engineering activities, including modeling, generation, modification, visualization, explanation, evaluation, diagnosis, and repair of these artifacts and processes. A formal representation of functionality is essential for supporting any of these activities on computers. The goal of Parts 1 and 2 of this Special Issue is to bring together the state of knowledge of representing functionality in engineering applications from both the engineering and the artificial intelligence (AI) research communities.
The need to model and to reason about design alternatives throughout the design process demands robust representation schemes of function, behavior, and structure. Function describes the physical effect imposed on an energy or material flow by a design entity without regard for the working principles or physical solutions used to accomplish this effect. Behaviors are the physical events associated with a physical artifact (or hypothesized concept) over time (or simulated time) as perceived by an observer. Structure, the most tangible concept, partitions an artifact into meaningful constituents such as features, Wirk elements, and interfaces in addition to the widely used assemblies and components. The focus of this work is on defining a model for function-based representations that can be used across various design methodologies and for a variety of design tasks throughout all stages of the design process. In particular, the mapping between function and structure is explored and, to a lesser extent, its impact on behavior is noted. Clearly, the issues of a function-based representation's composition and mappings directly impact certain computational synthesis methods that rely on (digitally) archived product design knowledge. Moreover, functions have already been related to not only form, but also information of user actions, performance parameters in the form of equations, and failure mode data. It is essential to understand the composition and mappings of functions and their relation to design activities because this information is part of the foundation for function-based methods, and consequently dictates the performance of those methods. Toward this end, the important findings of this work include a formalism for two aspects of function-based representations (composition and mappings), the supported design activities of the model for function-based representations, and examples of how computational design methods benefit from this formalism.
In this mostly expository paper we explain how the Bernstein basis, widely used in computer-aided geometric design, provides an efficient method for real root isolation, using de Casteljau's algorithm. We discuss the link between this approach and more classical methods for real root isolation. We also present a new improved method for isolating real roots in the Bernstein basis inspired by Rouillier and Zimmermann.
Introduction
Real root isolation is an important subroutine in many algorithms of real algebraic geometry [Basu et al. 2003] as well as in exact geometric computations, and is also interesting in its own right.
Our approach to real root isolation is based on properties of the Bernstein basis. We first recall Descartes’ Law of Signs and give a useful partial reciprocal to it. Section 2 contains the definition and main properties of the Bernstein basis. In the third section, several variants of real root isolation based on the Bernstein basis are given. In the fourth section, the link with more classical real root isolation methods [Uspensky 1948] is established. We end the paper with a few remarks on the computational efficiency of the algorithms described.
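To make the subdivision scheme concrete: the number of roots of a polynomial in an interval is bounded by the number of sign variations of its Bernstein coefficients on that interval, and agrees with it in parity. So zero variations rules out roots, one variation isolates a single root, and otherwise de Casteljau's algorithm at t = 1/2 produces the coefficients on the two half-intervals. The Python sketch below (our own minimal version, not the improved method of this paper) assumes no root lies exactly at a subdivision point.

    def variations(b):
        # Number of sign changes in a coefficient sequence (zeros skipped).
        signs = [c > 0 for c in b if c != 0]
        return sum(s != t for s, t in zip(signs, signs[1:]))

    def de_casteljau(b, t=0.5):
        # Split Bernstein coefficients on [a, c] into those on [a, m]
        # and [m, c], where m = a + t * (c - a).
        left, right = [b[0]], [b[-1]]
        row = list(b)
        while len(row) > 1:
            row = [(1 - t) * row[i] + t * row[i + 1] for i in range(len(row) - 1)]
            left.append(row[0])
            right.append(row[-1])
        return left, right[::-1]

    def isolate(b, a, c, out):
        # Collect intervals (a, c) each containing exactly one real root.
        v = variations(b)
        if v == 0:
            return                  # no root in (a, c)
        if v == 1:
            out.append((a, c))      # exactly one root in (a, c)
            return
        m = 0.5 * (a + c)
        left, right = de_casteljau(b)
        isolate(left, a, m, out)
        isolate(right, m, c, out)

    # p(x) = (x - 1/4)(x - 3/4) on [0, 1] has Bernstein coefficients
    # [3/16, -5/16, 3/16]: two sign variations, so the algorithm subdivides.
    out = []
    isolate([3/16, -5/16, 3/16], 0.0, 1.0, out)
    print(out)  # [(0.0, 0.5), (0.5, 1.0)]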