Sequence comparison, particularly when combined with the systematic collection, curation, and search of databases containing biomolecular sequences, has become essential in modern molecular biology. Commenting on the (then) near-completion of the effort to sequence the entire yeast genome (now finished), Stephen Oliver says:
In a short time it will be hard to realize how we managed without the sequence data. Biology will never be the same again. [478]
One fact explains the importance of molecular sequence data and sequence comparison in biology.
The first fact of biological sequence analysis
In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
Evolution reuses, builds on, duplicates, and modifies “successful” structures (proteins, exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.). Life is based on a repertoire of structured and interrelated molecular building blocks that are shared and passed around. The same and related molecular structures and mechanisms show up repeatedly in the genome of a single species and across a very wide spectrum of divergent species. “Duplication with modification” [127, 128, 129, 130] is the central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones. Doolittle emphasizes this point as follows:
The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications.
We will see many applications of suffix trees throughout the book. Most of these applications allow surprisingly efficient, linear-time solutions to complex string problems. Some of the most impressive applications need an additional tool, the constant-time lowest common ancestor algorithm, and so are deferred until that algorithm has been discussed (in Chapter 8). Other applications arise in the context of specific problems that will be discussed in detail later. But there are many applications we can now discuss that illustrate the power and utility of suffix trees. In this chapter and in the exercises at its end, several of these applications will be explored.
Perhaps the best way to appreciate the power of suffix trees is for the reader to spend some time trying to solve the problems discussed below, without using suffix trees. Without this effort or without some historical perspective, the availability of suffix trees may make certain of the problems appear trivial, even though linear-time algorithms for those problems were unknown before the advent of suffix trees. The longest common substring problem discussed in Section 7.4 is one clear example, where Knuth had conjectured that a linear-time algorithm would not be possible [24, 278], but where such an algorithm is immediate with the use of suffix trees. Another classic example is the longest prefix repeat problem discussed in the exercises, where a linear-time solution using suffix trees is easy, but where the best prior method ran in O(n log n) time.
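To make the suffix-tree connection concrete, here is a minimal sketch of the longest common substring idea in Python. It builds a naive generalized suffix trie in quadratic time and space rather than a true linear-time suffix tree, and the names in it are illustrative only: the longest common substring corresponds to the deepest node reached by suffixes of both strings.

    def longest_common_substring(s1, s2):
        # Build a generalized suffix trie: every suffix of both strings is
        # inserted character by character; each node records which of the two
        # strings contributed suffixes passing through it.
        root = {"children": {}, "sources": set()}
        for label, s in ((1, s1), (2, s2)):
            for start in range(len(s)):
                node = root
                for ch in s[start:]:
                    node = node["children"].setdefault(
                        ch, {"children": {}, "sources": set()})
                    node["sources"].add(label)

        # The answer is the deepest trie path whose node carries suffixes
        # from both strings.
        best = ""
        stack = [(root, "")]
        while stack:
            node, path = stack.pop()
            if node["sources"] >= {1, 2} and len(path) > len(best):
                best = path
            for ch, child in node["children"].items():
                if child["sources"] >= {1, 2}:
                    stack.append((child, path + ch))
        return best

    print(longest_common_substring("superiorcalifornialives", "sealiver"))  # alive

The quadratic trie is only for intuition; the point of the linear-time suffix tree methods discussed in this book is that the same deepest-common-node idea can be applied in time linear in the total string length.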
With the ability to solve lowest common ancestor queries in constant time, suffix trees can be used to solve many additional string problems. Many of those applications move from the domain of exact matching to the domain of inexact, or approximate, matching (matching with some errors permitted). This chapter illustrates that point with several examples.
Longest common extension: a bridge to inexact matching
The longest common extension problem is solved as a subtask in many classic string algorithms. It is at the heart of all but the last application discussed in this chapter and is central to the k-difference algorithm discussed in Section 12.2.
Longest common extension problem: Two strings S1 and S2 of total length n are first specified in a preprocessing phase. Later, a long sequence of index pairs is specified. For each specified index pair (i, j), we must find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, we must find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2 (see Figure 9.1).
Of course, any time an index pair is specified, the longest common extension can be found by direct search in time proportional to the length of the match. But the goal is to compute each extension in constant time, independent of the length of the match.
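As a point of reference, the following sketch implements only the direct-search version just described; each query costs time proportional to the match length, whereas the constant-time solution developed in this chapter combines a generalized suffix tree with constant-time lowest common ancestor queries. The function name and the 0-based indexing are our own conventions, not the book's.

    def longest_common_extension(s1, s2, i, j):
        """Length of the longest prefix of s1[i:] matching a prefix of s2[j:].
        Positions i and j are 0-based here, unlike the book's 1-based indexing."""
        length = 0
        while (i + length < len(s1) and j + length < len(s2)
               and s1[i + length] == s2[j + length]):
            length += 1
        return length

    # Example: suffix "abcxq" of s1 and suffix "abcyq" of s2 share the prefix "abc".
    print(longest_common_extension("zabcxq", "abcyq", 1, 0))  # 3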
In Section 15.11.3, we discussed the canonical advice of translating any newly sequenced gene into a derived amino acid sequence to search the protein databases for similarities to the new sequence. This is in contrast to searching DNA databases with the original DNA string. There is, however, a technical problem with using derived amino acid sequences. If a single nucleotide is missing from the DNA transcript, then the reading frame of the succeeding DNA will be changed (see Figure 18.1). A similar problem occurs if a nucleotide is incorrectly inserted into the transcript. Until the correct reading frame is reestablished (through additional errors), most of the translated amino acids will be incorrect, invalidating most comparisons made to the derived amino acid sequence.
Insertion and deletion errors during DNA sequencing are fairly common, so frameshift errors can be serious in the subsequent analysis. Those errors are in addition to any substitution errors that leave the reading frame unchanged. Moreover, informative alignments often contain a relatively small number of exactly matching characters and larger regions of more poorly aligned substrings (see Section 11.7 on local alignment). Therefore, two substrings that would align well without a frameshift error, but align poorly with one, can easily be mistaken for regions that align poorly due only to substitution errors. Thus, without some additional technique, it is easy to miss frameshift errors and hard to correct them.
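The effect of a frameshift is easy to see with a small, hypothetical example: deleting a single nucleotide shifts every downstream codon boundary, so most of the translated amino acids change. The sketch below uses a deliberately tiny codon table and made-up sequences purely for illustration.

    # Toy illustration of a frameshift: only a handful of codons are listed;
    # this is not a full genetic code.
    CODON_TABLE = {
        "ATG": "M", "GCT": "A", "GAA": "E", "TTT": "F",
        "GGC": "G", "AAA": "K", "TGA": "*",  # '*' marks a stop codon
    }

    def translate(dna):
        """Translate codon by codon in reading frame 0; '?' marks codons not in the toy table."""
        return "".join(CODON_TABLE.get(dna[i:i + 3], "?")
                       for i in range(0, len(dna) - 2, 3))

    original = "ATGGCTGAATTTGGCAAATGA"
    mutated = original[:4] + original[5:]   # drop one nucleotide after the first codon

    print(translate(original))  # MAEFGK*  -- the intended reading frame
    print(translate(mutated))   # every codon downstream of the deletion is scrambled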
We now turn to the dominant, most mature, and most successful application of string algorithms in computational biology: the building and searching of databases holding molecular sequence data. We start by illustrating some uses of sequence databases and by describing a bit about existing databases. The impact of database searching (and the expectation of greater future impact) explains a large part of the interest among biologists in algorithms that search, manipulate, and compare strings. In turn, the biologists' activities have stimulated additional interest in string algorithms among computer scientists.
After describing the “why” and the “what” of sequence databases, we will discuss some “how” issues in string algorithms that are particular to database organization and search.
Why sequence databases?
Comprehensive databases/archives holding DNA and protein sequences are firmly established as central tools in current molecular biology – “electronic databases are fast becoming the lifeblood of the field” [452]. The fundamental reason is the power of biomolecular sequence comparison. This was made explicit in the first fact of biological sequence analysis (Chapter 10, page 212). Given the effectiveness of sequence comparison in molecular biology, it is natural to stockpile and systematically organize the biosequences to be compared; this has naturally led to the growth of sequence databases. We start this chapter with a few illustrations of the power of sequence comparison in the form of sequence database search.
The dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by a splitting of one population into two (or more) populations that do not cross-breed, rather than by a mixing of two populations into one. Therefore, the high-level history of life is ideally organized and displayed as a rooted, directed tree. The extant species (and some of the extinct species) are represented at the leaves of the tree, each internal node represents a point when the history of two sets of species diverged (or represents a common ancestor of those species), the length and direction of each edge represent the passage of time or the evolutionary events that occur in that time, and so the path from the root of the tree to each leaf represents the evolutionary history of the organisms represented there. To quote Darwin:
… the great Tree of Life fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications. [119]
This view of the history of life as a tree must frequently be modified when considering the evolution of viruses, or even bacteria or individual genes, but it remains the dominant way that high-level evolution is viewed in current biology. Hundreds (maybe thousands) of papers are published yearly that depict deduced evolutionary trees.
In the previous three parts of the book we developed general techniques and specific string algorithms whose importance is either already well established or is likely to be established. We expect that the material of those three parts will be relevant to the field of string algorithms and molecular sequence analysis for many years to come. In this final part of the book we branch out from well established techniques and from problems strictly defined on strings. We do this in three ways.
First, we discuss techniques that are very current but may not stand the test of time, although they may lead to more powerful and effective methods. Similarly, we discuss string problems that are tied to current technology in molecular biology but may become less important as that technology changes.
Second, we discuss problems, such as physical mapping, fragment assembly, and building phylogenetic (evolutionary) trees, that, although related to string problems, are not themselves string problems. These cousins of string problems either motivate specific pure string problems or motivate string problems generally by providing a more complete picture of how biological sequence data are obtained, or they use the output of pure string algorithms.
Third, we introduce a few important cameo topics without giving as much depth and detail as has generally been given to other topics in the book.
Of course, some topics to be presented in this final part of the book cross the three categories and are simultaneously currents, cousins, and cameos.
We will present two methods for constructing suffix trees in detail, Ukkonen's method and Weiner's method. Weiner was the first to show that suffix trees can be built in linear time, and his method is presented both for its historical importance and for some different technical ideas that it contains. However, Ukkonen's method is equally fast and uses far less space (i.e., memory) in practice than Weiner's method. Hence Ukkonen's method is the method of choice for most problems requiring the construction of a suffix tree. We also believe that Ukkonen's method is easier to understand. Therefore, it will be presented first. A reader who wishes to study only one method is advised to concentrate on it. However, our development of Weiner's method does not depend on understanding Ukkonen's algorithm, and the two algorithms can be read independently (with one small shared section noted in the description of Weiner's method).
Ukkonen's linear-time suffix tree algorithm
Esko Ukkonen [438] devised a linear-time algorithm for constructing a suffix tree that may be the conceptually easiest linear-time construction algorithm. This algorithm has a space-saving improvement over Weiner's algorithm (which was achieved first in the development of McCreight's algorithm), and it has a certain “on-line” property that may be useful in some situations. We will describe that on-line property but emphasize that the main virtue of Ukkonen's algorithm is the simplicity of its description, proof, and time analysis.
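To give a feel for the on-line property, the following toy sketch extends a naive suffix trie one character at a time, so that after each character it holds a suffix trie of the prefix read so far. This is emphatically not Ukkonen's algorithm: it takes quadratic time and builds an uncompressed trie, whereas Ukkonen's method maintains an implicit (edge-compressed) suffix tree of each prefix in overall linear time.

    def online_suffix_trie(text):
        """Naive on-line construction: after each character, yield a suffix trie
        (nested dicts) of the prefix read so far. Quadratic time; for intuition only."""
        root = {}
        active = []               # one trie node per suffix currently being extended
        for ch in text:
            active.append(root)   # a new suffix begins at every position
            next_active = []
            for node in active:
                child = node.setdefault(ch, {})
                next_active.append(child)
            active = next_active
            yield root            # suffix trie of the prefix processed so far

    for step, trie in enumerate(online_suffix_trie("xabxa"), start=1):
        print("prefix of length", step, ":", trie)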
A look at some DNA mapping and sequencing problems
In this chapter we consider a number of theoretical and practical issues in creating and using genome maps and in large-scale (genomic) DNA sequencing. These areas are considered in this book for two reasons: First, we want to more completely explain the origin of molecular sequence data, since string problems on such data provide a large part of the motivation for studying string algorithms in general. Second, we need to more completely explain specific problems on strings that arise in obtaining molecular sequence data.
We start with a discussion of mapping in general and the distinction between physical maps and genetic maps. This leads to the discussion of several physical mapping techniques such as STS-content mapping and radiation-hybrid mapping. Our discussion emphasizes the combinatorial and computational aspects common to those techniques. We follow with a discussion of the tightest layout problem, and a short introduction to map comparison and map alignment. Then we move to large-scale sequencing and its relation to physical mapping. We emphasize shotgun sequencing and the string problems involved in sequence assembly under the shotgun strategy. Shotgun sequencing leads naturally to a beautiful pure string problem, the shortest common superstring problem. This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained.
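To make the shotgun-assembly connection concrete, the sketch below runs the well-known greedy merge heuristic for the shortest common superstring problem: repeatedly merge the two fragments with the longest suffix-prefix overlap. This is a standard approximation strategy shown here only for illustration, with toy fragments that we invented; it is not presented as the exposition or data used later in the chapter.

    def overlap(a, b):
        """Length of the longest suffix of a that is a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(fragments):
        # Discard fragments contained in other fragments, then repeatedly merge
        # the pair with the largest suffix-prefix overlap.
        frags = [f for f in fragments if not any(f != g and f in g for g in fragments)]
        while len(frags) > 1:
            k, i, j = max(((overlap(a, b), i, j)
                           for i, a in enumerate(frags)
                           for j, b in enumerate(frags) if i != j),
                          key=lambda t: t[0])
            merged = frags[i] + frags[j][k:]
            frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
        return frags[0]

    # Toy "shotgun fragments" (hypothetical data) covering the string "alkbatcgka":
    print(greedy_superstring(["alkbat", "batcg", "tcgka"]))  # alkbatcgka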
In this chapter we begin the discussion of multiple string comparison, one of the most important methodological issues and most active research areas in current biological sequence analysis. We first discuss some of the reasons for the importance of multiple string comparison in molecular biology. Then we will examine multiple string alignment, one common way that multiple string comparison has been formalized. We will precisely define three variants of the multiple alignment problem and consider in depth algorithms for attacking those problems. Other variants will be sketched in this chapter; additional multiple alignment issues will be discussed in Part IV.
Why multiple string comparison?
For a computer scientist, the multiple string comparison problem may at first seem like a generalization for generalization's sake – “two strings good, four strings better”. But in the context of molecular biology, multiple string comparison (of DNA, RNA, or protein strings) is much more than a technical exercise. It is the most critical cutting-edge tool for extracting and representing biologically important, yet faint or widely dispersed, commonalities from a set of strings. These (faint) commonalities may reveal evolutionary history, critical conserved motifs or conserved characters in DNA or protein, common two- and three-dimensional molecular structure, or clues about the common biological function of the strings. Such commonalities are also used to characterize families or superfamilies of proteins. These characterizations are then used in database searches to identify other potential members of a family.
Although I didn't know it at the time, I began writing this book in the summer of 1988 when I was part of a computer science (early bioinformatics) research group at the Human Genome Center of Lawrence Berkeley Laboratory. Our group followed the standard assumption that biologically meaningful results could come from considering DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which even the classic linear chromosome exists for only a fraction of the time. A similar, but stronger, assumption existed for protein, holding, for example, that all the information needed for correct three-dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in. This assumption has recently been modified, but remains largely intact [297].
For nonbiologists, these two assumptions were (and remain) a godsend, allowing rapid entry into an exciting and important field. Reinforcing the importance of sequence-level investigation were statements such as:
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. [352]
Inequalities for martingales with bounded differences have recently proved to be very useful in combinatorics and in the mathematics of operational research and computer science. We see here that these inequalities extend in a natural way to ‘centering sequences’ with bounded differences, and thus include, for example, better inequalities for sequences related to sampling without replacement.
Considering strings over a finite alphabet 𝒜, say that a string is w-avoiding if it does not contain w as a substring. It is known that the number a_w(n) of w-avoiding strings of length n depends only on the autocorrelation of w as defined by Guibas–Odlyzko. We give a simple criterion on the autocorrelations of w and w′ for determining whether a_w(n) > a_w′(n) for all large enough n.
The prime factorization of a random integer has a GEM/Poisson-Dirichlet distribution, as transparently proved by Donnelly and Grimmett [8]. By analogy with the arc-sine law for the mean distribution of the divisors of a random integer, due to Deshouillers, Dress and Tenenbaum [6] (see also Tenenbaum [24, II.6.2, p. 233]) – the ‘DDT theorem’ – we obtain an arc-sine law in the GEM/Poisson-Dirichlet context. In this context we also investigate the distribution of the number of components larger than ε, which corresponds to the number of prime factors larger than n^ε.
We are interested in a function f(p) that represents the probability that a random subset of edges of a Δ-regular graph G contains half the edges of some cycle of G. f(p) is also the probability that a codeword is corrupted beyond recognition when words of the cycle code of G are submitted to the binary symmetric channel. We derive a precise upper bound on the largest p for which f(p) can vanish when the number of edges of G goes to infinity. To this end, we introduce the notion of fractional percolation on trees, and calculate the related critical probabilities.
Let ℳ_{n,k}(S) be the set of n-edge k-vertex rooted maps in some class on the surface S. Let P be a planar map in the class. We develop a method for showing that almost all maps in ℳ_{n,k}(S) contain many copies of P. One consequence of this is that almost all maps in ℳ_{n,k}(S) have no symmetries. The classes considered include c-connected maps (c ≤ 3) and certain families of degree-restricted maps.
A tournament T on a set V of n players is an orientation of the edges of the complete graph K_n on V; T will be called a random tournament if the directions of these edges are determined by a sequence {Y_j : j = 1, …, (n choose 2)} of independent coin flips. If (y, x) is an edge in a (random) tournament, we say that y beats x. A set A ⊂ V, |A| = k, is said to be beaten if there exists a player y ∉ A such that y beats x for each x ∈ A. If such a y does not exist, we say that A is unbeaten. A (random) tournament on V is said to have property S_k if each k-element subset of V is beaten. In this paper, we use the Stein–Chen method to show that the probability distribution of the number W_0 of unbeaten k-subsets of V can be well-approximated by that of a Poisson random variable with the same mean; an improved condition for the existence of tournaments with property S_k is derived as a corollary. A multivariate version of this result is proved next: with W_j representing the number of k-subsets that are beaten by precisely j external vertices, j = 0, 1, …, b, it is shown that the joint distribution of (W_0, W_1, …, W_b) can be approximated by a multidimensional Poisson vector with independent components, provided that b is not too large.
Assemblies are labelled combinatorial objects that can be decomposed into components. Examples of assemblies include set partitions, permutations and random mappings. In addition, a distribution from population genetics called the Ewens sampling formula may be treated as an assembly. Each assembly has a size n, and the sizes of its components sum to n. When the uniform distribution is put on all assemblies of size n, the process of component counts is equal in distribution to a process of independent Poisson variables Z_i conditioned on the event that a weighted sum of the independent variables is equal to n. Logarithmic assemblies are assemblies characterized by some θ > 0 for which i E[Z_i] → θ. Permutations and random mappings are logarithmic assemblies; set partitions are not. Suppose b = b(n) is a sequence of positive integers for which b/n → β ∈ (0, 1]. For logarithmic assemblies, the total variation distance d_b(n) between the laws of the first b coordinates of the component counting process and of the first b coordinates of the independent processes converges to a constant H(β). An explicit formula for H(β) is given for β ∈ (0, 1] in terms of a limit process which depends only on the parameter θ. Also, it is shown that d_b(n) → 0 if and only if b/n → 0, generalizing results of Arratia, Barbour and Tavaré for the Ewens sampling formula. Local limit theorems for weighted sums of the Z_i are used to prove these results.
A model for a random random-walk on a finite group is developed where the group elements that generate the random-walk are chosen uniformly and with replacement from the group. When the group is the d-cube Z_2^d, it is shown that if the generating set is of size k then as d → ∞ with k − d → ∞ almost all of the random-walks converge to uniform in k ln(k/(k − d))/4 + ρk steps, where ρ is any constant satisfying ρ > −ln(ln 2)/4.