In the previous three parts of the book we developed general techniques and specific string algorithms whose importance is either already well established or is likely to be established. We expect that the material of those three parts will be relevant to the field of string algorithms and molecular sequence analysis for many years to come. In this final part of the book we branch out from well established techniques and from problems strictly defined on strings. We do this in three ways.
First, we discuss techniques that are very current but may not stand the test of time although they may lead to more powerful and effective methods. Similarly, we discuss string problems that are tied to current technology in molecular biology but may become less important as that technology changes.
Second, we discuss problems, such as physical mapping, fragment assembly, and building phylogenetic (evolutionary) trees, that, although related to string problems, are not themselves string problems. These cousins of string problems either motivate specific pure string problems or motivate string problems generally by providing a more complete picture of how biological sequence data are obtained, or they use the output of pure string algorithms.
Third, we introduce a few important cameo topics without giving as much depth and detail as has generally been given to other topics in the book.
Of course, some topics to be presented in this final part of the book cross the three categories and are simultaneously currents, cousins, and cameos.
We will present two methods for constructing suffix trees in detail, Ukkonen's method and Weiner's method. Weiner was the first to show that suffix trees can be built in linear time, and his method is presented both for its historical importance and for some different technical ideas it contains. However, Ukkonen's method is equally fast and in practice uses far less space (i.e., memory) than Weiner's method. Hence Ukkonen's method is the method of choice for most problems requiring the construction of a suffix tree. We also believe that it is easier to understand, so it will be presented first. A reader who wishes to study only one method is advised to concentrate on Ukkonen's. However, our development of Weiner's method does not depend on understanding Ukkonen's algorithm, and the two algorithms can be read independently (with one small shared section noted in the description of Weiner's method).
Ukkonen's linear-time suffix tree algorithm
Esko Ukkonen [438] devised a linear-time algorithm for constructing a suffix tree that may be the conceptually easiest linear-time construction algorithm. This algorithm has a space-saving improvement over Weiner's algorithm (which was achieved first in the development of McCreight's algorithm), and it has a certain “on-line” property that may be useful in some situations. We will describe that on-line property but emphasize that the main virtue of Ukkonen's algorithm is the simplicity of its description, proof, and time analysis.
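Ukkonen's algorithm itself needs careful machinery (suffix links and edge-label tricks) to reach linear time, and we defer those details. As a purely structural illustration of what any construction must produce, here is a hedged Python sketch of a naive quadratic builder, with pattern lookup over the finished tree. This is not Ukkonen's method, and all names are ours.

```python
class Node:
    """A suffix tree node; children maps first character -> (edge_label, child)."""
    def __init__(self):
        self.children = {}

def naive_suffix_tree(s):
    # Append a terminal symbol so no suffix is a prefix of another.
    s += "$"
    root = Node()
    for i in range(len(s)):          # insert each suffix s[i:] in turn
        node, suf = root, s[i:]
        while True:
            c = suf[0]
            if c not in node.children:
                node.children[c] = (suf, Node())   # new leaf edge
                break
            label, child = node.children[c]
            # longest common prefix of the edge label and the remaining suffix
            j = 0
            while j < len(label) and j < len(suf) and label[j] == suf[j]:
                j += 1
            if j == len(label):
                node, suf = child, suf[j:]         # descend and continue
            else:
                mid = Node()                       # split the edge at position j
                mid.children[label[j]] = (label[j:], child)
                mid.children[suf[j]] = (suf[j:], Node())
                node.children[c] = (label[:j], mid)
                break
    return root

def contains(root, pattern):
    """True iff pattern occurs as a substring of the indexed string."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return False
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return False
        node, rest = child, rest[k:]
    return True

def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())
```

The tree has one leaf per suffix, and any pattern occurring in the string labels a path from the root, which is exactly the property the linear-time constructions preserve.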
A look at some DNA mapping and sequencing problems
In this chapter we consider a number of theoretical and practical issues in creating and using genome maps and in large-scale (genomic) DNA sequencing. These areas are considered in this book for two reasons: First, we want to more completely explain the origin of molecular sequence data, since string problems on such data provide a large part of the motivation for studying string algorithms in general. Second, we need to more completely explain specific problems on strings that arise in obtaining molecular sequence data.
We start with a discussion of mapping in general and the distinction between physical maps and genetic maps. This leads to the discussion of several physical mapping techniques such as STS-content mapping and radiation-hybrid mapping. Our discussion emphasizes the combinatorial and computational aspects common to those techniques. We follow with a discussion of the tightest layout problem, and a short introduction to map comparison and map alignment. Then we move to large-scale sequencing and its relation to physical mapping. We emphasize shotgun sequencing and the string problems involved in sequence assembly under the shotgun strategy. Shotgun sequencing leads naturally to a beautiful pure string problem, the shortest common superstring problem. This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained.
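The shortest common superstring problem is NP-hard in general, and much of the elegance mentioned above concerns the well-known greedy heuristic: repeatedly merge the pair of strings with the largest suffix-prefix overlap. As a hedged sketch (helper names are ours; this is a heuristic, not an exact algorithm):

```python
def overlap(a, b):
    # length of the longest suffix of a that is also a prefix of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    # Remove duplicates and any string contained in another string.
    strs = list(dict.fromkeys(
        s for s in strings if not any(s != t and s in t for t in strings)))
    # Repeatedly merge the pair with the largest overlap.
    while len(strs) > 1:
        best_k, bi, bj = -1, 0, 1
        for i, a in enumerate(strs):
            for j, b in enumerate(strs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, bi, bj = k, i, j
        merged = strs[bi] + strs[bj][best_k:]
        strs = [s for idx, s in enumerate(strs) if idx not in (bi, bj)]
        strs.append(merged)
    return strs[0]
```

For example, greedy_scs(["cde", "abc", "bcd"]) merges down to "abcde", which contains all three fragments.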
In this chapter we begin the discussion of multiple string comparison, one of the most important methodological issues and most active research areas in current biological sequence analysis. We first discuss some of the reasons for the importance of multiple string comparison in molecular biology. Then we will examine multiple string alignment, one common way that multiple string comparison has been formalized. We will precisely define three variants of the multiple alignment problem and consider in depth algorithms for attacking those problems. Other variants will be sketched in this chapter; additional multiple alignment issues will be discussed in Part IV.
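All of these formulations build on the two-string alignment dynamic program. As a reminder, here is a hedged Python sketch of the global-alignment score; the scoring parameters are illustrative choices of ours, not values the book prescribes.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    # Classic dynamic program: best[i][j] is the best score aligning
    # x[:i] with y[:j].
    n, m = len(x), len(y)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap            # x[:i] aligned against gaps
    for j in range(1, m + 1):
        best[0][j] = j * gap            # gaps aligned against y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            best[i][j] = max(best[i - 1][j - 1] + sub,   # substitute/match
                             best[i - 1][j] + gap,       # gap in y
                             best[i][j - 1] + gap)       # gap in x
    return best[n][m]
```

For instance, aligning "ACGT" with "AGT" gives three matches and one gap, for a score of 1 under these parameters.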
Why multiple string comparison?
For a computer scientist, the multiple string comparison problem may at first seem like a generalization for generalization's sake – “two strings good, four strings better”. But in the context of molecular biology, multiple string comparison (of DNA, RNA, or protein strings) is much more than a technical exercise. It is the most critical cutting-edge tool for extracting and representing biologically important, yet faint or widely dispersed, commonalities from a set of strings. These (faint) commonalities may reveal evolutionary history, critical conserved motifs or conserved characters in DNA or protein, common two- and three-dimensional molecular structure, or clues about the common biological function of the strings. Such commonalities are also used to characterize families or superfamilies of proteins. These characterizations are then used in database searches to identify other potential members of a family.
Although I didn't know it at the time, I began writing this book in the summer of 1988 when I was part of a computer science (early bioinformatics) research group at the Human Genome Center of Lawrence Berkeley Laboratory. Our group followed the standard assumption that biologically meaningful results could come from considering DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which even the classic linear chromosome exists for only a fraction of the time. A similar, but stronger, assumption existed for protein, holding, for example, that all the information needed for correct three-dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in. This assumption has recently been modified, but remains largely intact [297].
For nonbiologists, these two assumptions were (and remain) a godsend, allowing rapid entry into an exciting and important field. Reinforcing the importance of sequence-level investigation were statements such as:
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. [352]
Inequalities for martingales with bounded differences have recently proved to be very useful in combinatorics and in the mathematics of operational research and computer science. We see here that these inequalities extend in a natural way to ‘centering sequences’ with bounded differences, and thus include, for example, better inequalities for sequences related to sampling without replacement.
Considering strings over a finite alphabet 𝒜, say that a string is w-avoiding if it does not contain w as a substring. It is known that the number a_w(n) of w-avoiding strings of length n depends only on the autocorrelation of w as defined by Guibas–Odlyzko. We give a simple criterion on the autocorrelations of w and w′ for determining whether a_w(n) > a_w′(n) for all large enough n.
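The count a_w(n) is easy to compute with a standard dynamic program over the KMP (failure-function) automaton of w, and that makes the autocorrelation phenomenon checkable numerically: for example, 'aab' and 'abb' have the same autocorrelation, and indeed the same counts. A hedged sketch of ours, using a binary alphabet for concreteness:

```python
def count_avoiding(w, n, alphabet="ab"):
    """Number of length-n strings over alphabet containing no occurrence
    of w, via a DP over the KMP (failure-function) automaton of w."""
    m = len(w)
    fail = [0] * m
    for i in range(1, m):
        j = fail[i - 1]
        while j and w[i] != w[j]:
            j = fail[j - 1]
        fail[i] = j + 1 if w[i] == w[j] else 0

    def step(state, c):
        # automaton transition: longest prefix of w matched after reading c
        while state and w[state] != c:
            state = fail[state - 1]
        return state + 1 if w[state] == c else 0

    counts = [1] + [0] * (m - 1)   # counts[q]: strings of current length in state q
    for _ in range(n):
        new = [0] * m
        for q, cnt in enumerate(counts):
            if cnt:
                for c in alphabet:
                    nq = step(q, c)
                    if nq < m:     # state m would mean w occurred: discard
                        new[nq] += cnt
        counts = new
    return sum(counts)

def brute_count(w, n, alphabet="ab"):
    # exhaustive cross-check for small n
    from itertools import product
    return sum(1 for t in product(alphabet, repeat=n) if w not in "".join(t))
```

The DP runs in O(n·|w|·|alphabet|) time, and the equality of the counts for words with equal autocorrelation is exactly the Guibas–Odlyzko result cited above.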
The prime factorization of a random integer has a GEM/Poisson-Dirichlet distribution, as transparently proved by Donnelly and Grimmett [8]. By analogy with the arc-sine law for the mean distribution of the divisors of a random integer, due to Deshouillers, Dress and Tenenbaum [6] (see also Tenenbaum [24, II.6.2, p. 233]), the ‘DDT theorem’, we obtain an arc-sine law in the GEM/Poisson-Dirichlet context. In this context we also investigate the distribution of the number of components larger than ε, which corresponds to the number of prime factors larger than n^ε.
We are interested in a function f(p) that represents the probability that a random subset of edges of a Δ-regular graph G contains half the edges of some cycle of G. f(p) is also the probability that a codeword is corrupted beyond recognition when words of the cycle code of G are submitted to the binary symmetric channel. We derive a precise upper bound on the largest p for which f(p) can vanish when the number of edges of G goes to infinity. To this end, we introduce the notion of fractional percolation on trees, and calculate the related critical probabilities.
Let ℳn,k(S) be the set of n-edge k-vertex rooted maps in some class on the surface S. Let P be a planar map in the class. We develop a method for showing that almost all maps in ℳn,k(S) contain many copies of P. One consequence of this is that almost all maps in ℳn,k(S) have no symmetries. The classes considered include c-connected maps (c ≤ 3) and certain families of degree-restricted maps.
A tournament T on a set V of n players is an orientation of the edges of the complete graph Kn on V; T will be called a random tournament if the directions of these edges are determined by a sequence {Yj : j = 1, …, (n choose 2)} of independent coin flips. If (y, x) is an edge in a (random) tournament, we say that y beats x. A set A ⊂ V, |A| = k, is said to be beaten if there exists a player y ∉ A such that y beats x for each x ∈ A. If such a y does not exist, we say that A is unbeaten. A (random) tournament on V is said to have property Sk if each k-element subset of V is beaten. In this paper, we use the Stein–Chen method to show that the probability distribution of the number W0 of unbeaten k-subsets of V can be well approximated by that of a Poisson random variable with the same mean; an improved condition for the existence of tournaments with property Sk is derived as a corollary. A multivariate version of this result is proved next: with Wj representing the number of k-subsets that are beaten by precisely j external vertices, j = 0, 1, …, b, it is shown that the joint distribution of (W0, W1, …, Wb) can be approximated by a multidimensional Poisson vector with independent components, provided that b is not too large.
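The mean against which the Poisson approximation is matched is simple to compute: a fixed k-subset is unbeaten exactly when each of the n − k outsiders fails to beat all of its members, giving E[W0] = C(n, k)(1 − 2^−k)^(n−k). The following Monte Carlo sketch (ours, not the paper's) checks this for small n and k:

```python
import random
from itertools import combinations
from math import comb

def count_unbeaten(n, k, rng):
    # Build a uniform random tournament: beats[y][x] is True iff y beats x.
    beats = [[False] * n for _ in range(n)]
    for y in range(n):
        for x in range(y + 1, n):
            if rng.random() < 0.5:
                beats[y][x] = True
            else:
                beats[x][y] = True
    unbeaten = 0
    for A in combinations(range(n), k):
        members = set(A)
        beaten = any(all(beats[y][x] for x in A)
                     for y in range(n) if y not in members)
        if not beaten:
            unbeaten += 1
    return unbeaten

n, k, trials = 8, 2, 2000
rng = random.Random(0)
avg = sum(count_unbeaten(n, k, rng) for _ in range(trials)) / trials
# Each of the n-k outsiders beats both members of a fixed 2-subset with
# probability 1/4, independently, so:
exact_mean = comb(n, k) * (1 - 2 ** -k) ** (n - k)
```

With n = 8 and k = 2 the exact mean is 28·(3/4)^6 ≈ 4.98, and the simulated average lands close to it.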
Assemblies are labelled combinatorial objects that can be decomposed into components. Examples of assemblies include set partitions, permutations and random mappings. In addition, a distribution from population genetics called the Ewens sampling formula may be treated as an assembly. Each assembly has a size n, and the sizes of its components sum to n. When the uniform distribution is put on all assemblies of size n, the process of component counts is equal in distribution to a process of independent Poisson variables Zi conditioned on the event that a weighted sum of the independent variables is equal to n. Logarithmic assemblies are assemblies characterized by some θ > 0 for which iE[Zi] → θ. Permutations and random mappings are logarithmic assemblies; set partitions are not. Suppose b = b(n) is a sequence of positive integers for which b/n → β ∈ (0, 1]. For logarithmic assemblies, the total variation distance db(n) between the laws of the first b coordinates of the component counting process and of the first b coordinates of the independent processes converges to a constant H(β). An explicit formula for H(β) is given for β ∈ (0, 1] in terms of a limit process which depends only on the parameter θ. Also, it is shown that db(n) → 0 if and only if b/n → 0, generalizing results of Arratia, Barbour and Tavaré for the Ewens sampling formula. Local limit theorems for weighted sums of the Zi are used to prove these results.
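For permutations (the case θ = 1) the conditioned-Poisson picture is concrete: the expected number of i-cycles of a uniform random permutation of n elements equals 1/i exactly for i ≤ n, matching E[Zi]. A quick seeded simulation (ours, not the paper's) checking the i = 1 case:

```python
import random

def cycle_type(perm):
    """Map cycle length -> number of cycles of that length in perm
    (a permutation of {0, ..., n-1} given as a list)."""
    n = len(perm)
    seen = [False] * n
    counts = {}
    for i in range(n):
        if not seen[i]:
            length, j = 0, i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
                length += 1
            counts[length] = counts.get(length, 0) + 1
    return counts

n, trials = 10, 4000
rng = random.Random(1)
total_fixed_points = 0
for _ in range(trials):
    p = list(range(n))
    rng.shuffle(p)
    total_fixed_points += cycle_type(p).get(1, 0)
avg_fixed = total_fixed_points / trials   # should be close to E[Z1] = 1
```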
A model for a random random-walk on a finite group is developed, where the group elements that generate the random walk are chosen uniformly and with replacement from the group. When the group is the d-cube Z_2^d, it is shown that if the generating set has size k, then as d → ∞ with k − d → ∞, almost all of the random walks converge to uniform in k ln(k/(k − d))/4 + ρk steps, where ρ is any constant satisfying ρ > −ln(ln 2)/4.
An [n, k, r]-partite graph is a graph whose vertex set, V, can be partitioned into n pairwise-disjoint independent sets, V1, …, Vn, each containing exactly k vertices, such that the subgraph induced by Vi ∪ Vj contains exactly r independent edges, for 1 ≤ i < j ≤ n. An independent transversal in an [n, k, r]-partite graph is an independent set, T, consisting of n vertices, one from each Vi. An independent covering is a set of k pairwise-disjoint independent transversals. Let t(k, r) denote the maximal n for which every [n, k, r]-partite graph contains an independent transversal. Let c(k, r) be the maximal n for which every [n, k, r]-partite graph contains an independent covering. We give upper and lower bounds for these parameters. Furthermore, our bounds are constructive. These results improve and generalize previous results of Erdős, Gyárfás and Łuczak [5] for the case of graphs.
Lemke and Kleitman [2] showed that, given a positive integer d and d (not necessarily distinct) divisors of d, a1, …, ad, there exists a subset Q ⊆ {1, …, d} such that d = ∑i∈Q ai, answering a conjecture of Erdős and Lemke. Here we extend this result, showing that, provided ∑p|d 1/p ≤ 1 (where the sum is taken over all primes p dividing d), there is some collection from a1, …, ad which both sums to d and can itself be ordered so that each element divides its successor in the order. Furthermore, we shall show that the condition on the prime divisors is in some sense also necessary.
In the basic model of communication complexity, Alice and Bob were all-powerful, but deterministic. This means that at any stage, when one of the players needs to communicate a bit, the value of this bit is a deterministic function of the player's input and the communication so far. In this chapter, we study what happens when Alice and Bob are allowed to act in a randomized fashion. That is, the players are also allowed to “toss coins” during the execution of the protocol and take into account the outcome of the coin tosses when deciding what messages to send. This implies that the communication on a given input (x, y) is not fixed anymore but instead becomes a random variable. Similarly, the output computed by a randomized protocol on input (x, y) is also a random variable. As a result, the success of a randomized protocol can be defined in several ways. The first possibility, which is more conservative (sometimes called Las-Vegas protocols), is to consider only protocols that always output the correct value f(x, y). The more liberal possibility is to allow protocols that may err, but for every input (x, y) are guaranteed to compute the correct value f(x, y) with high probability (sometimes called Monte-Carlo protocols). Similarly, the cost of a randomized protocol can also be defined in several ways. We can either analyze the worst-case behavior of the protocol or the average-case behavior.
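The standard example separating the deterministic and Monte-Carlo settings is the EQUALITY function: deterministically it requires n bits of communication, but with randomness O(log n) bits suffice via fingerprinting. A hedged Python sketch (names are ours; the range n² for the random prime is the textbook choice, giving error probability O(log n / n) on unequal inputs):

```python
import random

def is_prime(m):
    if m < 2:
        return False
    return all(m % d for d in range(2, int(m ** 0.5) + 1))

def equality_protocol(x, y, n, rng=random):
    """One-round Monte-Carlo protocol for EQUALITY on n-bit inputs x, y.
    Alice picks a random prime p <= n**2 and sends (p, x mod p): O(log n)
    bits.  The protocol always accepts when x == y; a false accept requires
    p to divide x - y, which only a small fraction of the primes in the
    range can do, since x - y < 2**n has fewer than n prime factors."""
    limit = max(n * n, 4)
    primes = [m for m in range(2, limit + 1) if is_prime(m)]
    p = rng.choice(primes)
    return x % p == y % p
```

In the Las-Vegas sense this protocol would not qualify, since it can err on unequal inputs; it is Monte-Carlo with one-sided error.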
This book surveys the mathematical field of communication complexity. Whereas the original motivation for studying this issue comes from computer systems and the operations they perform, the underlying issues can be neatly abstracted mathematically. This is the approach taken here.
Communication
The need for communication arises whenever two or more computers, components, systems, or humans (in general, “parties”) need to jointly perform a task that none of them can perform alone. This may arise, for example, due to the lack of resources of any single party or due to the lack of data available to any single party.
In many cases, the need for communication is explicit: When we search files on a remote computer it is clear that the requests and answers are actually communicated (via electrical wires, optical cables, radio signals, etc.). In other cases, the communication taking place is more implicit: When a single computer runs a program there is some communication between the different parts of the computer, for example, between the CPU and the memory, or even among different parts of the CPU. In yet other cases, there is no real communication going on but it is still a useful abstraction. For a problem whose solution relies on several pieces of data, we can imagine that these pieces of data need to communicate with each other in order to solve the problem; in reality, of course, this communication will be achieved by a processor accessing them all.
Complexity
The notion of complexity is becoming more and more central in many branches of science and in particular in the study of various types of computation.
In the standard two-party model the input (x, y) is partitioned in a fixed way. That is, Alice always gets x and Bob always gets y. In this chapter we discuss models in which the partition of the input among the players is not fixed. The main motivation for these models is that in many cases we wish to use communication complexity lower bounds to obtain lower bounds in other models of computation. This would typically require finding a communication complexity problem “hidden” somewhere in the computation that the model under consideration must perform. Because in such a model the input usually is not partitioned into two distinct sets x1, …, xn and y1, …, yn, such a partition must be given by the reduction. In some cases the partition can be figured out and fixed. In some other cases we must use arguments regarding any partition (of a certain kind). That is, we require a model where the partition is not fixed beforehand but the protocol determines the partition (independently of the particular input). Several such “variable partition models” are discussed in this chapter.
Throughout this chapter the input will be m Boolean variables x1, …, xm, and we consider functions f : {0, 1}^m → {0, 1}. We will talk about the communication complexity of f between two disjoint sets of variables S and T. That is, one player gets all the bits in S and the other gets all the bits in T.
Worst-Case Partition
The simplest variable partition model we may consider is the “worst-case” partition: split the input into two sets in the way that maximizes the communication complexity.
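As a toy illustration (ours, not the book's), for a small function one can brute-force every balanced partition and compare one-round deterministic costs, measured as log₂ of the number of distinct rows of the communication matrix. For f(z) = (z0 ∧ z1) ∨ (z2 ∧ z3), the partition that keeps each AND on one side costs 1 bit, while the worst-case partition, which splits both ANDs, costs 2:

```python
from itertools import combinations, product
from math import ceil, log2

def f(z):
    # toy function on four Boolean variables
    return (z[0] & z[1]) | (z[2] & z[3])

def one_way_cost(S, m=4):
    """Bits the player holding the variables in S must send in a one-round
    deterministic protocol: log2 of the number of distinct rows of the
    communication matrix induced by the partition (S, complement of S)."""
    T = [i for i in range(m) if i not in S]
    rows = set()
    for a in product((0, 1), repeat=len(S)):
        row = []
        for b in product((0, 1), repeat=len(T)):
            z = [0] * m
            for pos, i in enumerate(S):
                z[i] = a[pos]
            for pos, i in enumerate(T):
                z[i] = b[pos]
            row.append(f(z))
        rows.add(tuple(row))
    return ceil(log2(len(rows)))

costs = {S: one_way_cost(S) for S in combinations(range(4), 2)}
best = min(costs.values())   # most favourable balanced partition
worst = max(costs.values())  # the worst-case partition
```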