All of the exact matching methods in the first three chapters, as well as most of the methods that have yet to be discussed in this book, are examples of comparison-based methods. The main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods therefore have a very different flavor than the comparison-based approaches, even though one can sometimes see character comparisons hidden at the inner level of these “seminumerical” methods. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching; the use of the Fast Fourier Transform in string matching; and the random fingerprint method of Karp and Rabin.
The Shift-And method
R. Baeza-Yates and G. Gonnet [35] devised a simple, bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (for example, the length of a typical English word). They call this method the Shift-Or method, but it seems more natural to call it Shift-And. Recall that pattern P is of size n and the text T is of size m.
Definition Let M be an n by m + 1 binary valued array, with index i running from 1 to n and index j running from 0 to m.
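To make the definition concrete, here is a minimal Python sketch of the Shift-And computation, representing each column of M as a bit vector packed into an integer and assuming the standard recurrence that M(i, j) = 1 exactly when the first i characters of P match the i characters of T ending at position j. The function and variable names (shift_and, U, col) are our own illustrative choices, not from the text:

```python
def shift_and(P, T):
    """Shift-And exact matching (after Baeza-Yates & Gonnet).
    Returns the 1-based end positions of occurrences of P in T.
    A sketch intended for patterns short enough to fit in a word."""
    n = len(P)
    # U[x]: bit vector with bit i set (i = 1..n) iff P[i] == x
    U = {}
    for i, c in enumerate(P, start=1):
        U[c] = U.get(c, 0) | (1 << (i - 1))
    occurrences = []
    col = 0  # column M(*, 0): all zeros
    for j, c in enumerate(T, start=1):
        # shift the previous column down, set bit 1, AND with the mask
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & (1 << (n - 1)):   # bit n set: an occurrence ends at j
            occurrences.append(j)
    return occurrences
```

Each text character is processed with one shift, one OR, and one AND, which is why the method is so fast when the pattern fits in a single machine word.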
In this chapter we look in detail at alignment problems in the more complex contexts typical of string problems that currently arise in computational molecular biology. These more complex problems require techniques that extend (rather than refine) the core alignment methods.
Parametric sequence alignment
Introduction
When using sequence alignment methods to study DNA or amino acid sequences, there is often considerable disagreement about how to weight matches, mismatches, insertions and deletions (indels), and gaps. The most commonly used alignment software packages require the user to specify fixed values for those parameters, and it is widely observed that the biological significance of the resulting alignment can be greatly affected by the choice of parameter settings. The following relates to alignments of proteins from the globin family and is representative of frequently seen comments in the biological literature:
…one must be able to vary the gap and gap size penalties independently and in a query dependent fashion in order to obtain the maximal sensitivity of the search.
[81]
A similar comment appears in [432]:
Sequence alignment is sensitive to the choices of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these …
Finally, from [446],
One of the most prominent problems is the choice of parametric values, especially gap penalties. When very similar sequences are compared, the choice is not critical; but when the conservation is low, the resulting alignment is strongly affected.
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer–Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.
The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a preprocessing stage; thereafter, whenever a string S of length n is input, the algorithm searches for it in O(n) time using that suffix tree.
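The query bound can be illustrated with a short Python sketch. Note the hedge: the structure below is a naive suffix trie whose construction takes O(m^2) time, not the O(m) achieved by a true suffix tree, but queries against it already run in O(n) time, independent of m, exactly as described above. The names are ours:

```python
def build_suffix_trie(T):
    """Naive suffix trie of T: each node is a dict of children.
    Construction here is O(m^2); a real suffix tree achieves O(m)."""
    root = {}
    for i in range(len(T)):
        node = root
        for c in T[i:]:                 # insert suffix T[i:]
            node = node.setdefault(c, {})
    return root

def contains(trie, S):
    """True iff S occurs as a substring of the indexed text.
    Walks one edge per character of S: O(n), independent of m."""
    node = trie
    for c in S:
        if c not in node:
            return False
        node = node[c]
    return True
```

Because every substring of T is a prefix of some suffix of T, walking S from the root answers the substring question directly.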
String search, edit, and alignment tools have been extensively used in studies of molecular evolution. However, their use has primarily been aimed at comparing strings representing single genes or single proteins. For example, evolutionary studies have usually selected a single protein and have examined how the amino acid sequence for that protein differs in different species. Accordingly, string edit and alignment algorithms have been guided by objective functions that model the most common types of mutations occurring at the level of a single gene or protein: point mutations or amino acid substitutions, single character insertions and deletions, and block insertions and deletions (gaps).
Recently, attention has been given to mutations that occur on a scale much larger than the single gene. These mutations occur at the chromosome or at the genome level and are central in the evolution of the whole genome. These larger-scale mutations have features that can be quite different from gene- or protein-level mutations. With more genome-level molecular data becoming available, larger-scale string comparisons may give insights into evolution that are not seen at the single gene or protein level.
The guiding force behind genome evolution is “duplication with modification” [126, 128, 301, 468]. That is, parts of the genome are duplicated, possibly very far away from the original site, and then modified. Other genome-level mutations of importance include inversions, where a segment of DNA is reversed; translocations, where the ends of two chromosomes (telomeres) are exchanged; and transpositions, where two adjacent segments of DNA exchange places.
Almost all discussions of exact matching begin with the naive method, and we follow this tradition. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted, in which case an occurrence of P is reported. In either case, P is then shifted one place to the right, and the comparisons are restarted from the left end of P. This process repeats until the right end of P shifts past the right end of T.
Using n to denote the length of P and m to denote the length of T, the worst-case number of comparisons made by this method is Θ(nm). In particular, if both P and T consist of the same repeated character, then there is an occurrence of P at each of the first m − n + 1 positions of T and the method performs exactly n(m − n + 1) comparisons. For example, if P = aaa and T = aaaaaaaaaa then n = 3, m = 10, and 24 comparisons are made.
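The comparison count above can be checked directly with a short Python version of the naive method (the function name and the returned comparison counter are our additions for illustration):

```python
def naive_match(P, T):
    """Naive exact matching: align P at each position of T and
    compare left to right. Returns the 1-based occurrence start
    positions and the total number of character comparisons."""
    n, m = len(P), len(T)
    occurrences, comparisons = [], 0
    for s in range(m - n + 1):         # left end of P at T position s+1
        i = 0
        while i < n:
            comparisons += 1
            if P[i] != T[s + i]:       # mismatch: shift P one place
                break
            i += 1
        if i == n:                     # P exhausted: an occurrence
            occurrences.append(s + 1)
    return occurrences, comparisons
```

Running it on P = aaa and T = aaaaaaaaaa reproduces the n(m − n + 1) = 24 comparisons claimed in the text.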
The naive method is certainly simple to understand and program, but its worst-case running time of Θ(nm) may be unsatisfactory and can be improved. Even the practical running time of the naive method may be too slow for larger texts and patterns.
In this book I have tried to present fundamental ideas, algorithms, and techniques that have a wide range of application and that will likely remain important even as the present-day interests change. I have also tried to explain the fundamental reasons why computations on strings and sequences are productive in biology and will remain important even as the specific applications change. But with only 500 pages (a mere 285,639 words formed from 1,784,996 characters), there are certain algorithmic methods and certain present and anticipated applications that I could not cover.
Additional techniques
For additional pure computer science results on exact matching, the reader is referred to Text Algorithms by M. Crochemore and W. Rytter [117]. That book goes more deeply into several pure computer science issues, such as periodicities in strings and parallel algorithms. For a survey of many string searching algorithms and inexact matching methods, see String Searching Algorithms by G. Stephen [421]. For additional topics in computational molecular biology, particularly probabilistic and statistical questions about strings and sequences, see An Introduction to Computational Biology by M. Waterman [461]. For another introduction to combinatorial and string problems in computational molecular biology, see Introduction to Computational Molecular Biology, by J. Setubal and J. Meidanis [402]. For topics in computational molecular biology more focused on issues of protein structure, see the chapter Computational Molecular Biology by A. Lesk in [297].
A Boyer–Moore variant with a “simple” linear time bound
Apostolico and Giancarlo [26] suggested a variant of the Boyer–Moore algorithm that allows a fairly simple proof of linear worst-case running time. With this variant, no character of T will ever be compared after it is first matched with any character of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch; there can only be m mismatches since each one results in a nonzero shift of P; and there can only be m matches since no character of T is compared again after it matches a character of P. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m.
Given the history of very difficult and partial analyses of the Boyer–Moore algorithm, it is quite amazing that a close variant of the algorithm allows a simple linear time bound. We present here a further improvement of the Apostolico–Giancarlo idea, resulting in an algorithm that simulates exactly the shifts of the Boyer–Moore algorithm. The method therefore has all the rapid shifting advantages of the Boyer–Moore method as well as a simple linear worst-case time analysis.
Key ideas
Our version of the Apostolico–Giancarlo algorithm simulates the Boyer–Moore algorithm, finding exactly the same mismatches that Boyer–Moore would find and making exactly the same shifts.
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, approximate matching, and sequence alignment. “Approximate” means that some errors, of various types detailed later, are acceptable in valid matches. “Alignment” will be given a precise meaning later, but generally means lining up characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces made in opposing strings.
We also shift from problems primarily concerning substrings to problems concerning subsequences. A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be. For example, the string xyz is a subsequence, but not a substring, in axayaz. The shift from substrings to subsequences is a natural corollary of the shift from exact to inexact matching. This shift of focus to inexact matching and subsequence comparison is accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
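The substring/subsequence distinction can be checked with a few lines of Python (the helper name is ours); Python's own `in` operator tests for substrings, so only the subsequence test needs writing:

```python
def is_subsequence(s, t):
    """True iff s is a subsequence of t: the characters of s appear
    in t in order, but not necessarily contiguously."""
    it = iter(t)
    # each membership test consumes it up to and including the match,
    # so later characters of s must appear later in t
    return all(c in it for c in s)
```

For the example in the text, xyz is a subsequence of axayaz but not a substring of it.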
Much of computational biology concerns sequence alignments
The area of approximate matching and sequence comparison is central in computational molecular biology both because of the presence of errors in molecular data and because of active mutational processes that (sub)sequence comparison methods seek to model and reveal.
In this chapter we look at a number of important refinements that have been developed for certain core string edit and alignment problems. These refinements either speed up a dynamic programming solution, reduce its space requirements, or extend its utility.
Computing alignments in only linear space
One of the defects of dynamic programming for all the problems we have discussed is that the dynamic programming tables use Θ(nm) space when the input strings have length n and m. (When we talk about the space used by a method, we refer to the maximum space ever in use simultaneously. Reused space does not add to the count of space use.) It is quite common that the limiting resource in string alignment problems is not time but space. That limit makes it difficult to handle large strings, no matter how long we may be willing to wait for the computation to finish. Therefore, it is very valuable to have methods that reduce the use of space without dramatically increasing the time requirements.
Hirschberg [224] developed an elegant and practical space-reduction method that works for many dynamic programming problems. For several string alignment problems, this method reduces the required space from Θ(nm) to O(n) (for n < m) while only doubling the worst-case time bound. Miller and Myers expanded on the idea and brought it to the attention of the computational biology community [344].
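The first ingredient of the space reduction can be sketched in Python: the optimal score alone is computable while keeping only two rows of the dynamic programming table, shown here for edit distance. Recovering an actual alignment, rather than just its score, within the same space bound is the harder part, and is what Hirschberg's divide-and-conquer refinement supplies:

```python
def edit_distance_linear_space(S1, S2):
    """Edit distance computed with only two DP rows:
    O(min(n, m)) space instead of Theta(nm)."""
    if len(S1) < len(S2):
        S1, S2 = S2, S1                 # keep rows as short as possible
    prev = list(range(len(S2) + 1))     # row 0: distance from empty prefix
    for i, a in enumerate(S1, start=1):
        curr = [i]                      # column 0: delete all of S1[:i]
        for j, b in enumerate(S2, start=1):
            curr.append(min(prev[j] + 1,                  # delete a
                            curr[j - 1] + 1,              # insert b
                            prev[j - 1] + (a != b)))      # match/substitute
        prev = curr
    return prev[-1]
```

Only the previous row is ever consulted, so the rest of the table can be discarded as the computation proceeds; this is the reuse of space referred to above.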
We now begin the discussion of an amazing result that greatly extends the usefulness of suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v. With this definition a node is an ancestor of itself. A proper ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5 while the lca of 6 and 3 is 1.
The amazing result is that after a linear amount of preprocessing of a rooted tree, any two nodes can then be specified and their lowest common ancestor found in constant time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter any lowest common ancestor query takes only constant time to solve, independent of n. Without preprocessing, the best worst-case time bound for a single query is Θ(n), so this is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan [214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on the later approach.
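For contrast with the constant-time result, here is the straightforward Θ(n)-per-query approach, sketched in Python with the tree encoded as a parent map (the encoding and the example tree in the test are our own, not Figure 8.1):

```python
def lca(parent, x, y):
    """Lowest common ancestor by walking ancestor paths: O(n) per
    query with no preprocessing, which is the bound that the
    Harel-Tarjan / Schieber-Vishkin method improves to O(1).
    `parent` maps each node to its parent; the root maps to None."""
    ancestors = set()
    while x is not None:           # collect all ancestors of x (x included)
        ancestors.add(x)
        x = parent[x]
    while y not in ancestors:      # first ancestor of y that is also above x
        y = parent[y]
    return y
```

The constant-time method replaces both walks with O(1) arithmetic on precomputed node labels, after only O(n) preprocessing of the tree.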
This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty. Some of the original preprocessing methods are quite difficult. This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer–Moore method over the Knuth-Morris-Pratt method, since Boyer–Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer–Moore does. These two topics will be described in this chapter and the next.
The Boyer–Moore Algorithm
As in the naive algorithm, the Boyer–Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. Further, after the check is complete, P is shifted right relative to T just as in the naive algorithm. However, the Boyer–Moore algorithm contains three clever ideas not contained in the naive algorithm: the right-to-left scan, the bad character shift rule, and the good suffix shift rule.
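As a hedged sketch of the right-to-left scan combined with a simplified bad character rule, here is the Horspool variant in Python; the full Boyer–Moore algorithm developed below adds the extended bad character and good suffix rules, so this is an illustration of the scanning idea, not the complete method:

```python
def horspool(P, T):
    """Right-to-left scanning with a simplified bad character rule
    (the Horspool variant of Boyer-Moore). Returns the 1-based
    start positions of occurrences of P in T."""
    n, m = len(P), len(T)
    # shift[c]: distance from the rightmost occurrence of c in
    # P[0..n-2] to the right end of P; characters absent there shift n
    shift = {c: n - 1 - i for i, c in enumerate(P[:-1])}
    occurrences, s = [], 0
    while s <= m - n:
        i = n - 1
        while i >= 0 and P[i] == T[s + i]:   # scan right to left
            i -= 1
        if i < 0:
            occurrences.append(s + 1)
        # shift by the text character under the right end of P
        s += shift.get(T[s + n - 1], n)
    return occurrences
```

Even this simplified rule often shifts P by several positions at once, which is the source of Boyer–Moore's practical speed.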
In this chapter we consider the inexact matching and alignment problems that form the core of the field of inexact matching and others that illustrate the most general techniques. Some of those problems and techniques will be further refined and extended in the next chapters. We start with a detailed examination of the most classic inexact matching problem solved by dynamic programming, the edit distance problem. The motivation for inexact matching (and, more generally, sequence comparison) in molecular biology will be a recurring theme explored throughout the rest of the book. We will discuss many specific examples of how string comparison and inexact matching are used in current molecular biology. However, to begin, we concentrate on the purely formal and technical aspects of defining and computing inexact matching.
The edit distance between two strings
Frequently, one wants a measure of the difference or distance between two strings (for example, in evolutionary, structural, or functional studies of biological strings; in textual database retrieval; or in spelling correction methods). There are several ways to formalize the notion of distance between strings. One common, and simple, formalization [389, 299], called edit distance, focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters. The permitted edit operations are insertion of a character into the first string, the deletion of a character from the first string, or the substitution (or replacement) of a character in the first string with a character in the second string.
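The standard dynamic program over these three edit operations can be sketched in Python. This fills the full table and returns only the distance; the traceback needed to recover an actual sequence of edit operations is omitted here:

```python
def edit_distance(S1, S2):
    """Classic DP for edit distance: D[i][j] is the minimum number
    of insertions, deletions, and substitutions transforming the
    first i characters of S1 into the first j characters of S2."""
    n, m = len(S1), len(S2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                # delete all of S1[:i]
    for j in range(m + 1):
        D[0][j] = j                # insert all of S2[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                       # deletion
                          D[i][j - 1] + 1,                       # insertion
                          D[i - 1][j - 1] + (S1[i-1] != S2[j-1]))  # substitution
    return D[n][m]
```

The table has (n + 1)(m + 1) entries and each is filled in constant time, giving the Θ(nm) time and space bounds discussed elsewhere in the book.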
Sequence comparison, particularly when combined with the systematic collection, curation, and search of databases containing biomolecular sequences, has become essential in modern molecular biology. Commenting on the (then) near-completion of the effort to sequence the entire yeast genome (now finished), Stephen Oliver says
In a short time it will be hard to realize how we managed without the sequence data. Biology will never be the same again. [478]
One fact explains the importance of molecular sequence data and sequence comparison in biology.
The first fact of biological sequence analysis
The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
Evolution reuses, builds on, duplicates, and modifies “successful” structures (proteins, exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.). Life is based on a repertoire of structured and interrelated molecular building blocks that are shared and passed around. The same and related molecular structures and mechanisms show up repeatedly in the genome of a single species and across a very wide spectrum of divergent species. “Duplication with modification” [127, 128, 129, 130] is the central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones. Doolittle emphasizes this point as follows:
The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications.
We will see many applications of suffix trees throughout the book. Most of these applications allow surprisingly efficient, linear-time solutions to complex string problems. Some of the most impressive applications need an additional tool, the constant-time lowest common ancestor algorithm, and so are deferred until that algorithm has been discussed (in Chapter 8). Other applications arise in the context of specific problems that will be discussed in detail later. But there are many applications we can now discuss that illustrate the power and utility of suffix trees. In this chapter and in the exercises at its end, several of these applications will be explored.
Perhaps the best way to appreciate the power of suffix trees is for the reader to spend some time trying to solve the problems discussed below, without using suffix trees. Without this effort or without some historical perspective, the availability of suffix trees may make certain of the problems appear trivial, even though linear-time algorithms for those problems were unknown before the advent of suffix trees. The longest common substring problem discussed in Section 7.4 is one clear example, where Knuth had conjectured that a linear-time algorithm would not be possible [24, 278], but where such an algorithm is immediate with the use of suffix trees. Another classic example is the longest prefix repeat problem discussed in the exercises, where a linear-time solution using suffix trees is easy, but where the best prior method ran in O(n log n) time.
With the ability to solve lowest common ancestor queries in constant time, suffix trees can be used to solve many additional string problems. Many of those applications move from the domain of exact matching to the domain of inexact, or approximate, matching (matching with some errors permitted). This chapter illustrates that point with several examples.
Longest common extension: a bridge to inexact matching
The longest common extension problem is solved as a subtask in many classic string algorithms. It is at the heart of all but the last application discussed in this chapter and is central to the k-difference algorithm discussed in Section 12.2.
Longest common extension problem Two strings S1 and S2 of total length n are first specified in a preprocessing phase. Later, a long sequence of index pairs is specified. For each specified index pair (i, j), we must find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, we must find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2 (see Figure 9.1).
Of course, any time an index pair is specified, the longest common extension can be found by direct search in time proportional to the length of the match. But the goal is to compute each extension in constant time, independent of the length of the match.
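The direct-search baseline mentioned above is only a few lines of Python (the function name and 1-based index convention are ours); the point of this chapter's machinery is to replace this loop with a single constant-time query using a suffix tree and lowest common ancestor lookups:

```python
def longest_common_extension(S1, S2, i, j):
    """Direct-search longest common extension: the length of the
    longest prefix of suffix i of S1 (1-based) matching a prefix
    of suffix j of S2. Runs in time proportional to the match
    length, whereas the suffix-tree + lca solution takes O(1)."""
    i -= 1
    j -= 1                              # convert to 0-based offsets
    length = 0
    while (i + length < len(S1) and j + length < len(S2)
           and S1[i + length] == S2[j + length]):
        length += 1
    return length
```

Over a long sequence of queries the matched lengths can sum to far more than n, which is why the constant-time version matters.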
In Section 15.11.3, we discussed the canonical advice of translating any newly sequenced gene into a derived amino acid sequence to search the protein databases for similarities to the new sequence. This is in contrast to searching DNA databases with the original DNA string. There is, however, a technical problem with using derived amino acid sequences. If a single nucleotide is missing from the DNA transcript, then the reading frame of the succeeding DNA will be changed (see Figure 18.1). A similar problem occurs if a nucleotide is incorrectly inserted into the transcript. Until the correct reading frame is reestablished (through additional errors), most of the translated amino acids will be incorrect, invalidating most comparisons made to the derived amino acid sequence.
Insertion and deletion errors during DNA sequencing are fairly common, so frameshift errors can be serious in the subsequent analysis. Those errors are in addition to any substitution errors that leave the reading frame unchanged. Moreover, informative alignments often contain a relatively small number of exactly matching characters and larger regions of more poorly aligned substrings (see Section 11.7 on local alignment). Therefore, two substrings that would align well without a frameshift error but would align poorly with one can easily be mistaken for regions that align poorly due only to substitution errors. Therefore, without some additional technique, it is easy to miss frameshift errors and hard to correct them.
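The effect of a frameshift can be seen by splitting a DNA string into codons before and after deleting a single nucleotide. The sequence below is an arbitrary illustrative example, not real data, and the helper simply groups nucleotide triples rather than translating them:

```python
def codons(dna, frame=0):
    """Split a DNA string into codons (nucleotide triples) starting
    at the given frame offset; a trailing partial codon is dropped."""
    return [dna[k:k + 3] for k in range(frame, len(dna) - 2, 3)]

# Deleting one nucleotide shifts the reading frame of everything
# downstream, so nearly every subsequent codon (and hence nearly
# every translated amino acid) changes:
original = "ATGGCCTTTGGA"
mutated = original[:4] + original[5:]   # delete the 5th nucleotide
```

Here codons(original) begins ATG, GCC, TTT, ... while codons(mutated) begins ATG, GCT, TTG, ...: the first codon is unaffected, but every codon after the deletion point differs, which is precisely why a single sequencing error invalidates most of the derived amino acid sequence.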
We now turn to the dominant, most mature, and most successful application of string algorithms in computational biology: the building and searching of databases holding molecular sequence data. We start by illustrating some uses of sequence databases and by describing a bit about existing databases. The impact of database searching (and expectations of greater future impact) explains a large part of the interest among biologists in algorithms that search, manipulate, and compare strings. In turn, the biologists' activities have stimulated additional interest in string algorithms among computer scientists.
After describing the “why” and the “what” of sequence databases, we will discuss some “how” issues in string algorithms that are particular to database organization and search.
Why sequence databases?
Comprehensive databases/archives holding DNA and protein sequences are firmly established as central tools in current molecular biology – “electronic databases are fast becoming the lifeblood of the field” [452]. The fundamental reason is the power of biomolecular sequence comparison. This was made explicit in the first fact of biological sequence analysis (Chapter 10, page 212). Given the effectiveness of sequence comparison in molecular biology, it is natural to stockpile and systematically organize the biosequences to be compared; this has naturally led to the growth of sequence databases. We start this chapter with a few illustrations of the power of sequence comparison in the form of sequence database search.
The dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by a splitting of one population into two (or more) populations that do not cross-breed, rather than by a mixing of two populations into one. Therefore, the high level history of life is ideally organized and displayed as a rooted, directed tree. The extant species (and some of the extinct species) are represented at the leaves of the tree, each internal node represents a point when the history of two sets of species diverged (or represents a common ancestor of those species), the length and direction of each edge represents the passage of time or the evolutionary events that occur in that time, and so the path from the root of the tree to each leaf represents the evolutionary history of the organisms represented there. To quote Darwin
… the great Tree of Life fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications.
[119]
This view of the history of life as a tree must frequently be modified when considering the evolution of viruses, or even bacteria or individual genes, but it remains the dominant way that high-level evolution is viewed in current biology. Hundreds (maybe thousands) of papers are published yearly that depict deduced evolutionary trees.