We explore the ‘Hausdorff dimension at infinity’ for self-affine carpets defined on the square lattice. This notion of dimension (due to Barlow and Taylor), which is the correct notion from a probabilistic perspective, differs for these sets from more ‘naive’ indices of fractal dimension.
Certain convergent search algorithms can be turned into chaotic dynamic systems by renormalisation back to a standard region at each iteration. This allows the machinery of ergodic theory to be used for a new probabilistic analysis of their behaviour. Rates of convergence can be redefined in terms of various entropies and ergodic characteristics (Kolmogorov and Rényi entropies and Lyapunov exponent). A special class of line-search algorithms, which contains the Golden-Section algorithm, is studied in detail. Their associated dynamic systems exhibit a Markov partition property, from which invariant measures and ergodic characteristics can be computed. A case is made that the Rényi entropy is the most appropriate convergence criterion in this environment.
Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.
For example, if P = aba and T = bbabaxababay then P occurs in T starting at locations 3, 7, and 9. Note that two occurrences of P may overlap, as illustrated by the occurrences of P at locations 7 and 9.
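The example can be checked directly with a short brute-force scan (the function name and the 1-based position convention are mine):

```python
def occurrences(P, T):
    """Return every 1-based start position of P in T, overlaps included."""
    return [i + 1 for i in range(len(T) - len(P) + 1) if T[i:i + len(P)] == P]

print(occurrences("aba", "bbabaxababay"))  # → [3, 7, 9]
```

Note that the alignment at position 5 fails (T[5..7] = axa), while the overlapping occurrences at 7 and 9 are both reported.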
Importance of the exact matching problem
The practical importance of the exact matching problem should be obvious to anyone who uses a computer. The problem arises in widely varying applications, too numerous to even list completely. Some of the more common applications are in word processors; in utilities such as grep on Unix; in textual information retrieval programs such as Medline, Lexis, or Nexis; in library catalog searching programs that have replaced physical card catalogs in most large libraries; in internet browsers and crawlers, which sift through massive amounts of text available on the internet for material containing specific keywords; in internet news readers that can search the articles for topics of interest; in the giant digital libraries that are being planned for the near future; in electronic journals that are already being “published” on-line; in telephone directory assistance; in on-line encyclopedias and other educational CD-ROM applications; in on-line dictionaries and thesauri, especially those with cross-referencing features (the Oxford English Dictionary project has created an electronic on-line version of the OED containing 50 million words); and in numerous specialized databases.
All of the exact matching methods in the first three chapters, as well as most of the methods that have yet to be discussed in this book, are examples of comparison-based methods. The main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods therefore have a very different flavor than the comparison-based approaches, even though one can sometimes see character comparisons hidden at the inner level of these “seminumerical” methods. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching; the use of the Fast Fourier Transform in string matching; and the random fingerprint method of Karp and Rabin.
The Shift-And method
R. Baeza-Yates and G. Gonnet [35] devised a simple, bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (the length of a typical English word for example). They call this method the Shift-Or method, but it seems more natural to call it Shift-And. Recall that pattern P is of size n and the text T is of size m.
Definition Let M be an n by m + 1 binary valued array, with index i running from 1 to n and index j running from 0 to m; the extra column, j = 0, serves only to initialize the computation.
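A sketch of the Shift-And computation, packing each column of M into the bits of one integer (the names and the end-position reporting convention are mine; the method assumes n fits in a machine word, although Python's unbounded integers remove even that restriction):

```python
def shift_and(P, T):
    """Shift-And exact matching: return 1-based END positions of matches."""
    n = len(P)
    # B[c] has bit i set iff P[i] == c (a precomputed mask per character)
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    D, hits = 0, []          # D encodes column j of M; D = 0 is column j = 0
    for j, c in enumerate(T, start=1):
        # shift in a 1 (a fresh prefix may start here), then keep only the
        # rows whose pattern character matches the current text character
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (n - 1)):
            hits.append(j)   # full pattern matched, ending at position j
    return hits
```

For P = aba and T = bbabaxababay this reports end positions 5, 9, and 11, i.e. the start positions 3, 7, and 9 found earlier.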
In this chapter we look in detail at alignment problems in the more complex contexts typical of string problems that currently arise in computational molecular biology. These more complex problems require techniques that extend (rather than refine) the core alignment methods.
Parametric sequence alignment
Introduction
When using sequence alignment methods to study DNA or amino acid sequences, there is often considerable disagreement about how to weight matches, mismatches, insertions and deletions (indels), and gaps. The most commonly used alignment software packages require the user to specify fixed values for those parameters, and it is widely observed that the biological significance of the resulting alignment can be greatly affected by the choice of parameter settings. The following relates to alignments of proteins from the globin family and is representative of frequently seen comments in the biological literature:
…one must be able to vary the gap and gap size penalties independently and in a query dependent fashion in order to obtain the maximal sensitivity of the search.
[81]
A similar comment appears in [432]:
Sequence alignment is sensitive to the choices of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these …
Finally, from [446],
One of the most prominent problems is the choice of parametric values, especially gap penalties. When very similar sequences are compared, the choice is not critical; but when the conservation is low, the resulting alignment is strongly affected.
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer–Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.
The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a preprocessing stage; thereafter, whenever a string S of length n is input, the algorithm searches for it in O(n) time using that suffix tree.
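A toy sketch of the query interface, using an uncompressed suffix trie rather than a true suffix tree (the class and method names are mine). This naive structure takes O(m^2) time and space to build; a real suffix tree compresses unary paths and is built in O(m) time, e.g. by Ukkonen's algorithm, but the O(n) substring query looks the same:

```python
class SuffixTrie:
    """Uncompressed suffix trie: O(m^2) build, O(n) substring queries.
    Illustrates only the query interface of a suffix tree."""

    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):      # insert every suffix text[i:]
            node = self.root
            for c in text[i:]:
                node = node.setdefault(c, {})

    def contains(self, s):
        """Is s a substring of the text? O(|s|), independent of |text|."""
        node = self.root
        for c in s:
            if c not in node:
                return False
            node = node[c]
        return True
```

Since every substring of T is a prefix of some suffix of T, walking s down from the root answers the query in time proportional to |s| alone.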
String search, edit, and alignment tools have been extensively used in studies of molecular evolution. However, their use has primarily been aimed at comparing strings representing single genes or single proteins. For example, evolutionary studies have usually selected a single protein and have examined how the amino acid sequence for that protein differs in different species. Accordingly, string edit and alignment algorithms have been guided by objective functions that model the most common types of mutations occurring at the level of a single gene or protein: point mutations or amino acid substitutions, single character insertions and deletions, and block insertions and deletions (gaps).
Recently, attention has been given to mutations that occur on a scale much larger than the single gene. These mutations occur at the chromosome or at the genome level and are central in the evolution of the whole genome. These larger-scale mutations have features that can be quite different from gene- or protein-level mutations. With more genome-level molecular data becoming available, larger-scale string comparisons may give insights into evolution that are not seen at the single gene or protein level.
The guiding force behind genome evolution is “duplication with modification” [126, 128, 301, 468]. That is, parts of the genome are duplicated, possibly very far away from the original site, and then modified. Other genome-level mutations of importance include inversions, where a segment of DNA is reversed; translocations, where the ends of two chromosomes (telomeres) are exchanged; and transpositions, where two adjacent segments of DNA exchange places.
Almost all discussions of exact matching begin with the naive method, and we follow this tradition. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted, in which case an occurrence of P is reported. In either case, P is then shifted one place to the right, and the comparisons are restarted from the left end of P. This process repeats until the right end of P shifts past the right end of T.
Using n to denote the length of P and m to denote the length of T, the worst-case number of comparisons made by this method is Θ(nm). In particular, if both P and T consist of the same repeated character, then there is an occurrence of P at each of the first m − n + 1 positions of T and the method performs exactly n(m − n + 1) comparisons. For example, if P = aaa and T = aaaaaaaaaa then n = 3, m = 10, and 24 comparisons are made.
The naive method is certainly simple to understand and program, but its worst-case running time of Θ(nm) may be unsatisfactory and can be improved. Even the practical running time of the naive method may be too slow for larger texts and patterns.
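The naive method and its worst-case count can be sketched as follows (the function name and the returned comparison counter are mine):

```python
def naive_count(P, T):
    """Naive left-to-right matching.
    Returns (1-based occurrence positions, number of character comparisons)."""
    n, m = len(P), len(T)
    occ, comps = [], 0
    for s in range(m - n + 1):      # each left-to-right alignment of P with T
        k = 0
        while k < n:
            comps += 1
            if P[k] != T[s + k]:
                break               # mismatch: shift P one place right
            k += 1
        if k == n:
            occ.append(s + 1)       # P exhausted: report an occurrence
    return occ, comps
```

On P = aaa and T = aaaaaaaaaa this performs exactly n(m − n + 1) = 3 · 8 = 24 comparisons and reports occurrences at the first eight positions, matching the worst-case analysis above.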
In this book I have tried to present fundamental ideas, algorithms, and techniques that have a wide range of application and that will likely remain important even as the present-day interests change. I have also tried to explain the fundamental reasons why computations on strings and sequences are productive in biology and will remain important even as the specific applications change. But with only 500 pages (a mere 285,639 words formed from 1,784,996 characters), there are certain algorithmic methods and certain present and anticipated applications that I could not cover.
Additional techniques
For additional pure computer science results on exact matching, the reader is referred to Text Algorithms by M. Crochemore and W. Rytter [117]. That book goes more deeply into several pure computer science issues, such as periodicities in strings and parallel algorithms. For a survey of many string searching algorithms and inexact matching methods, see String Searching Algorithms by G. Stephen [421]. For additional topics in computational molecular biology, particularly probabilistic and statistical questions about strings and sequences, see An Introduction to Computational Biology by M. Waterman [461]. For another introduction to combinatorial and string problems in computational molecular biology, see Introduction to Computational Molecular Biology, by J. Setubal and J. Meidanis [402]. For topics in computational molecular biology more focused on issues of protein structure, see the chapter Computational Molecular Biology by A. Lesk in [297].
A Boyer–Moore variant with a “simple” linear time bound
Apostolico and Giancarlo [26] suggested a variant of the Boyer–Moore algorithm that allows a fairly simple proof of linear worst-case running time. With this variant, no character of T will ever be compared after it is first matched with any character of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch; there can only be m mismatches since each one results in a nonzero shift of P; and there can only be m matches since no character of T is compared again after it matches a character of P. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m.
Given the history of very difficult and partial analyses of the Boyer–Moore algorithm, it is quite amazing that a close variant of the algorithm allows a simple linear time bound. We present here a further improvement of the Apostolico–Giancarlo idea, resulting in an algorithm that simulates exactly the shifts of the Boyer–Moore algorithm. The method therefore has all the rapid shifting advantages of the Boyer–Moore method as well as a simple linear worst-case time analysis.
Key ideas
Our version of the Apostolico–Giancarlo algorithm simulates the Boyer–Moore algorithm, finding exactly the same mismatches that Boyer–Moore would find and making exactly the same shifts.
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact (approximate) matching and sequence alignment. “Approximate” means that some errors, of various types detailed later, are acceptable in valid matches. “Alignment” will be given a precise meaning later, but generally means lining up characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces made in opposing strings.
We also shift from problems primarily concerning substrings to problems concerning subsequences. A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be. For example, the string xyz is a subsequence, but not a substring, in axayaz. The shift from substrings to subsequences is a natural corollary of the shift from exact to inexact matching. This shift of focus to inexact matching and subsequence comparison is accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
Much of computational biology concerns sequence alignments
The area of approximate matching and sequence comparison is central in computational molecular biology both because of the presence of errors in molecular data and because of active mutational processes that (sub)sequence comparison methods seek to model and reveal.
In this chapter we look at a number of important refinements that have been developed for certain core string edit and alignment problems. These refinements either speed up a dynamic programming solution, reduce its space requirements, or extend its utility.
Computing alignments in only linear space
One of the defects of dynamic programming for all the problems we have discussed is that the dynamic programming tables use Θ(nm) space when the input strings have length n and m. (When we talk about the space used by a method, we refer to the maximum space ever in use simultaneously. Reused space does not add to the count of space use.) It is quite common that the limiting resource in string alignment problems is not time but space. That limit makes it difficult to handle large strings, no matter how long we may be willing to wait for the computation to finish. Therefore, it is very valuable to have methods that reduce the use of space without dramatically increasing the time requirements.
Hirschberg [224] developed an elegant and practical space-reduction method that works for many dynamic programming problems. For several string alignment problems, this method reduces the required space from Θ(nm) to O(n) (for n < m) while only doubling the worst-case time bound. Miller and Myers expanded on the idea and brought it to the attention of the computational biology community [344].
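The core of the space reduction is the observation that each row of the dynamic programming table depends only on the previous row, so the optimal score can be computed in two rows. A sketch of that score-only scan (names are mine; Hirschberg's divide-and-conquer then reuses it to recover an optimal alignment, not just its score, in linear space at roughly twice the work):

```python
def edit_distance(A, B):
    """Edit distance in O(len(A)) space rather than Theta(nm)."""
    n = len(A)
    prev = list(range(n + 1))            # row for the empty prefix of B
    for j, b in enumerate(B, 1):
        curr = [j] + [0] * n
        for i, a in enumerate(A, 1):
            curr[i] = min(prev[i] + 1,               # delete A[i]
                          curr[i - 1] + 1,           # insert b
                          prev[i - 1] + (a != b))    # match or substitute
        prev = curr                      # discard the older row: 2 rows live
    return prev[n]
```

Only two rows are ever in use simultaneously, which is exactly the "maximum space ever in use" accounting described above.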
We now begin the discussion of an amazing result that greatly extends the usefulness of suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v. With this definition a node is an ancestor of itself. A proper ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5 while the lca of 6 and 3 is 1.
The amazing result is that after a linear amount of preprocessing of a rooted tree, any two nodes can then be specified and their lowest common ancestor found in constant time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter any lowest common ancestor query takes only constant time to solve, independent of n. Without preprocessing, the best worst-case time bound for a single query is Θ(n), so this is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan [214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on the latter approach.
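As a sketch of the preprocess-then-query interface (the class, the example tree, and the node labels are mine), here is the simpler Euler-tour scheme: O(n log n) preprocessing with O(1) queries via a sparse table for range-minimum queries. The Harel–Tarjan and Schieber–Vishkin results sharpen the preprocessing to O(n), but the query contract is the same:

```python
class LCA:
    """O(1) lca queries after O(n log n) preprocessing (Euler tour + RMQ)."""

    def __init__(self, children, root):
        # children: {node: [child, ...]}; an Euler tour records each node
        # every time the walk visits it, together with its depth
        self.euler, self.depth, self.first = [], [], {}

        def tour(u, d):
            self.first.setdefault(u, len(self.euler))
            self.euler.append(u); self.depth.append(d)
            for v in children.get(u, []):
                tour(v, d + 1)
                self.euler.append(u); self.depth.append(d)

        tour(root, 0)
        # sparse table: table[k][i] = index of min depth in euler[i : i + 2**k]
        n = len(self.euler)
        self.table = [list(range(n))]
        k = 1
        while (1 << k) <= n:
            prev, row = self.table[-1], []
            for i in range(n - (1 << k) + 1):
                a, b = prev[i], prev[i + (1 << (k - 1))]
                row.append(a if self.depth[a] <= self.depth[b] else b)
            self.table.append(row)
            k += 1

    def query(self, x, y):
        """lca(x, y): shallowest node between their first tour occurrences."""
        i, j = sorted((self.first[x], self.first[y]))
        k = (j - i + 1).bit_length() - 1
        a = self.table[k][i]
        b = self.table[k][j - (1 << k) + 1]
        return self.euler[a if self.depth[a] <= self.depth[b] else b]
```

For instance, with the (hypothetical) tree {1: [2, 5], 2: [3, 4], 5: [6, 7]} rooted at 1, the lca of 6 and 7 is 5, and the lca of 4 and 6 is the root 1.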
This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty. Some of the original preprocessing methods are quite difficult. This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer–Moore method over the Knuth-Morris-Pratt method, since Boyer–Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer–Moore does. These two topics will be described in this chapter and the next.
The Boyer–Moore Algorithm
As in the naive algorithm, the Boyer–Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. Further, after the check is complete, P is shifted right relative to T just as in the naive algorithm. However, the Boyer–Moore algorithm contains three clever ideas not contained in the naive algorithm: the right-to-left scan, the bad character shift rule, and the good suffix shift rule.
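The first two ideas can be sketched in isolation (names are mine): a right-to-left scan combined with the simplified bad character rule, which shifts P so that the rightmost occurrence in P of the mismatched text character lines up under it. The full Boyer–Moore algorithm adds the good suffix rule on top of this:

```python
def bm_bad_char(P, T):
    """Right-to-left scan with the simplified bad character rule only.
    Returns 1-based occurrence positions."""
    n, m = len(P), len(T)
    # rightmost 1-based position in P of each character; 0 if absent
    right = {c: i + 1 for i, c in enumerate(P)}
    occ, s = [], 0                   # s = offset of P's left end in T
    while s <= m - n:
        k = n - 1                    # compare right to left
        while k >= 0 and P[k] == T[s + k]:
            k -= 1
        if k < 0:
            occ.append(s + 1)        # all of P matched
            s += 1
        else:
            # align the rightmost occurrence in P of the bad character
            # T[s + k] under the mismatch; always shift at least 1
            s += max(1, (k + 1) - right.get(T[s + k], 0))
    return occ
```

On P = aba and T = bbabaxababay, the mismatch against x (which does not occur in P) lets the pattern jump its full length in one step, illustrating why the scan can skip large stretches of T in practice.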
In this chapter we consider the inexact matching and alignment problems that form the core of the field, along with others that illustrate the most general techniques. Some of those problems and techniques will be further refined and extended in the next chapters. We start with a detailed examination of the most classic inexact matching problem solved by dynamic programming, the edit distance problem. The motivation for inexact matching (and, more generally, sequence comparison) in molecular biology will be a recurring theme explored throughout the rest of the book. We will discuss many specific examples of how string comparison and inexact matching are used in current molecular biology. However, to begin, we concentrate on the purely formal and technical aspects of defining and computing inexact matching.
The edit distance between two strings
Frequently, one wants a measure of the difference or distance between two strings (for example, in evolutionary, structural, or functional studies of biological strings; in textual database retrieval; or in spelling correction methods). There are several ways to formalize the notion of distance between strings. One common, and simple, formalization [389, 299], called edit distance, focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters. The permitted edit operations are insertion of a character into the first string, the deletion of a character from the first string, or the substitution (or replacement) of a character in the first string with a character in the second string.
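A sketch of the classic dynamic program for edit distance, with a traceback that reports one optimal edit transcript (the function name and the M/R/I/D operation labels are mine):

```python
def edit_transcript(S1, S2):
    """Edit distance between S1 and S2 plus one optimal edit transcript:
    a string over M (match), R (replace), I (insert), D (delete)."""
    n, m = len(S1), len(S2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i      # delete all of S1's prefix
    for j in range(m + 1): D[0][j] = j      # insert all of S2's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                          # delete
                          D[i][j - 1] + 1,                          # insert
                          D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]))
    # trace back from (n, m) to recover one optimal transcript
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]):
            ops.append('M' if S1[i - 1] == S2[j - 1] else 'R'); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('D'); i -= 1
        else:
            ops.append('I'); j -= 1
    return D[n][m], ''.join(reversed(ops))
```

The table uses Θ(nm) time and space; the non-M operations in the transcript are exactly the edit operations, so their count equals the reported distance.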