The first version of these lecture notes was composed for a final-year undergraduate course at Chalmers University of Technology, in the spring semester of 2000. I wrote a revised and expanded version for the same course one year later. This is the third and final (?) version.
The notes are intended to be sufficiently self-contained that they can be read without any supplementary material, by anyone who has previously taken (and passed) some basic course in probability or mathematical statistics, plus some introductory course in computer programming.
The core material falls naturally into two parts: Chapters 2–6 on the basic theory of Markov chains, and Chapters 7–13 on applications to a number of randomized algorithms.
Markov chains are a class of random processes exhibiting a certain “memoryless property”, and the study of these – sometimes referred to as Markov theory – is one of the main areas in modern probability theory. This area cannot be avoided by a student aiming at learning how to design and implement randomized algorithms, because Markov chains are a fundamental ingredient in the study of such algorithms. In fact, any randomized algorithm can (often fruitfully) be viewed as a Markov chain.
I have chosen to restrict the discussion to discrete time Markov chains with finite state space. One reason for doing so is that several of the most important ideas and concepts in Markov theory arise already in this setting; these ideas are more digestible when they are not obscured by the additional technicalities arising from continuous time and more general state spaces.
For several of the most interesting results in Markov theory, we need to put certain assumptions on the Markov chains we are considering. It is an important task, in Markov theory just as in all other branches of mathematics, to find conditions that on the one hand are strong enough to have useful consequences, but on the other hand are weak enough to hold (and be easy to check) for many interesting examples. In this chapter, we will discuss two such conditions on Markov chains: irreducibility and aperiodicity. These conditions are of central importance in Markov theory, and in particular they play a key role in the study of stationary distributions, which is the topic of Chapter 5. We shall, for simplicity, discuss these notions in the setting of homogeneous Markov chains, although they do have natural extensions to the more general setting of inhomogeneous Markov chains.
We begin with irreducibility, which, loosely speaking, is the property that “all states of the Markov chain can be reached from all others”. To make this more precise, consider a Markov chain (X0, X1, …) with state space S = {s1, …, sk} and transition matrix P. We say that a state si communicates with another state sj, writing si → sj, if the chain has positive probability of ever reaching sj when we start from si.
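Operationally, si → sj holds exactly when there is a path of positive-probability transitions from si to sj, so the relation can be checked by a graph search over the transition matrix. The following is a minimal Python sketch (an illustration, not part of the notes), assuming P is given as a list of lists and states are indexed 0, …, k−1; whether a state counts as reaching itself in zero steps is a matter of convention, and the sketch answers yes.

```python
from collections import deque

def communicates(P, i, j):
    """Check s_i -> s_j: positive probability of ever reaching s_j from s_i.
    P is the k x k transition matrix as a list of lists; states are 0..k-1.
    Convention here: every state reaches itself in zero steps."""
    seen = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return True
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:   # follow positive-probability edges
                seen.add(v)
                queue.append(v)
    return False

def is_irreducible(P):
    """Irreducibility: all states communicate with all others."""
    k = len(P)
    return all(communicates(P, i, j) for i in range(k) for j in range(k))
```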
The general problem considered in this chapter is the following. We have a set S = {s1, …, sk} and a function f : S → R. The objective is to find an si ∈ S which minimizes (or, sometimes, maximizes) f(si).
When the size k of S is small, then this problem is of course totally trivial – just compute f(si) for i = 1, …, k and keep track sequentially of the smallest value so far, and for which si it was attained. What we should have in mind is the case where k is huge, so that this simple method becomes computationally too heavy to be useful in practice. Here are two examples.
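For reference, the exhaustive method just described takes only a few lines. A minimal Python sketch (mine, not from the notes), assuming f is an ordinary function and S a list:

```python
def exhaustive_minimum(S, f):
    """Scan all of S, keeping the smallest value of f seen so far.
    Takes exactly k evaluations of f: fine for small k, hopeless for huge k."""
    best_state, best_value = S[0], f(S[0])
    for s in S[1:]:
        value = f(s)
        if value < best_value:
            best_state, best_value = s, value
    return best_state, best_value
```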
Example 13.1: Optimal packing. Let G be a graph with vertex set V and edge set E. Suppose that we want to pack objects at the vertices of this graph, in such a way that
(i) at most one object can be placed at each vertex, and
(ii) no two objects can occupy adjacent vertices,
and that we want to squeeze in as many objects as possible under these constraints. If we represent objects by 1's and empty vertices by 0's, then, in the terminology of Example 7.1 (the hard-core model), the problem is to find (one of) the feasible configuration(s) ξ ∈ {0, 1}V which maximizes the number of 1's. […]
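To make the search space concrete, here is a small Python sketch (my illustration, not the book's) of the two constraints, assuming G is given as an adjacency list (a dict from each vertex to its neighbors) and a configuration ξ as a dict from vertices to 0 or 1:

```python
def is_feasible(config, adjacency):
    """Check constraints (i)-(ii): each vertex holds at most one object
    (value 0 or 1), and no two adjacent vertices both hold an object."""
    return all(not (config[u] == 1 and config[v] == 1)
               for u in adjacency for v in adjacency[u])

def packed_objects(config):
    """The quantity to maximize: the number of 1's in the configuration."""
    return sum(config.values())
```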
We present in this chapter algorithms to search for regular expressions in texts or biological sequences. Regular expressions are often used in text retrieval or computational biology applications to represent search patterns that are more complex than a string, a set of strings, or an extended string. We begin with a formal definition of a regular expression and the language (set of strings) it represents.
Definition. A regular expression RE is a string on the set of symbols Σ ∪ { ε, |, ·, ⋆, (, ) }, which is recursively defined as: the empty character ε; a character α ∈ Σ; and (RE1), (RE1 · RE2), (RE1 | RE2), and (RE1⋆), where RE1 and RE2 are regular expressions.
For instance, in this chapter we consider the regular expression (((A·T) | (G·A))·(((A·G) | ((A·A)·A))⋆)). When there is no ambiguity, we simplify our expressions by writing RE1RE2 instead of (RE1 · RE2). This way, we obtain a more readable expression, in our case (AT|GA)((AG|AAA)⋆). It is also usual to use the precedence order “⋆”, “·”, “|” to remove more parentheses, but we do not do this here. The symbols “·”, “|”, “⋆” are called operators. It is customary to add an extra postfix operator “+” to mean RE+ = RE · RE⋆. We define now the language represented by a regular expression.
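As a side note (not from the book), such expressions can be experimented with using Python's re module, whose syntax writes concatenation implicitly and the star as *:

```python
import re

# The book's example (AT|GA)((AG|AAA)*) in Python's regex syntax.
pattern = re.compile(r"(AT|GA)(AG|AAA)*")

for s in ["AT", "GAAG", "ATAAA", "ATAGAAA", "AA"]:
    print(s, "matches" if pattern.fullmatch(s) else "does not match")
```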
Definition. The language represented by a regular expression RE is a set of strings over Σ which is defined recursively on the structure of RE as follows:
• If RE is ε, then L(RE) = {ε}, the set containing only the empty string.
• If RE is α ∈ Σ, then L(RE) = {α}, a single string of one character.
• If RE is of the form (RE1), then L(RE) = L(RE1).
• If RE is of the form (RE1 · RE2), then L(RE) = L(RE1) · L(RE2), where W1 · W2 is the set of strings w such that w = w1w2, with w1 ∈ W1 and w2 ∈ W2. The operator “·” represents the classical concatenation of strings.
• If RE is of the form (RE1 | RE2), then L(RE) = L(RE1) ∪ L(RE2), the union of the two languages.
• If RE is of the form (RE1⋆), then L(RE) = L(RE1)⋆, where W⋆ is the set of strings formed by concatenating zero or more strings of W.
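This recursive definition translates directly into code. The following Python sketch (mine, not the book's) enumerates L(RE) up to a length bound, with RE represented as a nested tuple; the bound is necessary because the star operator makes most languages infinite:

```python
def concat(W1, W2, max_len):
    """W1 . W2: all concatenations w1 w2, kept only up to max_len."""
    return {w1 + w2 for w1 in W1 for w2 in W2 if len(w1) + len(w2) <= max_len}

def language(RE, max_len):
    """Enumerate L(RE) restricted to strings of length <= max_len.
    RE is a nested tuple: ('eps',), ('sym', a), ('cat', R1, R2),
    ('alt', R1, R2), or ('star', R1)."""
    op = RE[0]
    if op == 'eps':
        return {''}
    if op == 'sym':
        return {RE[1]}
    if op == 'cat':
        return concat(language(RE[1], max_len), language(RE[2], max_len), max_len)
    if op == 'alt':
        return language(RE[1], max_len) | language(RE[2], max_len)
    if op == 'star':
        W = language(RE[1], max_len)
        result, frontier = {''}, {''}
        while frontier:                    # grow W* layer by layer
            frontier = concat(frontier, W, max_len) - result
            result |= frontier
        return result
    raise ValueError("unknown operator: " + op)

# (AT|GA)((AG|AAA)*) as a tree, built letter by letter:
AT  = ('cat', ('sym', 'A'), ('sym', 'T'))
GA  = ('cat', ('sym', 'G'), ('sym', 'A'))
AG  = ('cat', ('sym', 'A'), ('sym', 'G'))
AAA = ('cat', ('cat', ('sym', 'A'), ('sym', 'A')), ('sym', 'A'))
RE  = ('cat', ('alt', AT, GA), ('star', ('alt', AG, AAA)))

print(sorted(language(RE, 6)))
# ['AT', 'ATAAA', 'ATAG', 'ATAGAG', 'GA', 'GAAAA', 'GAAG', 'GAAGAG']
```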
Before finishing, we would like to give some extra material that might be of interest.
First, we believe that it is extremely useful to know of freely available tools for on-line text searching, so we cover the existing software of this kind that we are aware of.
Second, we give pointers to other books, journals, conferences, and online resources one may want to read to go deeper into the area of text searching. This is also of interest to readers with a specific algorithmic problem not addressed in this book and not solved by the available software.
Finally, we include a section with problems related to combinatorial pattern matching. The section aims to give a brief overview of the different extensions to the basic text searching problem, explaining the main concepts and existing results, and pointing to more comprehensive material covering them.
Up to date information and errata related to this book will be available at http://www.dcc.uchile.cl/~gnavarro/FPMbook.
Available software
We present in this section a sample of freely available software for on-line pattern matching.
7.1.1 Gnu Grep
What it is. GNU (http://www.gnu.org) is an organization devoted to the development of free software. One of its products, Grep, permits fast searching of simple strings, multiple strings, and regular expressions in a set of files. Approximate searching is not supported. Gnu Grep is twice as fast as the classical Unix Grep.
Grep reports the lines in the file that contain matches. However, there are many configuration options that permit reporting the lines that do not match, the number of lines that match, whole files containing matches, and so on. The software provides a very powerful syntax that includes operators that go beyond regular expressions.
The string matching problem is that of finding all the occurrences of a given pattern p = p1p2 … pm in a large text T = t1t2 … tn, where both T and p are sequences of characters from a finite character set Σ. Given strings x, y, and z, we say that x is a prefix of xy, a suffix of yx, and a factor of yxz.
Many algorithms exist to solve this problem. The oldest and most famous are the Knuth-Morris-Pratt and the Boyer-Moore algorithms, both of which appeared in 1977. The first is worst-case linear in the size of the text, and this O(n) complexity is optimal in the worst case for any string matching algorithm. The second is O(mn) in the worst case but is sublinear on average, that is, it may avoid reading some characters of the text. An O(n log_{|Σ|}(m)/m) lower bound on the average complexity has been proved in [Yao79].
Since 1977, many studies have been undertaken to find simpler algorithms, optimal average-case algorithms, algorithms that could also search extended patterns, constant-space algorithms, and so on. A large variety of research directions have been tried, many of which have led to different string matching algorithms.
The aim of this chapter is not to present as many algorithms as possible, nor to give an exhaustive list of them. Instead, we will present the most efficient algorithms, which means the algorithms that for some pattern length and some alphabet size yield the best experimental results. Among those that have more or less the same efficiency, we will present the simplest.
The algorithms we present derive from three general search approaches, according to the way the text is searched. For all of them, a search window of the size of the pattern is slid from left to right along the text, and the pattern is searched for inside the window.
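To fix ideas, the simplest instance of this window scheme is the brute-force algorithm: compare the pattern against the window, then shift the window by one. A minimal Python sketch (an illustration, not one of the chapter's algorithms):

```python
def brute_force_search(p, T):
    """Slide a window of length m = len(p) over T and compare directly.
    Reports all starting positions of occurrences; O(mn) in the worst case."""
    m, n = len(p), len(T)
    occurrences = []
    for pos in range(n - m + 1):          # window is T[pos : pos + m]
        if T[pos:pos + m] == p:
            occurrences.append(pos)
    return occurrences

print(brute_force_search("ana", "bananas"))   # [1, 3]
```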
Up to now we have considered search patterns that are sequences of characters. However, in many cases one may be interested in a more sophisticated form of searching. The most complex patterns that we consider in this book are regular expressions, which are covered in Chapter 5. However, regular expression searching is costly in processing time and complex to program, so one should resort to it only if necessary. In many cases one needs far less flexibility, and the search problem can be solved more efficiently with much simpler algorithms.
We have designed this chapter on “extended strings” as a middle point between simple strings and regular expressions. We provide simple search algorithms for a number of enhancements over the basic string search, which can be solved more easily than general regular expressions. We focus on those used in text searching and computational biology applications.
We consider four extensions to the string search problem: classes of characters, bounded length gaps, optional characters, and repeatable characters. The first one allows specifying sets of characters at any pattern or text position. The second permits searching patterns containing bounded length gaps, which is of interest for protein searching (e.g., PROSITE patterns [Gus97, HBFB99]). The third allows certain characters to appear optionally in a pattern occurrence, and the last permits a given character to appear multiple times in an occurrence, which includes wild cards. We finally consider some limited multipattern search capabilities.
Different occurrences of a pattern may have different lengths, and there may be several occurrences starting or ending at the same text position. Among the several choices for reporting these occurrences, we choose to report all the initial or all the final occurrence positions, depending on what is more natural for each algorithm.
In this chapter we make heavy use of bit-parallel algorithms. With some extra work, other algorithms can be adapted to handle some extended patterns as well, but bit-parallel algorithms provide the maximum flexibility and in general the best performance.
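As a taste of the technique, the following Python sketch shows the classical Shift-And algorithm extended to classes of characters (my rendering, not the book's code): each pattern position is a set of accepted characters, a table B of bitmasks records which positions accept each character, and one bitwise update per text character advances all partial matches in parallel.

```python
def shift_and_classes(pattern, text):
    """Bit-parallel Shift-And search where the pattern is a list of
    character classes (sets); position i of an occurrence may be any
    character in pattern[i].  Reports starting positions."""
    m = len(pattern)
    # B[c]: bit i is set iff pattern position i accepts character c.
    B = {}
    for i, cls in enumerate(pattern):
        for c in cls:
            B[c] = B.get(c, 0) | (1 << i)
    D = 0                                  # bit i: pattern[0..i] matches a suffix of the text read so far
    occurrences = []
    for pos, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)   # extend all partial matches at once
        if D & (1 << (m - 1)):             # full pattern matched, ending at pos
            occurrences.append(pos - m + 1)
    return occurrences

# The class pattern A[CG]T: 'A', then 'C' or 'G', then 'T'.
print(shift_and_classes([{'A'}, {'C', 'G'}, {'T'}], "AACTAGTACT"))  # [1, 4, 7]
```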
String matching can be understood as the problem of finding a pattern with some property within a given sequence of symbols. The simplest case is that of finding a given string inside the sequence.
This is one of the oldest and most pervasive problems in computer science. Applications requiring some form of string matching can be found virtually everywhere. However, recent years have witnessed a dramatic increase in interest in string matching problems, especially within the rapidly growing communities of information retrieval and computational biology.
Not only are these communities facing a drastic increase in the text sizes they have to manage, but they are demanding more and more sophisticated searches. The patterns of interest are not just simple strings but also include wild cards, gaps, and regular expressions. The definition of a match may also permit slight differences between the pattern and its occurrence in the text. This is called “approximate matching” and is especially interesting in text retrieval and computational biology.
The problems arising in this field can be addressed from different viewpoints. In particular, string matching is well known for being amenable to approaches that range from the extremely theoretical to the extremely practical. The theoretical solutions have given rise to important algorithmic achievements, but they are rarely useful in practice: a well-known fact in the community is that simpler ideas work better in practice. Two typical examples are the famous Knuth-Morris-Pratt algorithm, which in practice is twice as slow as the brute-force approach, and the well-known Boyer-Moore family, whose most successful members in practice are highly simplified variants of the original proposal.
It is hard, however, to find the simpler ideas in the literature. In most current books on text algorithms, the string matching part covers only the classic theoretical algorithms. There are three reasons for that.