Early computers replaced calculators and typewriters, and programmers focused on scientific computing (calculations involving numbers) and string processing (manipulating sequences of alphanumeric characters, or strings). Ironically, in modern applications, string processing is an integral part of scientific computing, as strings are an appropriate model of the natural world in a wide range of applications, notably computational biology and chemistry. Beyond scientific applications, strings are the lingua franca of modern computing, with billions of computers having immediate access to an almost unimaginable number of strings.
Decades of research have met the challenge of developing fundamental algorithms for string processing and mathematical models for strings and string processing that are suitable for scientific studies. Until now, much of this knowledge has been the province of specialists, requiring intimate familiarity with the research literature. The appearance of this new book is therefore a welcome development. It is a unique resource that provides a thorough coverage of the field and serves as a guide to the research literature. It is worthy of serious study by any scientist facing the daunting prospect of making sense of huge numbers of strings.
The development of an understanding of strings and string processing algorithms has paralleled the emergence of the field of analytic combinatorics, under the leadership of the late Philippe Flajolet, to whom this book is dedicated. Analytic combinatorics provides powerful tools that can synthesize and simplify classical derivations and new results in the analysis of strings and string processing algorithms. As disciples of Flajolet and leaders in the field nearly since its inception, Philippe Jacquet and Wojciech Szpankowski are well positioned to provide a cohesive modern treatment, and they have done a masterful job in this volume.
Repeated patterns and related phenomena in words are known to play a central role in many facets of computer science, telecommunications, coding, data compression, data mining, and molecular biology. One of the most fundamental questions arising in such studies is the frequency of pattern occurrences in a given string known as the text. Applications of these results include gene finding in biology, executing and analyzing tree-like protocols for multiaccess systems, discovering repeated strings in Lempel–Ziv schemes and other data compression algorithms, evaluating string complexity and its randomness, synchronization codes, user searching in wireless communications, and detecting the signatures of an attacker in intrusion detection.
The basic pattern matching problem is the following: given a (possibly random) pattern w, or a set of patterns W, and a text X, determine how many times W occurs in X and how long it takes for W to occur in X for the first time. There are many variations of this basic setting, which is known as exact string matching. In approximate string matching, better known as generalized string matching, certain words from W are expected to occur in the text while other words are forbidden and cannot appear in the text. In some applications, especially in constrained coding and neural data spikes, one puts restrictions on the text (e.g., only text without the patterns 000 and 0000 is permissible), leading to constrained string matching. Finally, in the most general case, patterns from the set W do not need to occur as strings (i.e., consecutively) but rather as subsequences; this leads to subsequence pattern matching, also known as hidden pattern matching.
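The two quantities of exact string matching (how often a pattern occurs, and when it first occurs) and the subsequence variant can be illustrated with a minimal Python sketch. The function names are chosen here for illustration and are not taken from the book. The sketch counts overlapping occurrences and measures the first occurrence time as the number of symbols read until a pattern has fully appeared; both conventions are assumptions.

```python
def occurrence_starts(text, patterns):
    """0-based start indices of all occurrences, overlaps included."""
    return [i for i in range(len(text))
            for w in patterns if text.startswith(w, i)]


def first_occurrence_time(text, patterns):
    """Symbols read until some pattern has fully appeared, or None if never."""
    ends = [i + len(w) for i in range(len(text))
            for w in patterns if text.startswith(w, i)]
    return min(ends) if ends else None


def count_subsequence(text, w):
    """Number of ways w occurs in text as a (not necessarily contiguous) subsequence."""
    # ways[k] = number of embeddings of the prefix w[:k] in the text read so far
    ways = [1] + [0] * len(w)
    for c in text:
        # go right to left so each text symbol is used at most once per embedding
        for k in range(len(w), 0, -1):
            if w[k - 1] == c:
                ways[k] += ways[k - 1]
    return ways[len(w)]


print(occurrence_starts("aaaa", ["aa"]))      # [0, 1, 2]
print(first_occurrence_time("aaaa", ["aa"]))  # 2
print(count_subsequence("abab", "ab"))        # 3
```

In the text "abab" the word "ab" occurs twice as a string but three times as a subsequence, which is the distinction between exact and hidden pattern matching.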
These various pattern matching problems find a myriad of applications. Molecular biology provides an important source of applications of pattern matching, be it exact or approximate or subsequence pattern matching. There are examples in abundance: finding signals in DNA; finding split genes where exons are interrupted by introns; searching for starting and stopping signals in genes; finding tandem repeats in DNA.
In this chapter we consider generalized pattern matching, in which a set of patterns (rather than a single pattern) is given. We assume here that the pattern is a pair of sets of words $(W_0, W)$, where $W = \bigcup_{i=1}^{d} W_i$ consists of the sets $W_i \subset A^{m_i}$ (i.e., all words in $W_i$ have a fixed length $m_i$). The set $W_0$ is called the forbidden set. For $W_0 = \emptyset$ one is interested in the number of pattern occurrences $O_n(W)$, defined as the number of patterns from $W$ occurring in a text of length $n$ generated by a (random) source. Another parameter of interest is the number of positions in the text where a pattern from $W$ appears (clearly, several patterns may occur at the same position, but words from the same $W_i$ must occur at different positions); this quantity we denote as $\Pi_n$. If we define $\Pi_n^{(i)}$ as the number of positions where a word from $W_i$ occurs, then
$O_n(W) = \Pi_n^{(1)} + \cdots + \Pi_n^{(d)}.$
Notice that at any given position of the text and for a given $i$ only one word from $W_i$ can occur.
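The relation between the number of occurrences and the number of positions can be checked on a toy example. The sketch below assumes that an occurrence is located by the index of its last symbol (a convention adopted here so that words of different lengths can share a position); the text, the pattern sets, and the function names are hypothetical.

```python
def end_positions(text, words):
    """Set of 0-based indices at which an occurrence of some word ends."""
    return {i + len(w) - 1
            for w in words
            for i in range(len(text) - len(w) + 1)
            if text.startswith(w, i)}


def occurrence_statistics(text, pattern_sets):
    """Return (O_n, Pi_n, per-set position counts) for sets of equal-length words."""
    per_set = [end_positions(text, words) for words in pattern_sets]
    counts = [len(s) for s in per_set]      # positions hit by each set W_i
    o_n = sum(counts)                       # total number of pattern occurrences
    pi_n = len(set().union(*per_set))       # positions hit by at least one pattern
    return o_n, pi_n, counts


text = "ababab"
pattern_sets = [{"ab"}, {"aba", "bab"}]     # W_1 has length 2, W_2 has length 3
print(occurrence_statistics(text, pattern_sets))  # (7, 5, [3, 4])
```

Because all words in one set have the same length, two distinct words of that set cannot end at the same index, so the size of each per-set position set equals the number of occurrences from that set. Here "ab" and "bab" both end at indices 3 and 5, which is why the occurrence count (7) exceeds the position count (5).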
For $W_0 \neq \emptyset$ one studies the number of occurrences $O_n(W)$ under the condition that $O_n(W_0) = 0$, that is, there is no occurrence of a pattern from $W_0$ in the text. This could be called constrained pattern matching since one restricts the text to those strings that do not contain strings from $W_0$. A simple version of constrained pattern matching was discussed in Chapter 3 (see also Exercises 3.3, 3.6, and 3.10).
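For very short texts the constrained setting can be explored by brute force. The following sketch assumes an unbiased memoryless binary source, so that conditioning on the absence of forbidden words makes all permitted texts equally likely. It is an illustration only, not the analytic method of the chapter, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_occurrences(text, words):
    """Number of (possibly overlapping) occurrences of any word in text."""
    return sum(text.startswith(w, i) for w in words for i in range(len(text)))


def constrained_mean(n, forbidden, patterns, alphabet="01"):
    """Count texts of length n avoiding `forbidden`, and the mean number of
    occurrences of `patterns` over those texts (uniform weighting assumed)."""
    texts = ("".join(t) for t in product(alphabet, repeat=n))
    allowed = [t for t in texts if count_occurrences(t, forbidden) == 0]
    total = sum(count_occurrences(t, patterns) for t in allowed)
    # assumes at least one text is permitted
    return len(allowed), Fraction(total, len(allowed))


print(constrained_mean(4, ["00"], ["11"]))  # (8, Fraction(5, 4))
```

Of the 16 binary texts of length 4, eight avoid 00, and over those eight the pattern 11 occurs 5/4 times on average. The enumeration grows exponentially in the text length, so it is only useful as a sanity check for small cases.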
In this chapter we first present an analysis of generalized pattern matching with $W_0 = \emptyset$ and $d = 1$, which we call the reduced pattern set (i.e., no pattern is a substring of another pattern).
The discrete Green's function (without boundary) $\mathbb{G}$ is a pseudo-inverse of the combinatorial Laplace operator of a graph $G = (V, E)$. We reveal the intimate connection between Green's function and the theory of exact stopping rules for random walks on graphs. We give an elementary formula for Green's function in terms of state-to-state hitting times of the underlying graph. Namely, $\mathbb{G}(i,j) = \pi_j \bigl( H(\pi,j) - H(i,j) \bigr)$, where $\pi_i$ is the stationary distribution at vertex $i$, $H(i,j)$ is the expected hitting time for a random walk starting from vertex $i$ to first reach vertex $j$, and $H(\pi,j) = \sum_{k \in V} \pi_k H(k,j)$. This formula also holds for the digraph Laplace operator.
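The hitting time formula can be tried out on a small graph with exact rational arithmetic. The sketch below makes two assumptions that the abstract does not spell out: the walk is the simple random walk with transition matrix $P$, and the Laplace operator is taken to be $I - P$, so that the pseudo-inverse property reads $\mathbb{G}(I - P) = I - \mathbf{1}\pi^{T}$. The example graph and all function names are hypothetical.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination over Fractions."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def hitting_times(p):
    """h[i][j] = expected number of steps for a walk from i to first reach j."""
    n = len(p)
    h = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # For i != j: H(i,j) - sum over k != j of P[i][k] H(k,j) = 1
        a = [[(1 if i == k else 0) - p[i][k] for k in others] for i in others]
        for i, value in zip(others, solve(a, [Fraction(1)] * len(others))):
            h[i][j] = value
    return h


def green(p, pi):
    """G(i,j) = pi_j * (H(pi,j) - H(i,j)), with H(pi,j) = sum_k pi_k H(k,j)."""
    n = len(p)
    h = hitting_times(p)
    h_pi = [sum(pi[k] * h[k][j] for k in range(n)) for j in range(n)]
    return [[pi[j] * (h_pi[j] - h[i][j]) for j in range(n)] for i in range(n)]


def is_pseudo_inverse(g, p, pi):
    """Check G (I - P) = I - 1 pi^T entry by entry (assumed normalization)."""
    n = len(p)
    return all(
        sum(g[i][k] * ((1 if k == j else 0) - p[k][j]) for k in range(n))
        == (1 if i == j else 0) - pi[j]
        for i in range(n) for j in range(n))


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1
deg = [sum(row) for row in adj]
p = [[Fraction(adj[i][j], deg[i]) for j in range(n)] for i in range(n)]
pi = [Fraction(d, sum(deg)) for d in deg]   # stationary distribution: degree / 2|E|
g = green(p, pi)
print(is_pseudo_inverse(g, p, pi))  # True
```

Using `Fraction` keeps the check exact, so the comparison in `is_pseudo_inverse` needs no floating-point tolerance. For larger graphs a numerical linear algebra library would be the more practical choice.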
The most important characteristics of a stopping rule are its exit frequencies, which are the expected number of exits of a given vertex before the rule halts the walk. We show that Green's function is, in fact, a matrix of exit frequencies plus a rank one matrix. In the undirected case, we derive spectral formulas for Green's function and for some mixing measures arising from stopping rules. Finally, we further explore the exit frequency matrix point of view, and discuss a natural generalization of Green's function for any distribution τ defined on the vertex set of the graph.