A series of important applications of combinatorics on words has emerged with the development of computerized text and string processing, especially in biology and in linguistics. The aim of this volume is to present, in a unified treatment, some of the major fields of applications. The main topics that are covered in this book are
Algorithms for manipulating text, such as string searching, pattern matching, and testing a word for special properties.
Efficient data structures for retrieving information on large indexes, including suffix trees and suffix automata.
Combinatorial, probabilistic, and statistical properties of patterns in finite words, and more general patterns, under various assumptions on the sources of the text.
Inference of regular expressions.
Algorithms for repetitions in strings, such as maximal run or tandem repeats.
Linguistic text processing, especially analysis of the syntactic and semantic structure of natural language. Applications to language processing with large dictionaries.
Enumeration, generation, and sampling of complex combinatorial structures by their encodings in words.
This book is actually the third of a series of books on combinatorics on words. Lothaire's “Combinatorics on Words” appeared in its first printing in 1984 as Volume 17 of the Encyclopedia of Mathematics. It was based on the impulse of M. P. Schützenberger's scientific work. Since then, the theory has developed into a large scientific domain. It was reprinted in 1997 in the Cambridge Mathematical Library.
Repeated patterns and related phenomena in words are known to play a central role in many facets of computer science, telecommunications, coding, data compression, and molecular biology. One of the most fundamental questions arising in such studies is the frequency of pattern occurrences in another string known as the text. Applications of these results include gene finding in biology, code synchronization, user search in wireless communications, detecting signatures of an attacker in intrusion detection, and discovering repeated strings in the Lempel-Ziv schemes and other data compression algorithms.
In basic pattern matching, one determines, for a given (or random) pattern w or set of patterns W and a text X, how many times W occurs in the text and how long it takes for W to occur in X for the first time. These two problems are not unrelated, as we have already seen in Chapter 6. Throughout this chapter we allow patterns to overlap and count overlapping occurrences separately. For example, w = abab occurs three times in the text X = bababababb.
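As a small illustration (ours, not part of the text), the overlapping-occurrence convention can be checked in a few lines of Python; the function name is our own:

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, counting overlapping matches separately."""
    count = 0
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1  # advance by only one position so overlapping matches are found

print(count_overlapping("bababababb", "abab"))  # 3
```

Advancing by one position after each match, rather than past the match, is exactly what makes the count include overlaps.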
We consider pattern matching problems in a probabilistic framework in which the text is generated by a probabilistic source while the pattern is given. In Chapter 1 various probabilistic sources were discussed. Here we succinctly summarize assumptions adopted in this chapter. In addition, we introduce a new general source known as a dynamical source recently proposed by Vallée. In Chapter 2 algorithmic aspects of pattern matching and various efficient algorithms for finding patterns were discussed.
The application of statistical methods to natural language processing has been remarkably successful over the past two decades. The wide availability of text and speech corpora has played a critical role in their success since, as for all learning techniques, these methods rely heavily on data. Many of the components of complex natural language processing systems, for example, text normalizers, morphological or phonological analyzers, part-of-speech taggers, grammars or language models, pronunciation models, context-dependency models, acoustic Hidden-Markov Models (HMMs), are statistical models derived from large data sets using modern learning techniques. These models are often given as weighted automata or weighted finite-state transducers either directly or as a result of the approximation of more complex models.
Weighted automata and transducers are the finite automata and finite-state transducers described in Chapter 1 Section 1.5 with the addition of some weight to each transition. Thus, weighted finite-state transducers are automata in which each transition, in addition to its usual input label, is augmented with an output label from a possibly different alphabet, and carries some weight. The weights may correspond to probabilities or log-likelihoods or they may be some other costs used to rank alternatives. More generally, as we shall see in the next section, they are elements of a semiring set. Transducers can be used to define a mapping between two different types of information sources, for example, word and phoneme sequences.
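To make the definition concrete, here is a minimal sketch (ours, with invented states and weights) of a weighted automaton over the tropical semiring (min, +): the weight of a string is the minimum, over accepting paths, of the sum of the transition weights plus the final-state weight.

```python
INF = float("inf")

# transitions[state][symbol] = list of (next_state, weight); all values invented
transitions = {
    0: {"a": [(0, 1.0), (1, 0.5)]},
    1: {"b": [(1, 0.2)]},
}
initial = 0
final = {1: 0.0}                         # accepting state with its final weight

def weight(word):
    """Tropical-semiring weight of word: min over accepting paths of summed weights."""
    best = {initial: 0.0}                # best-known weight to reach each state
    for sym in word:
        nxt = {}
        for q, w in best.items():
            for q2, w2 in transitions.get(q, {}).get(sym, []):
                nxt[q2] = min(nxt.get(q2, INF), w + w2)
        best = nxt
    return min((w + final[q] for q, w in best.items() if q in final),
               default=INF)

print(weight("aab"))  # 1.7
```

Replacing (min, +) by (+, x) with probabilities as weights turns the same recursion into the computation of a string's total probability, which is one way the semiring abstraction pays off.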
Feedback shift register (FSR) sequences have been widely used as synchronization, masking, or scrambling codes and for white noise signals in communication systems, signal sets in CDMA communications, key stream generators in stream cipher cryptosystems, random number generators in many cryptographic primitive algorithms, and testing vectors in hardware design. Golomb's popular book Shift Register Sequences, first published in 1967 and revised in 1982, is a pioneering book that discusses this type of sequence. In this chapter, we introduce this topic and discuss the synthesis and the analysis of periodicity of linear feedback shift register (LFSR) sequences. We give different (though equivalent) definitions and representations for LFSR sequences and point out which are most suitable for either implementation or analysis. This chapter contains seven sections, which are organized as follows. In Section 4.1, we give a general description of feedback shift registers at the gate level for the binary case and as a finite field configuration for the q-ary case. In Sections 4.2–4.4, we introduce the definition of LFSR sequences from the point of view of polynomial rings and discuss their characteristic polynomials, minimal polynomials, and periods. Then, we show the decomposition of LFSR sequences. We provide the matrix representation of LFSR sequences in Section 4.5 as another historical approach and discuss their trace representation for the irreducible case in detail in Section 4.6, which is a more modern approach. (The general case will be treated in Chapter 6.) LFSRs with primitive minimal polynomials are basic building blocks for nonlinear generators.
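The recurrence point of view can be sketched directly in code. The following toy implementation (ours, not from the chapter) generates an LFSR sequence from its linear recurrence over GF(2); with the primitive characteristic polynomial x^4 + x + 1, i.e. a_{k+4} = a_{k+1} + a_k, it produces an m-sequence of period 2^4 - 1 = 15:

```python
def lfsr_sequence(coeffs, init, n):
    """First n terms of the binary LFSR sequence with recurrence
    a_{k+m} = c_0 a_k XOR c_1 a_{k+1} XOR ... XOR c_{m-1} a_{k+m-1},
    where coeffs = [c_0, ..., c_{m-1}] and init gives a_0, ..., a_{m-1} (nonzero)."""
    a = list(init)
    m = len(coeffs)
    while len(a) < n:
        nxt = 0
        for c, b in zip(coeffs, a[-m:]):
            nxt ^= c & b                 # GF(2) arithmetic: AND then XOR
        a.append(nxt)
    return a[:n]

# x^4 + x + 1 gives a_{k+4} = a_{k+1} XOR a_k, i.e. coeffs [1, 1, 0, 0]
seq = lfsr_sequence([1, 1, 0, 0], [0, 0, 0, 1], 30)
print(seq[:15])               # one full period
print(seq[:15] == seq[15:])   # True: the sequence repeats with period 15
```

Because the polynomial is primitive of degree 4, every nonzero initial state visits all 15 nonzero register states before repeating, which is why the period is maximal.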
Randomness of a sequence refers to the unpredictability of the sequence. Any deterministically generated sequence used in practical applications is not truly random. The best that can be done here is to single out certain properties as being associated with randomness and to accept any sequence that has these properties as random or, more properly, as a pseudorandom sequence. In this chapter, we will discuss the randomness of sequences whose elements are taken from a finite field. In Section 5.1, we present Golomb's three randomness postulates for binary sequences, namely the balance property, the run property, and the (ideal) two-level autocorrelation property, and the extension of these randomness postulates to nonbinary sequences. M-sequences over a finite field possess many extraordinary randomness properties except for having the lowest possible linear span, which has stimulated researchers for years to seek nonlinear sequences with similarly favorable properties. In Section 5.2, we show that m-sequences satisfy Golomb's three randomness postulates. In Section 5.3, we introduce the interleaved structures of m-sequences and the subfield decomposition of m-sequences. In Sections 5.4–5.6, we present the shift-and-add property, constant-on-cosets property, and 2-tuple balance property of m-sequences, respectively. The last section is devoted to the classification of binary sequences of period 2^n − 1.
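As a small self-contained check (ours, not from the chapter), the period-15 m-sequence generated by x^4 + x + 1 can be tested against the balance and run postulates directly:

```python
import itertools

# m-sequence of period 15: a_{k+4} = a_{k+1} XOR a_k (primitive x^4 + x + 1)
a = [0, 0, 0, 1]
for k in range(11):
    a.append(a[k] ^ a[k + 1])

# Balance property: in one period, the counts of ones and zeros differ by exactly one.
print(a.count(1), a.count(0))    # 8 7

# Run property: half of the runs have length 1, a quarter length 2, and so on.
# (A linear scan suffices here because this period begins with 0 and ends with 1,
# so no run wraps around cyclically.)
runs = sorted(len(list(g)) for _, g in itertools.groupby(a))
print(runs)                      # [1, 1, 1, 1, 2, 2, 3, 4]
```

Of the eight runs, four have length 1, two have length 2, and one run each of 0s and 1s has the maximal lengths 3 and 4, exactly as the run postulate prescribes.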
Golomb's randomness postulates and randomness criteria
We discussed some general properties of auto- and crosscorrelation in Chapter 1 for sequences whose elements are taken from the real number field or the complex number field.
This chapter shows some examples of applications of combinatorics on words to number theory with a brief incursion into physics. These examples have a common feature: the notion of morphism of the free monoid. Such morphisms have been widely studied in combinatorics on words; they generate infinite words which can be considered as highly ordered, and which occur ubiquitously in mathematics, theoretical computer science, and theoretical physics.
The first part of this chapter is devoted to the notion of automatic sequences and uniform morphisms, in connection with the transcendence of formal power series with coefficients in a finite field. Namely, it is possible to characterize algebraicity of these series in a simple way: a formal power series is algebraic if and only if the sequence of its coefficients is automatic, that is, if it is the image by a letter-to-letter map of a fixed point of a uniform morphism. This criterion is known as Christol's theorem. A central tool in the study of automatic sequences is the notion of kernel of an infinite word (sequence) over a finite alphabet: this is the set of subsequences obtained by certain decimations. A rephrasing of Christol's theorem is that transcendence of a formal power series over a finite field is equivalent to infiniteness of the kernel of the sequence of its coefficients: this will be illustrated in this chapter.
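The kernel can be explored computationally. The following sketch (ours, not from the chapter) approximates the 2-kernel of the Thue–Morse sequence by collecting prefixes of the decimated subsequences n ↦ t(2^k n + r); for this 2-automatic sequence the count stabilizes at two, since every such subsequence is either the sequence itself or its complement:

```python
def thue_morse(n):
    """n-th Thue-Morse bit: parity of the number of 1s in the binary expansion of n."""
    return bin(n).count("1") % 2

def kernel_prefixes(f, depth, length=64):
    """Approximate the 2-kernel of f: distinct length-`length` prefixes of the
    subsequences n -> f(2^k * n + r) for all k <= depth and 0 <= r < 2^k."""
    seen = set()
    for k in range(depth + 1):
        for r in range(2 ** k):
            seen.add(tuple(f(2 ** k * n + r) for n in range(length)))
    return seen

# Thue-Morse: t(2n) = t(n) and t(2n+1) = 1 - t(n), so the 2-kernel has
# exactly two elements, and the count of distinct prefixes stabilizes at 2.
print(len(kernel_prefixes(thue_morse, 4)))  # 2
```

Comparing finite prefixes is of course only a heuristic for equality of infinite sequences, but for Thue–Morse the two identities in the comment make the answer exact.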
This chapter illustrates the use of words to derive enumeration results and algorithms for sampling and coding.
Given a family C of combinatorial structures, endowed with a size such that the subset Cn of objects of size n is finite, we consider three problems:
(i) Counting: determine, for all n ≥ 0, the cardinality Card (Cn) of the set Cn of objects of size n.
(ii) Sampling: design an algorithm RandC that, for any n, produces a random object uniformly chosen in Cn; in other words, the algorithm must satisfy P(RandC(n) = O) = 1/Card (Cn) for any object O ∊ Cn.
(iii) Optimal coding: construct a function φ that maps injectively objects of C on words of {0, 1}* in such a way that an object O of size n is coded by a word φ(O) of length roughly bounded above by log2 Card (Cn).
These three problems have in common an enumerative flavour, in the sense that they are immediately solved if a list of all objects of size n is available. However, since in general there is an exponential number of objects of size n in the families in which we are interested, this solution is in no way satisfactory. For a wide class of so-called decomposable combinatorial structures, including nonambiguous algebraic languages, algorithms with polynomial complexity can be derived from the rather systematic recursive method. Our aim is to explore classes of structures for which an even tighter link exists between counting, sampling, and coding.
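A toy instance of the recursive method (ours, not from the chapter): binary words avoiding the factor 11 are counted by Fibonacci-type numbers, and a uniform sampler follows by choosing each step with probability proportional to the subcounts.

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n):
    """Number of binary words of length n with no factor '11'."""
    if n <= 1:
        return n + 1                       # count(0) = 1 (empty word), count(1) = 2
    return count(n - 1) + count(n - 2)     # word starts with '0', or with '10'

def sample(n):
    """Uniform random word of length n with no '11', by the recursive method:
    branch into the decomposition with probability proportional to subcounts."""
    if n == 0:
        return ""
    if n == 1:
        return random.choice("01")
    if random.randrange(count(n)) < count(n - 1):
        return "0" + sample(n - 1)         # chosen with probability count(n-1)/count(n)
    return "10" + sample(n - 2)

w = sample(20)
print(len(w), "11" not in w)               # 20 True
```

The same counts also give a coding scheme in the spirit of problem (iii): ranking a word among the Card (Cn) possibilities encodes it in about log2 Card (Cn) bits.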
This chapter introduces various mathematical models and combinatorial algorithms that are used to infer network expressions that appear repeated in a word or are common to a set of words, where by network expression is meant a regular expression without Kleene closure on the alphabet of the input word(s). A network expression on such an alphabet is therefore any expression built up of concatenation and union operators. For example, the expression A(C + G)T concatenates A with the union (C + G) and with T. Inferring network expressions means discovering such expressions, which are initially unknown. The only input is the word(s) where the repeated (or common) expressions will be sought. This is in contrast with another problem, with which we shall not be concerned, of searching for a known expression in a word or words, both of which are in that case part of the input. The inference of network expressions has many applications, notably in molecular biology, system security, text mining, etc. Because of the richness of the mathematical and algorithmic problems posed by molecular biology, we concentrate on applications in this area. The network expressions considered may therefore contain spacers, where by spacer is meant any number of don't care symbols (a don't care is a symbol that matches anything). Constrained spacers are consecutive don't care symbols whose number ranges over a fixed interval of values. Network expressions with don't care symbols but no spacers are called “simple” while network expressions with spacers are called “flexible” if the spacers are unconstrained, and “structured” otherwise.
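As a minimal illustration (ours), a network expression without spacers can be represented as a list of letter sets and located by a naive scan; a don't care symbol is then simply the full alphabet:

```python
def occurrences(expr, text):
    """Start positions where the network expression matches.
    expr is a list of sets of allowed letters, one set per position."""
    m = len(expr)
    return [i for i in range(len(text) - m + 1)
            if all(text[i + j] in expr[j] for j in range(m))]

expr = [{"A"}, {"C", "G"}, {"T"}]          # the expression A(C + G)T
print(occurrences(expr, "ACTTAGTACGT"))    # [0, 4]
```

This only solves the search problem for a known expression, which the chapter sets aside; the inference problem is the converse, and harder: the expression itself is unknown and must be discovered from its repeated occurrences.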
This book is the product of a fruitful collaboration between one of the earliest developers of the theory and applications of binary sequences with favorable correlation properties and one of the currently most active younger contributors to research in this area. Each of us has taught university courses based on this material and benefited from the feedback obtained from the students in those courses. Our goal has been to produce a book that achieves a balance between the theoretical aspects of binary sequences with nearly ideal autocorrelation functions and the applications of these sequences to signal design for communications, radar, cryptography, and so on. This book is intended for use as a reference work for engineers and computer scientists in the applications areas just mentioned, as well as to serve as a textbook for a course in this important area of digital communications. Enough material has been included to enable an instructor to make some choices about what to cover in a one-semester course. However, we have referred the reader to the literature on those occasions when the inclusion of further detail would have resulted in a book of inordinate length.
We plan to maintain a Web site at http://calliope.uwaterloo.ca/∼ggong/book/book.htm for additions, corrections, and the continual updating of the material in this book.
Repetitions (periodicities) in words are important objects that play a fundamental role in combinatorial properties of words and their applications to string processing, such as compression or biological sequence analysis. Using properties of repetitions allows one to speed up pattern matching algorithms.
The problem of efficiently identifying repetitions in a given word is one of the classical pattern matching problems. Recently, searching for repetitions in strings has received new motivation due to biosequence analysis. In DNA sequences, successively repeated fragments often bear important biological information, and their presence is characteristic of many genomic structures (such as telomere regions). From a practical viewpoint, satellites and Alu repeats are involved in chromosome analysis and genotyping, and are thus of major interest to genomic researchers. Different biological studies based on the analysis of tandem repeats have therefore been carried out, and databases of tandem repeats in certain species have even been compiled.
In this chapter, we present a general efficient approach to computing different periodic structures in words. It is based on two main algorithmic techniques – a special factorization of the word and so-called longest extension functions – described in Section 8.3. Different applications of this method are described in Sections 8.4, 8.5, 8.6, 8.7, and 8.8. These sections are preceded by Section 8.2 devoted to combinatorial enumerative properties of repetitions. Bounding the maximal number of repetitions is necessary for proving complexity bounds of corresponding search algorithms.
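For contrast with the efficient factorization-based method of this chapter, a naive search for the simplest periodic structures, tandem repeats of the form uu (squares), can be sketched as follows (our code, quadratic in the word length even per period tested):

```python
def squares(w):
    """All occurrences (i, p) of squares in w: positions i and periods p
    such that w[i:i+p] == w[i+p:i+2p]."""
    out = []
    n = len(w)
    for p in range(1, n // 2 + 1):             # candidate period
        for i in range(n - 2 * p + 1):         # candidate start position
            if w[i:i + p] == w[i + p:i + 2 * p]:
                out.append((i, p))
    return out

print(squares("abaabaa"))   # [(2, 1), (5, 1), (0, 3), (1, 3)]
```

The word abaabaa contains the squares aa (twice) and abaaba = (aba)^2 together with its shifted copy (baa)^2; the longest-extension technique of Section 8.3 finds such repetitions without re-comparing overlapping factors from scratch.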
Statistical and probabilistic properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biological sequences. The latter will serve as the key example in this chapter. We only consider finite words.
Two main aspects of word occurrences in biological sequences are: where do they occur and how many times do they occur? An important problem, for instance, was to determine the statistical significance of a word frequency in a DNA sequence. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided, for instance, in E. coli, and the Cross-over Hotspot Instigator sites in several bacteria. Identifying over- and underrepresented words in a particular genome is a very common task in genome analysis.
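A crude version of such a significance computation can be sketched as follows (our code; the binomial-style approximation below ignores the overlap correlations between occurrence positions that exact methods account for):

```python
from math import sqrt

def z_score(text, word, probs):
    """Observed count of `word` in `text` and a rough z-score under an
    i.i.d. letter model with letter probabilities `probs`."""
    n, m = len(text), len(word)
    obs = sum(text[i:i + m] == word for i in range(n - m + 1))
    p = 1.0
    for c in word:
        p *= probs[c]                     # probability of the word at a fixed position
    mean = (n - m + 1) * p                # expected number of occurrences
    var = mean * (1 - p)                  # binomial-style variance approximation
    return obs, (obs - mean) / sqrt(var)

# uniform letter model; the text and the restriction site GAATTC are toy inputs
probs = {c: 0.25 for c in "ACGT"}
obs, z = z_score("ACGTACGTGAATTCACGT", "GAATTC", probs)
print(obs)   # 1
```

Even this toy example shows why care is needed: in a short text a single occurrence of a long word already produces a large z-score, so realistic significance statements need the exact distributional results developed in this chapter.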
Statistical methods of studying the distribution of the word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research.
Binary sequences of period N with 2-level autocorrelation have many important applications in communications and cryptology. From Section 7.1, 2-level autocorrelation sequences are in natural correspondence with cyclic Hadamard difference sets with ν = N, κ = (N − 1)/2, and λ = (N − 3)/4. For this reason, they are named cyclic Hadamard sequences. In this chapter, 2-level autocorrelation always means ideal 2-level autocorrelation. There are three classic constructions for binary 2-level autocorrelation sequences that were known before 1997 (including some generalizations along these lines after 1997). One is m-sequences, described in Chapter 5, with period N = 2^n − 1. The second construction is based on a number theory approach, including three types of sequences in Chapter 2, which are the quadratic residue sequences, Hall sextic residue sequences, and twin prime sequences. The period of such a sequence is either a prime or a product of twin primes. The third construction is associated with intermediate subfields. The resulting sequences have subfield decompositions and period N = 2^n − 1. They include GMW sequences, cascaded GMW sequences, and generalized GMW sequences. Although the resulting sequences are binary, this construction relies heavily on intermediate fields and compositions of functions. As a consequence, it involves sequences over intermediate fields that are not binary sequences. The content of this chapter is organized as follows.
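The ideal 2-level autocorrelation is easy to verify on a small example (our sketch): for the ±1 version of the period-15 m-sequence generated by x^4 + x + 1, the periodic autocorrelation C(τ) = Σ_t s_t s_{t+τ} equals N = 15 at shift zero and −1 at every nonzero shift.

```python
# m-sequence of period 15: a_{k+4} = a_{k+1} XOR a_k (primitive x^4 + x + 1)
a = [0, 0, 0, 1]
for k in range(11):
    a.append(a[k] ^ a[k + 1])

s = [1 - 2 * b for b in a]     # map bit 0 -> +1 and bit 1 -> -1
N = len(s)

# periodic (cyclic) autocorrelation at every shift tau
C = [sum(s[t] * s[(t + tau) % N] for t in range(N)) for tau in range(N)]
print(C)   # [15, -1, -1, ..., -1]
```

The shift-and-add property of Section 5.4 explains the value −1: the termwise sum of the sequence with a nonzero shift of itself is again a shift of the sequence, hence has one more 1 than 0 per period.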
In the first three of the applications mentioned in the title of this chapter, one of the objectives (often the major objective) is to determine a point in time with great accuracy. In radar and sonar, we want to determine the round-trip time from transmitter to target to receiver very accurately, because the one-way time (half of the round-trip time) is a measure of the distance to the target (called the range of the target).
The simplest approach would be to send out a pure impulse of energy and measure the time until it returns. The ideal impulse would be virtually instantaneous in duration, but with such high amplitude that the total energy contained in the pulse would be significant, much like a Dirac delta function. However, the Dirac delta function not only fails to exist as a mathematical function, but it is also unrealizable as a physical signal. Close approximations to it – very brief signals with very large amplitudes – may be valid mathematically, but are impractical to generate physically. Any actual transmitter will have an upper limit on peak power output, and hence a short pulse will have a very restricted amount of total energy: at most, the peak power times the pulse duration. More total energy can be transmitted if we extend the duration; but if we transmit at uniform power over an extended duration, we do not get a sharp determination of the round-trip time. This dilemma is illustrated in Figure 12.1.