To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
David Sarnoff, long the chairman of RCA and a pioneer in the electronics industry, summarized his career in these words: “I hitched my wagon to the electron rather than the proverbial star.”
The world has followed suit. The past sixty years have witnessed the most rapid transformation of human activity in history, with digital electronic technology as the driving force.
Nothing has been left untouched. The way people communicate, live, work, travel, and consume products and services have all changed forever. The digital revolution has spurred the rapid expansion of economic activity across the face of the planet.
In this book I will explore the unprecedented outburst of electronic innovation that created the digital revolution. Based on this example, I will examine how innovation works, what it has achieved, and what forms we can expect it to assume in the near future.
Since innovation does not happen in a vacuum, I will also explore the political and economic factors that can accelerate or impede changes in the industrial landscape. One of these is globalization, which creates neither a level playing field nor a truly “flat world.” Governments everywhere are focused on industrializing as quickly as possible in the face of growing competition. As a result, attempts to gain national competitive advantage by building artificial walls are not going away.
Defining innovation
Before outlining the issues to be covered, we must clarify what constitutes an “innovation” in the first place.
In this chapter, we consider the problem of searching for all the occurrences of a fixed string in a text. The methods described here are based on combinatorial properties. They apply when the string and the text are in central memory or when only a part of the text is in a memory buffer. Contrary to the solutions presented in the previous chapter, the search does not process the text in a strictly sequential way.
The algorithms of the chapter scan the text through a window having the same length as the pattern length. The process that consists in determining if the content of the window matches the string is called an attempt, following the sliding window mechanism described in Section 1.5. After the end of each attempt the window is shifted toward the end of the text. The executions of these algorithms are thus successions of attempts followed by shifts.
We consider algorithms that, during each attempt, perform the comparisons between letters of the string and of the window from right to left, that is to say, in the opposite direction of the usual reading direction. These algorithms match thus suffixes of the string inside the text. The interest of this technique is that during an attempt the algorithm accumulates information on the text that is possibly processed later on.
This book presents a broad panorama of the algorithmic methods used for processing texts. For this reason it is a book on algorithms, but whose object is focused on the handling of texts by computers. The idea of this publication results from the observation that the rare books entirely devoted to the subject are primarily monographs of research. This is surprising because the problems of the field have been known since the development of advanced operating systems, and the need for effective solutions becomes essential because the massive use of data processing in office automation is crucial in many sectors of the society. In 1985, Galil pointed out several unsolved questions in the field, called after him, Stringology (see [12]). Most of them are still open.
In a written or vocal form, text is the only reliable vehicle of abstract concepts. Therefore, it remains the privileged support of information systems, despite of significant efforts toward the use of other media (graphic interfaces, systems of virtual reality, synthesis movies, etc.). This aspect is still reinforced by the use of knowledge databases, legal, commercial, or others, which develop on the Internet. Thanks, in particular, to the Web services.
The contents of the book carry over into formal elements and technical bases required in the fields of information retrieval, of automatic indexing for search engines, and more generally of software systems, which includes the edition, the treatment, and the compression of texts.
This chapter addresses the problem of searching a fixed text. The associated data structure described here is known as the Suffix Array of the text. The searching procedure is presented first for a list of strings in Sections 4.1 and 4.2, and then adapted to a fixed text in the remaining sections.
The first three sections consider the question of searching a list of strings memorized in a table. The table is supposed to be fixed and can thus be preprocessed to speed up later accesses to it. The search for a string in a lexicon or a dictionary that can be stored in central memory of a computer is an application of this question.
We describe how to lexicographically sort the strings of the list (in maximal time proportional to the total length of the strings) in order to be able to apply a binary search algorithm. Actually, the sorting is not entirely sufficient to get an efficient search. The precomputation and the utilization of the longest common prefixes between the strings of the list are extra elements that make the technique very efficient. Searching for a string of length m in a list of n strings takes O(m + log n) time.
This chapter presents the algorithmic and combinatorial framework in which are developed the following chapters. It first specifies the concepts and notation used to work on strings, languages, and automata. The rest is mainly devoted to the introduction of chosen data structures for implementing automata, to the presentation of combinatorial results, and to the design of elementary pattern matching techniques. This organization is based on the observation that efficient algorithms for text processing rely on one or the other of these aspects.
Section 1.2 provides some combinatorial properties of strings that occur in numerous correctness proofs of algorithms or in their performance evaluation. They are mainly periodicity results.
The formalism for the description of algorithms is presented in Section 1.3, which is especially centered on the type of algorithm presented in the book, and introduces some standard objects related to queues and automata processing.
Section 1.4 details several methods to implement automata in memory, these techniques contribute, in particular, to results of Chapters 2, 5, and 6.
The first algorithms for locating strings in texts are presented in Section 1.5. The sliding window mechanism, the notions of search automaton and of bit vectors that are described in this section are also used and improved in Chapters 2, 3, and 8, in particular.
The techniques introduced in the two previous chapters find immediate applications for the realization of the index of a text. The utility of considering the suffixes of a text for this kind of application comes from the obvious remark that every factor of a string can be extended in a suffix of the text (see Figure 6.1). By storing efficiently the suffixes, we get a kind of direct access to all the factors of the text or of a language, and this is certainly the main interest of these techniques. From this property comes quite directly an implementation of the notion of index on a text or on a family of texts, with efficient algorithms for the basic operations (Section 6.2) such as the membership problem and the computation of the positions of occurrences of a pattern. Section 6.3 gives a solution under the form of a transducer. We deduce also quite directly solutions for the detection of repetitions (Section 6.4) and for the computation of forbidden strings (Section 6.5). Section 6.6 presents an inverted application of the previous techniques by using the index of a pattern in order to help searching fro itself. This method is extended in a particularly efficient way to the search for the conjugates (or rotations) of a string.
In this chapter, we are interested in the approximate search for fixed strings. Several notions of approximation on strings are considered: jokers, differences, and mismatches.
A joker is a symbol meant to represent all the letters of the alphabet. The solutions to the problem of searching a text for a pattern containing jokers use specific methods that are described in Section 8.1.
More generally, approximate pattern matching consists in locating all the occurrences of factors inside a text y that are similar to a string x. It consists in producing the positions of the factors of y that are at distance at most k from x, for a given natural integer k. We assume in the rest that k < |x| ≤ |y|. We consider two distances for measuring the approximation: the edit distance and the Hamming distance.
The edit distance between two strings u and v, that are not necessarily of the same length, is the minimum cost of a sequence of elementary edit operations between these two strings (see Section 7.1). The method at the basis of approximate pattern matching is a natural extension of the alignment method by dynamic programming of Chapter 7. It can be improved by using a restricted notion of distance obtained by considering the minimum number of edit operations rather than the sum of their costs.
In this chapter, we address the problem of searching for a pattern in a text when the pattern represents a finite set of strings. We present solutions based on the utilization of automata. Note first that the utilization of an automaton as solution of the problem is quite natural: given a finite language X ⊆ A*, locating all the occurrences of strings belonging to X in a text y ∈ A* amounts to determine all the prefixes of y that ends with a string of X; this amounts to recognize the language A*X; and as A*X is a regular language, this can be realized by an automaton. We additionally note that such solutions particularly suit to cases where a pattern has to be located in data that have to be processed in an online way: data flow analysis, downloading, virus detection, etc.
The utilization of an automaton for locating a pattern has already been discussed in Section 1.5. We complete here the subject by specifying how to obtain the deterministic automata mentioned at the beginning of this section. Complexities of the methods exposed at the end of Section 1.5 and that are valid for nondeterministic automata are also compared with those presented in this chapter.
This chapter is devoted to the detection of local periodicities that can occur inside a string.
The method for detecting these periodicities is based on a partitioning of the suffixes that also allows to sort them in lexicographic order. The process is analogue to the one used in Chapter 4 for the preparation of the suffix array of a string and achieves the same time and space complexity, but the information on the string collected during its execution is more directly useful.
In Section 9.1, we introduce a simplified partitioning method that is adapted to different questions in the rest of the chapter. The detection of periods is dealt with immediately after in Section 9.2.
In Section 9.3, we consider squares. Their search in optimal time uses algorithms that require combinatorial properties together with the utilization of the structures of Chapter 5. We discuss also the maximal number of squares that can occur in a string, which gives upper bounds on the number of local periodicities.
Finally, in Section 9.4, we come back to the problem of lexicographically sorting the suffixes of a string and to the computation of their common prefixes. The solution presented there is another adaptation of the partitioning method; it can be used with benefit for the construction of a suffix array (Chapter 4).
Alignments constitute one of the processes commonly used to compare strings. They allow to visualize the resemblance between strings. This chapter deals with several methods that perform the comparison of two strings in this sense. The extension to comparison methods of more than two strings is delicate, leads to algorithms whose execution time is at least exponential, and is not treated here.
Alignments are based on notions of distance or of similarity between strings. The computations are usually realized by dynamic programming. A typical example used for the design of efficient methods is the computation of the longest subsequence common to two strings. It shows the algorithmic techniques that are to implement in order to obtain an efficient computation and to extend possibly to general alignments. In particular, the reduction of the memory space obtained by one of the algorithms is a strategy that can often be applied in the solutions to close problems.
After the presentation of some distances defined on strings, notions of alignment and of edit graph, Section 7.2 describes the basic techniques for the computation of the edit (or alignment) distance and the production of the associated alignments. The chosen method highlights a global resemblance between two strings using assumptions that simplify the computation.
In this chapter, we present data structures for storing the suffixes of a text. These structures are conceived for providing a direct and fast access to the factors of the text. They allow to work on the factors of the string in almost the same way as the suffix array of Chapter 4 does, but the more important part of the technique is put on the structuring of data rather than on algorithms to search the text.
The main application of these techniques is to provide the basis of an index implementation as described in Chapter 6. The direct access to the factors of a string allows a large number of other applications. In particular, the structures can be used for matching patterns by considering them as search machines (see Chapter 6).
Two types of objects are considered in this chapter, trees and automata, together with their compact versions. Trees have for effect to factorize the prefixes of the strings in the set. Automata additionally factorize their common suffixes. The structures are presented in decreasing order of size.
The representation of the suffixes of a string by a trie (Section 5.1) has the advantage to be simple but can lead to a quadratic memory space according to the length of the considered string.
The relationship between continuous-time dynamics and the corresponding discrete schemes, and its generally limited validity, is an important and widely acknowledged field within numerical analysis. In this paper, we propose another, more physical, viewpoint on this topic in order to understand the possible failure of discretisation procedures and the way to fix it. Three basic examples, the logistic equation, the Lotka–Volterra predator–prey model and Newton's law for planetary motion, are worked out. They illustrate the deep difference between continuous-time evolutions and discrete-time mappings, hence shedding some light on the more general duality between continuous descriptions of natural phenomena and discrete numerical computations.
In the domain of music for performers and electronic sounds (whether fixed or live) there are various paradigms of interaction: the performer(s) may be situated in an electroacoustic ‘environment’; there may be a primarily responsorial or ‘proliferating’ relationship; or the relationship may be closer to the traditional one between soloist and accompaniment. These paradigms preserve a relatively unproblematic dichotomy between performer, whose sound is inextricably linked to a sense of action, presence and spontaneity, and fixed or treated sound, which is more or less de-coupled from this presence. Once one tries to create a continuous, intimate relation between the two, so that one is dealing with an extension of the instrument rather than an emulated ‘other’ or environmental context, one is confronted with a fundamental difference between a sounding body whose physical properties transparently determine its sonic possibilities, and the loudspeaker, which can produce practically any sound at all. This paper interrogates this dichotomy between the fallible-corporeal and the fixed-disembodied, activating questions both about the social fact of live performance and about the compositional practices which give rise to a sense of extended instrumentality.
Suppose ƒ : X* → X* is a morphism and u,v ∈ X*. For every nonnegative integer n, let zn be the longest commonprefix of ƒn(u) and ƒn(v), and let un,vn ∈ X* be words suchthat ƒn(u) = znun and ƒn(v) = znvn. We prove that there is a positiveinteger q such that for any positive integer p, the prefixes of un(resp. vn) of length p form an ultimately periodic sequence having periodq. Further, there is a value of q which works for all words u,v ∈ X*.