Search

Contents
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp v-vi
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation

3 - String searching with a sliding window
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 102-145
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

In this chapter, we consider the problem of searching for all the occurrences of a fixed string in a text. The methods described here are based on combinatorial properties. They apply when the string and the text are in central memory or when only a part of the text is in a memory buffer. Contrary to the solutions presented in the previous chapter, the search does not process the text in a strictly sequential way.
The algorithms of the chapter scan the text through a window having the same length as the pattern length. The process that consists in determining if the content of the window matches the string is called an attempt, following the sliding window mechanism described in Section 1.5. After the end of each attempt the window is shifted toward the end of the text. The executions of these algorithms are thus successions of attempts followed by shifts.
We consider algorithms that, during each attempt, perform the comparisons between letters of the string and of the window from right to left, that is to say, in the opposite direction of the usual reading direction. These algorithms match thus suffixes of the string inside the text. The interest of this technique is that during an attempt the algorithm accumulates information on the text that is possibly processed later on.

Preface
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp vii-viii
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

This book presents a broad panorama of the algorithmic methods used for processing texts. For this reason it is a book on algorithms, but whose object is focused on the handling of texts by computers. The idea of this publication results from the observation that the rare books entirely devoted to the subject are primarily monographs of research. This is surprising because the problems of the field have been known since the development of advanced operating systems, and the need for effective solutions becomes essential because the massive use of data processing in office automation is crucial in many sectors of the society. In 1985, Galil pointed out several unsolved questions in the field, called after him, Stringology (see [12]). Most of them are still open.
In a written or vocal form, text is the only reliable vehicle of abstract concepts. Therefore, it remains the privileged support of information systems, despite of significant efforts toward the use of other media (graphic interfaces, systems of virtual reality, synthesis movies, etc.). This aspect is still reinforced by the use of knowledge databases, legal, commercial, or others, which develop on the Internet. Thanks, in particular, to the Web services.
The contents of the book carry over into formal elements and technical bases required in the fields of information retrieval, of automatic indexing for search engines, and more generally of software systems, which includes the edition, the treatment, and the compression of texts.

4 - Suffix arrays
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 146-176
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

This chapter addresses the problem of searching a fixed text. The associated data structure described here is known as the Suffix Array of the text. The searching procedure is presented first for a list of strings in Sections 4.1 and 4.2, and then adapted to a fixed text in the remaining sections.
The first three sections consider the question of searching a list of strings memorized in a table. The table is supposed to be fixed and can thus be preprocessed to speed up later accesses to it. The search for a string in a lexicon or a dictionary that can be stored in central memory of a computer is an application of this question.
We describe how to lexicographically sort the strings of the list (in maximal time proportional to the total length of the strings) in order to be able to apply a binary search algorithm. Actually, the sorting is not entirely sufficient to get an efficient search. The precomputation and the utilization of the longest common prefixes between the strings of the list are extra elements that make the technique very efficient. Searching for a string of length m in a list of n strings takes O(m + log n) time.

1 - Tools
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 1-54
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

This chapter presents the algorithmic and combinatorial framework in which are developed the following chapters. It first specifies the concepts and notation used to work on strings, languages, and automata. The rest is mainly devoted to the introduction of chosen data structures for implementing automata, to the presentation of combinatorial results, and to the design of elementary pattern matching techniques. This organization is based on the observation that efficient algorithms for text processing rely on one or the other of these aspects.
Section 1.2 provides some combinatorial properties of strings that occur in numerous correctness proofs of algorithms or in their performance evaluation. They are mainly periodicity results.
The formalism for the description of algorithms is presented in Section 1.3, which is especially centered on the type of algorithm presented in the book, and introduces some standard objects related to queues and automata processing.
Section 1.4 details several methods to implement automata in memory, these techniques contribute, in particular, to results of Chapters 2, 5, and 6.
The first algorithms for locating strings in texts are presented in Section 1.5. The sliding window mechanism, the notions of search automaton and of bit vectors that are described in this section are also used and improved in Chapters 2, 3, and 8, in particular.

Bibliography
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 364-376
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation

Index
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 377-383
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation

6 - Indexes
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 219-242
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

The techniques introduced in the two previous chapters find immediate applications for the realization of the index of a text. The utility of considering the suffixes of a text for this kind of application comes from the obvious remark that every factor of a string can be extended in a suffix of the text (see Figure 6.1). By storing efficiently the suffixes, we get a kind of direct access to all the factors of the text or of a language, and this is certainly the main interest of these techniques. From this property comes quite directly an implementation of the notion of index on a text or on a family of texts, with efficient algorithms for the basic operations (Section 6.2) such as the membership problem and the computation of the positions of occurrences of a pattern. Section 6.3 gives a solution under the form of a transducer. We deduce also quite directly solutions for the detection of repetitions (Section 6.4) and for the computation of forbidden strings (Section 6.5). Section 6.6 presents an inverted application of the previous techniques by using the index of a pattern in order to help searching fro itself. This method is extended in a particularly efficient way to the search for the conjugates (or rotations) of a string.

8 - Approximate patterns
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 287-331
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

In this chapter, we are interested in the approximate search for fixed strings. Several notions of approximation on strings are considered: jokers, differences, and mismatches.
A joker is a symbol meant to represent all the letters of the alphabet. The solutions to the problem of searching a text for a pattern containing jokers use specific methods that are described in Section 8.1.
More generally, approximate pattern matching consists in locating all the occurrences of factors inside a text y that are similar to a string x. It consists in producing the positions of the factors of y that are at distance at most k from x, for a given natural integer k. We assume in the rest that k < |x| ≤ |y|. We consider two distances for measuring the approximation: the edit distance and the Hamming distance.
The edit distance between two strings u and v, that are not necessarily of the same length, is the minimum cost of a sequence of elementary edit operations between these two strings (see Section 7.1). The method at the basis of approximate pattern matching is a natural extension of the alignment method by dynamic programming of Chapter 7. It can be improved by using a restricted notion of distance obtained by considering the minimum number of edit operations rather than the sum of their costs.

2 - Pattern matching automata
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 55-101
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

In this chapter, we address the problem of searching for a pattern in a text when the pattern represents a finite set of strings. We present solutions based on the utilization of automata. Note first that the utilization of an automaton as solution of the problem is quite natural: given a finite language X ⊆ A*, locating all the occurrences of strings belonging to X in a text y ∈ A* amounts to determine all the prefixes of y that ends with a string of X; this amounts to recognize the language A*X; and as A*X is a regular language, this can be realized by an automaton. We additionally note that such solutions particularly suit to cases where a pattern has to be located in data that have to be processed in an online way: data flow analysis, downloading, virus detection, etc.
The utilization of an automaton for locating a pattern has already been discussed in Section 1.5. We complete here the subject by specifying how to obtain the deterministic automata mentioned at the beginning of this section. Complexities of the methods exposed at the end of Section 1.5 and that are valid for nondeterministic automata are also compared with those presented in this chapter.

9 - Local periods
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 332-363
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

This chapter is devoted to the detection of local periodicities that can occur inside a string.
The method for detecting these periodicities is based on a partitioning of the suffixes that also allows to sort them in lexicographic order. The process is analogue to the one used in Chapter 4 for the preparation of the suffix array of a string and achieves the same time and space complexity, but the information on the string collected during its execution is more directly useful.
In Section 9.1, we introduce a simplified partitioning method that is adapted to different questions in the rest of the chapter. The detection of periods is dealt with immediately after in Section 9.2.
In Section 9.3, we consider squares. Their search in optimal time uses algorithms that require combinatorial properties together with the utilization of the structures of Chapter 5. We discuss also the maximal number of squares that can occur in a string, which gives upper bounds on the number of local periodicities.
Finally, in Section 9.4, we come back to the problem of lexicographically sorting the suffixes of a string and to the computation of their common prefixes. The solution presented there is another adaptation of the partitioning method; it can be used with benefit for the construction of a suffix array (Chapter 4).

7 - Alignments
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 243-286
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

Alignments constitute one of the processes commonly used to compare strings. They allow to visualize the resemblance between strings. This chapter deals with several methods that perform the comparison of two strings in this sense. The extension to comparison methods of more than two strings is delicate, leads to algorithms whose execution time is at least exponential, and is not treated here.
Alignments are based on notions of distance or of similarity between strings. The computations are usually realized by dynamic programming. A typical example used for the design of efficient methods is the computation of the longest subsequence common to two strings. It shows the algorithmic techniques that are to implement in order to obtain an efficient computation and to extend possibly to general alignments. In particular, the reduction of the memory space obtained by one of the algorithms is a strategy that can often be applied in the solutions to close problems.
After the presentation of some distances defined on strings, notions of alignment and of edit graph, Section 7.2 describes the basic techniques for the computation of the edit (or alignment) distance and the production of the associated alignments. The chosen method highlights a global resemblance between two strings using assumptions that simplify the computation.

Algorithms on Strings

Maxime Crochemore, Christophe Hancart, Thierry Lecroq
Published online:

03 October 2009

Print publication:

09 April 2007
- Book
- - Get access
    
    Buy a print copy
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
The book is intended for lectures on string processes and pattern matching in Master's courses of computer science and software engineering curricula. The details of algorithms are given with correctness proofs and complexity analysis, which make them ready to implement. Algorithms are described in a C-like language. The book is also a reference for students in computational linguistics or computational biology. It presents examples of questions related to the automatic processing of natural language, to the analysis of molecular sequences, and to the management of textual databases.

5 - Structures for indexes
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp 177-218
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Summary

In this chapter, we present data structures for storing the suffixes of a text. These structures are conceived for providing a direct and fast access to the factors of the text. They allow to work on the factors of the string in almost the same way as the suffix array of Chapter 4 does, but the more important part of the technique is put on the structuring of data rather than on algorithms to search the text.
The main application of these techniques is to provide the basis of an index implementation as described in Chapter 6. The direct access to the factors of a string allows a large number of other applications. In particular, the structures can be used for matching patterns by considering them as search machines (see Chapter 6).
Two types of objects are considered in this chapter, trees and automata, together with their compact versions. Trees have for effect to factorize the prefixes of the strings in the set. Automata additionally factorize their common suffixes. The structures are presented in decreasing order of size.
The representation of the suffixes of a string by a trie (Section 5.1) has the advantage to be simple but can lead to a quadratic memory space according to the length of the considered string.

Frontmatter
Maxime Crochemore, Christophe Hancart, Thierry Lecroq, Université de Rouen
Book:

Algorithms on Strings

Published online:

03 October 2009

Print publication:

09 April 2007, pp i-iv
- Chapter
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation

Search Results

Refine search

Refine search

Actions for selected content:

15 results

Contents

3 - String searching with a sliding window

Summary

Preface

Summary

4 - Suffix arrays

Summary

1 - Tools

Summary

Bibliography

Index

6 - Indexes

Summary

8 - Approximate patterns

Summary

2 - Pattern matching automata

Summary

9 - Local periods

Summary

7 - Alignments

Summary

Algorithms on Strings

5 - Structures for indexes

Summary

Frontmatter

Search Results

Refine search

Refine search

Actions for selected content:

Save Search

15 results

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Algorithms on Strings

Summary