We explore the ‘Hausdorff dimension at infinity’ for self-affine carpets defined on the square lattice. This notion of dimension (due to Barlow and Taylor), which is the correct notion from a probabilistic perspective, differs for these sets from more ‘naive’ indices of fractal dimension.
Certain convergent search algorithms can be turned into chaotic dynamic systems by renormalisation back to a standard region at each iteration. This allows the machinery of ergodic theory to be used for a new probabilistic analysis of their behaviour. Rates of convergence can be redefined in terms of various entropies and ergodic characteristics (Kolmogorov and Rényi entropies and Lyapunov exponent). A special class of line-search algorithms, which contains the Golden-Section algorithm, is studied in detail. Their associated dynamic systems exhibit a Markov partition property, from which invariant measures and ergodic characteristics can be computed. A case is made that the Rényi entropy is the most appropriate convergence criterion in this environment.
Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.
For example, if P = aba and T = bbabaxababay then P occurs in T starting at locations 3, 7, and 9. Note that two occurrences of P may overlap, as illustrated by the occurrences of P at locations 7 and 9.
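The example can be checked directly with a short brute-force scan (the function name and the 1-based position convention are mine):

```python
def occurrences(P, T):
    """Return every 1-based start position of P in T, overlaps included."""
    return [i + 1 for i in range(len(T) - len(P) + 1) if T[i:i + len(P)] == P]

print(occurrences("aba", "bbabaxababay"))  # → [3, 7, 9]
```

Note that the alignment at position 5 fails (T[5..7] = axa), while the overlapping occurrences at 7 and 9 are both reported.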
Importance of the exact matching problem
The practical importance of the exact matching problem should be obvious to anyone who uses a computer. The problem arises in widely varying applications, too numerous to even list completely. Some of the more common applications are in word processors; in utilities such as grep on Unix; in textual information retrieval programs such as Medline, Lexis, or Nexis; in library catalog searching programs that have replaced physical card catalogs in most large libraries; in internet browsers and crawlers, which sift through massive amounts of text available on the internet for material containing specific keywords; in internet news readers that can search the articles for topics of interest; in the giant digital libraries that are being planned for the near future; in electronic journals that are already being “published” on-line; in telephone directory assistance; in on-line encyclopedias and other educational CD-ROM applications; in on-line dictionaries and thesauri, especially those with cross-referencing features (the Oxford English Dictionary project has created an electronic on-line version of the OED containing 50 million words); and in numerous specialized databases.
All of the exact matching methods in the first three chapters, as well as most of the methods that have yet to be discussed in this book, are examples of comparison-based methods. The main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods therefore have a very different flavor than the comparison-based approaches, even though one can sometimes see character comparisons hidden at the inner level of these “seminumerical” methods. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching; the use of the Fast Fourier Transform in string matching; and the random fingerprint method of Karp and Rabin.
The Shift-And method
R. Baeza-Yates and G. Gonnet [35] devised a simple, bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (the length of a typical English word for example). They call this method the Shift-Or method, but it seems more natural to call it Shift-And. Recall that pattern P is of size n and the text T is of size m.
Definition Let M be an n by m + 1 binary valued array, with index i running from 1 to n and index j running from 0 to m; the extra column, j = 0, serves only to initialize the computation.
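A sketch of the Shift-And computation, packing each column of M into the bits of one integer (the names and the end-position reporting convention are mine; the method assumes n fits in a machine word, although Python's unbounded integers remove even that restriction):

```python
def shift_and(P, T):
    """Shift-And exact matching: return 1-based END positions of matches."""
    n = len(P)
    # B[c] has bit i set iff P[i] == c (a precomputed mask per character)
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    D, hits = 0, []          # D encodes column j of M; D = 0 is column j = 0
    for j, c in enumerate(T, start=1):
        # shift in a 1 (a fresh prefix may start here), then keep only the
        # rows whose pattern character matches the current text character
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (n - 1)):
            hits.append(j)   # full pattern matched, ending at position j
    return hits
```

For P = aba and T = bbabaxababay this reports end positions 5, 9, and 11, i.e. the start positions 3, 7, and 9 found earlier.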
In this chapter we look in detail at alignment problems in the more complex contexts typical of string problems that currently arise in computational molecular biology. These more complex problems require techniques that extend (rather than refine) the core alignment methods.
Parametric sequence alignment
Introduction
When using sequence alignment methods to study DNA or amino acid sequences, there is often considerable disagreement about how to weight matches, mismatches, insertions and deletions (indels), and gaps. The most commonly used alignment software packages require the user to specify fixed values for those parameters, and it is widely observed that the biological significance of the resulting alignment can be greatly affected by the choice of parameter settings. The following relates to alignments of proteins from the globin family and is representative of frequently seen comments in the biological literature:
…one must be able to vary the gap and gap size penalties independently and in a query dependent fashion in order to obtain the maximal sensitivity of the search.
[81]
A similar comment appears in [432]:
Sequence alignment is sensitive to the choices of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these …
Finally, from [446],
One of the most prominent problems is the choice of parametric values, especially gap penalties. When very similar sequences are compared, the choice is not critical; but when the conservation is low, the resulting alignment is strongly affected.
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer–Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.
The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a preprocessing stage; thereafter, whenever a string S of length n is input, the algorithm searches for it in O(n) time using that suffix tree.
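A toy sketch of the query interface, using an uncompressed suffix trie rather than a true suffix tree (the class and method names are mine). This naive structure takes O(m^2) time and space to build; a real suffix tree compresses unary paths and is built in O(m) time, e.g. by Ukkonen's algorithm, but the O(n) substring query looks the same:

```python
class SuffixTrie:
    """Uncompressed suffix trie: O(m^2) build, O(n) substring queries.
    Illustrates only the query interface of a suffix tree."""

    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):      # insert every suffix text[i:]
            node = self.root
            for c in text[i:]:
                node = node.setdefault(c, {})

    def contains(self, s):
        """Is s a substring of the text? O(|s|), independent of |text|."""
        node = self.root
        for c in s:
            if c not in node:
                return False
            node = node[c]
        return True
```

Since every substring of T is a prefix of some suffix of T, walking s down from the root answers the query in time proportional to |s| alone.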
String search, edit, and alignment tools have been extensively used in studies of molecular evolution. However, their use has primarily been aimed at comparing strings representing single genes or single proteins. For example, evolutionary studies have usually selected a single protein and have examined how the amino acid sequence for that protein differs in different species. Accordingly, string edit and alignment algorithms have been guided by objective functions that model the most common types of mutations occurring at the level of a single gene or protein: point mutations or amino acid substitutions, single character insertions and deletions, and block insertions and deletions (gaps).
Recently, attention has been given to mutations that occur on a scale much larger than the single gene. These mutations occur at the chromosome or at the genome level and are central in the evolution of the whole genome. These larger-scale mutations have features that can be quite different from gene- or protein-level mutations. With more genome-level molecular data becoming available, larger-scale string comparisons may give insights into evolution that are not seen at the single gene or protein level.
The guiding force behind genome evolution is “duplication with modification” [126, 128, 301, 468]. That is, parts of the genome are duplicated, possibly very far away from the original site, and then modified. Other genome-level mutations of importance include inversions, where a segment of DNA is reversed; translocations, where the ends of two chromosomes (telomeres) are exchanged; and transpositions, where two adjacent segments of DNA exchange places.
Almost all discussions of exact matching begin with the naive method, and we follow this tradition. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted, in which case an occurrence of P is reported. In either case, P is then shifted one place to the right, and the comparisons are restarted from the left end of P. This process repeats until the right end of P shifts past the right end of T.
Using n to denote the length of P and m to denote the length of T, the worst-case number of comparisons made by this method is Θ(nm). In particular, if both P and T consist of the same repeated character, then there is an occurrence of P at each of the first m − n + 1 positions of T and the method performs exactly n(m − n + 1) comparisons. For example, if P = aaa and T = aaaaaaaaaa then n = 3, m = 10, and 24 comparisons are made.
The naive method is certainly simple to understand and program, but its worst-case running time of Θ(nm) may be unsatisfactory and can be improved. Even the practical running time of the naive method may be too slow for larger texts and patterns.
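The naive method and its worst-case count can be sketched as follows (the function name and the returned comparison counter are mine):

```python
def naive_count(P, T):
    """Naive left-to-right matching.
    Returns (1-based occurrence positions, number of character comparisons)."""
    n, m = len(P), len(T)
    occ, comps = [], 0
    for s in range(m - n + 1):      # each left-to-right alignment of P with T
        k = 0
        while k < n:
            comps += 1
            if P[k] != T[s + k]:
                break               # mismatch: shift P one place right
            k += 1
        if k == n:
            occ.append(s + 1)       # P exhausted: report an occurrence
    return occ, comps
```

On P = aaa and T = aaaaaaaaaa this performs exactly n(m − n + 1) = 3 · 8 = 24 comparisons and reports occurrences at the first eight positions, matching the worst-case analysis above.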
In this book I have tried to present fundamental ideas, algorithms, and techniques that have a wide range of application and that will likely remain important even as the present-day interests change. I have also tried to explain the fundamental reasons why computations on strings and sequences are productive in biology and will remain important even as the specific applications change. But with only 500 pages (a mere 285,639 words formed from 1,784,996 characters), there are certain algorithmic methods and certain present and anticipated applications that I could not cover.
Additional techniques
For additional pure computer science results on exact matching, the reader is referred to Text Algorithms by M. Crochemore and W. Rytter [117]. That book goes more deeply into several pure computer science issues, such as periodicities in strings and parallel algorithms. For a survey of many string searching algorithms and inexact matching methods, see String Searching Algorithms by G. Stephen [421]. For additional topics in computational molecular biology, particularly probabilistic and statistical questions about strings and sequences, see An Introduction to Computational Biology by M. Waterman [461]. For another introduction to combinatorial and string problems in computational molecular biology, see Introduction to Computational Molecular Biology, by J. Setubal and J. Meidanis [402]. For topics in computational molecular biology more focused on issues of protein structure, see the chapter Computational Molecular Biology by A. Lesk in [297].
A Boyer–Moore variant with a “simple” linear time bound
Apostolico and Giancarlo [26] suggested a variant of the Boyer–Moore algorithm that allows a fairly simple proof of linear worst-case running time. With this variant, no character of T will ever be compared after it is first matched with any character of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch; there can only be m mismatches since each one results in a nonzero shift of P; and there can only be m matches since no character of T is compared again after it matches a character of P. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m.
Given the history of very difficult and partial analyses of the Boyer–Moore algorithm, it is quite amazing that a close variant of the algorithm allows a simple linear time bound. We present here a further improvement of the Apostolico–Giancarlo idea, resulting in an algorithm that simulates exactly the shifts of the Boyer–Moore algorithm. The method therefore has all the rapid shifting advantages of the Boyer–Moore method as well as a simple linear worst-case time analysis.
Key ideas
Our version of the Apostolico–Giancarlo algorithm simulates the Boyer–Moore algorithm, finding exactly the same mismatches that Boyer–Moore would find and making exactly the same shifts.
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact (approximate) matching and sequence alignment. “Approximate” means that some errors, of various types detailed later, are acceptable in valid matches. “Alignment” will be given a precise meaning later, but generally means lining up characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces made in opposing strings.
We also shift from problems primarily concerning substrings to problems concerning subsequences. A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be. For example, the string xyz is a subsequence, but not a substring, in axayaz. The shift from substrings to subsequences is a natural corollary of the shift from exact to inexact matching. This shift of focus to inexact matching and subsequence comparison is accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
Much of computational biology concerns sequence alignments
The area of approximate matching and sequence comparison is central in computational molecular biology both because of the presence of errors in molecular data and because of active mutational processes that (sub)sequence comparison methods seek to model and reveal.
In this chapter we look at a number of important refinements that have been developed for certain core string edit and alignment problems. These refinements either speed up a dynamic programming solution, reduce its space requirements, or extend its utility.
Computing alignments in only linear space
One of the defects of dynamic programming for all the problems we have discussed is that the dynamic programming tables use Θ(nm) space when the input strings have length n and m. (When we talk about the space used by a method, we refer to the maximum space ever in use simultaneously. Reused space does not add to the count of space use.) It is quite common that the limiting resource in string alignment problems is not time but space. That limit makes it difficult to handle large strings, no matter how long we may be willing to wait for the computation to finish. Therefore, it is very valuable to have methods that reduce the use of space without dramatically increasing the time requirements.
Hirschberg [224] developed an elegant and practical space-reduction method that works for many dynamic programming problems. For several string alignment problems, this method reduces the required space from Θ(nm) to O(n) (for n < m) while only doubling the worst-case time bound. Miller and Myers expanded on the idea and brought it to the attention of the computational biology community [344].
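The core of the space reduction is the observation that each row of the dynamic programming table depends only on the previous row, so the optimal score can be computed in two rows. A sketch of that score-only scan (names are mine; Hirschberg's divide-and-conquer then reuses it to recover an optimal alignment, not just its score, in linear space at roughly twice the work):

```python
def edit_distance(A, B):
    """Edit distance in O(len(A)) space rather than Theta(nm)."""
    n = len(A)
    prev = list(range(n + 1))            # row for the empty prefix of B
    for j, b in enumerate(B, 1):
        curr = [j] + [0] * n
        for i, a in enumerate(A, 1):
            curr[i] = min(prev[i] + 1,               # delete A[i]
                          curr[i - 1] + 1,           # insert b
                          prev[i - 1] + (a != b))    # match or substitute
        prev = curr                      # discard the older row: 2 rows live
    return prev[n]
```

Only two rows are ever in use simultaneously, which is exactly the "maximum space ever in use" accounting described above.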
We now begin the discussion of an amazing result that greatly extends the usefulness of suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v. With this definition a node is an ancestor of itself. A proper ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5 while the lca of 6 and 3 is 1.
The amazing result is that after a linear amount of preprocessing of a rooted tree, any two nodes can then be specified and their lowest common ancestor found in constant time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter any lowest common ancestor query takes only constant time to solve, independent of n. Without preprocessing, the best worst-case time bound for a single query is Θ(n), so this is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan [214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on the latter approach.
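As a sketch of the preprocess-then-query interface (the class, the example tree, and the node labels are mine), here is the simpler Euler-tour scheme: O(n log n) preprocessing with O(1) queries via a sparse table for range-minimum queries. The Harel–Tarjan and Schieber–Vishkin results sharpen the preprocessing to O(n), but the query contract is the same:

```python
class LCA:
    """O(1) lca queries after O(n log n) preprocessing (Euler tour + RMQ)."""

    def __init__(self, children, root):
        # children: {node: [child, ...]}; an Euler tour records each node
        # every time the walk visits it, together with its depth
        self.euler, self.depth, self.first = [], [], {}

        def tour(u, d):
            self.first.setdefault(u, len(self.euler))
            self.euler.append(u); self.depth.append(d)
            for v in children.get(u, []):
                tour(v, d + 1)
                self.euler.append(u); self.depth.append(d)

        tour(root, 0)
        # sparse table: table[k][i] = index of min depth in euler[i : i + 2**k]
        n = len(self.euler)
        self.table = [list(range(n))]
        k = 1
        while (1 << k) <= n:
            prev, row = self.table[-1], []
            for i in range(n - (1 << k) + 1):
                a, b = prev[i], prev[i + (1 << (k - 1))]
                row.append(a if self.depth[a] <= self.depth[b] else b)
            self.table.append(row)
            k += 1

    def query(self, x, y):
        """lca(x, y): shallowest node between their first tour occurrences."""
        i, j = sorted((self.first[x], self.first[y]))
        k = (j - i + 1).bit_length() - 1
        a = self.table[k][i]
        b = self.table[k][j - (1 << k) + 1]
        return self.euler[a if self.depth[a] <= self.depth[b] else b]
```

For instance, with the (hypothetical) tree {1: [2, 5], 2: [3, 4], 5: [6, 7]} rooted at 1, the lca of 6 and 7 is 5, and the lca of 4 and 6 is the root 1.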
This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty. Some of the original preprocessing methods are quite difficult. This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer–Moore method over the Knuth-Morris-Pratt method, since Boyer–Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer–Moore does. These two topics will be described in this chapter and the next.
The Boyer–Moore Algorithm
As in the naive algorithm, the Boyer–Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. Further, after the check is complete, P is shifted right relative to T just as in the naive algorithm. However, the Boyer–Moore algorithm contains three clever ideas not contained in the naive algorithm: the right-to-left scan, the bad character shift rule, and the good suffix shift rule.
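The first two ideas can be sketched in isolation (names are mine): a right-to-left scan combined with the simplified bad character rule, which shifts P so that the rightmost occurrence in P of the mismatched text character lines up under it. The full Boyer–Moore algorithm adds the good suffix rule on top of this:

```python
def bm_bad_char(P, T):
    """Right-to-left scan with the simplified bad character rule only.
    Returns 1-based occurrence positions."""
    n, m = len(P), len(T)
    # rightmost 1-based position in P of each character; 0 if absent
    right = {c: i + 1 for i, c in enumerate(P)}
    occ, s = [], 0                   # s = offset of P's left end in T
    while s <= m - n:
        k = n - 1                    # compare right to left
        while k >= 0 and P[k] == T[s + k]:
            k -= 1
        if k < 0:
            occ.append(s + 1)        # all of P matched
            s += 1
        else:
            # align the rightmost occurrence in P of the bad character
            # T[s + k] under the mismatch; always shift at least 1
            s += max(1, (k + 1) - right.get(T[s + k], 0))
    return occ
```

On P = aba and T = bbabaxababay, the mismatch against x (which does not occur in P) lets the pattern jump its full length in one step, illustrating why the scan can skip large stretches of T in practice.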
In this chapter we consider the inexact matching and alignment problems that form the core of the field, along with others that illustrate the most general techniques. Some of those problems and techniques will be further refined and extended in the next chapters. We start with a detailed examination of the most classic inexact matching problem solved by dynamic programming, the edit distance problem. The motivation for inexact matching (and, more generally, sequence comparison) in molecular biology will be a recurring theme explored throughout the rest of the book. We will discuss many specific examples of how string comparison and inexact matching are used in current molecular biology. However, to begin, we concentrate on the purely formal and technical aspects of defining and computing inexact matching.
The edit distance between two strings
Frequently, one wants a measure of the difference or distance between two strings (for example, in evolutionary, structural, or functional studies of biological strings; in textual database retrieval; or in spelling correction methods). There are several ways to formalize the notion of distance between strings. One common, and simple, formalization [389, 299], called edit distance, focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters. The permitted edit operations are insertion of a character into the first string, the deletion of a character from the first string, or the substitution (or replacement) of a character in the first string with a character in the second string.
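A sketch of the classic dynamic program for edit distance, with a traceback that reports one optimal edit transcript (the function name and the M/R/I/D operation labels are mine):

```python
def edit_transcript(S1, S2):
    """Edit distance between S1 and S2 plus one optimal edit transcript:
    a string over M (match), R (replace), I (insert), D (delete)."""
    n, m = len(S1), len(S2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i      # delete all of S1's prefix
    for j in range(m + 1): D[0][j] = j      # insert all of S2's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                          # delete
                          D[i][j - 1] + 1,                          # insert
                          D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]))
    # trace back from (n, m) to recover one optimal transcript
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]):
            ops.append('M' if S1[i - 1] == S2[j - 1] else 'R'); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('D'); i -= 1
        else:
            ops.append('I'); j -= 1
    return D[n][m], ''.join(reversed(ops))
```

The table uses Θ(nm) time and space; the non-M operations in the transcript are exactly the edit operations, so their count equals the reported distance.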