This chapter attacks an easy-to-state problem, which is to select a subsequence of items uniformly at random from a given input sequence. This problem is the backbone of many randomized algorithms, and admits solutions that are algorithmically challenging to design and analyze. In particular, this chapter deals with the two cases where the size of the input sequence is either known or unknown to the algorithm, and also addresses the cases where the sequence length is smaller or larger than the internal memory of the computer. This will allow us to introduce another model of computation, the streaming model.
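As a concrete illustration of the unknown-length case in the streaming model, the following minimal sketch shows classic reservoir sampling (Algorithm R), one standard solution in which every item of the stream ends up in the sample with the same probability m/n; the chapter's own algorithms and analyses may differ in the details.

import random

def reservoir_sample(stream, m):
    """Keep a uniform random sample of m items from a stream of unknown length.
    Item number t > m replaces a random reservoir slot with probability m/t,
    so every item survives in the final sample with probability m/n."""
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= m:
            reservoir.append(item)          # fill the reservoir with the first m items
        else:
            j = random.randint(1, t)        # uniform integer in [1, t]
            if j <= m:
                reservoir[j - 1] = item     # overwrite a random reservoir slot
    return reservoir

# Example: sample 5 items from a stream whose length is not known in advance.
print(reservoir_sample(iter(range(1_000_000)), 5))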
This chapter tackles the simple problem of intersecting two (sorted) lists of increasing integers, which constitutes the backbone of every query resolver in databases and (Web) search engines. In dealing with this problem, the chapter describes several approaches of increasing sophistication and elegance, which eventually turn out to be efficient/optimal in terms of time and I/O complexities. A final solution will deploy a proper compression of the input integers and a two-level scheme aimed at reducing the final space occupancy and working efficiently over hierarchical memories.
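The least sophisticated of these approaches can be sketched as a linear merge scan over the two sorted lists, as below; the chapter's more advanced solutions (e.g. binary/doubling searches and compressed two-level schemes) improve on it, especially when the two lists have very different lengths.

def intersect_sorted(a, b):
    """Intersect two sorted lists of increasing integers with a linear merge scan,
    i.e. O(|a| + |b|) comparisons and a single sequential pass over each list."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 7, 9, 12], [3, 4, 9, 20]))   # -> [3, 9]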
This chapter describes a data compression technique devised by Mike Burrows and David Wheeler in 1994 at DEC Systems Research Center. This technique is known as the Burrows–Wheeler Transform (or BWT) and offers a revolutionary alternative to dictionary-based and statistical compressors. It is the algorithmic core of a new class of data compressors (such as bzip2), as well as of new powerful compressed indexes (such as the FM-index). The chapter describes the algorithmic details of the BWT and of two other simple compressors, Move-to-Front and Run-Length Encoding, whose combination constitutes the design core of bzip-based compressors. This description is accompanied by the theoretical analysis of the impact of BWT on data compression, in terms of the k-th order empirical entropy of the input data, and by a sketch of the main algorithmic issues that underlie the design of the first provably compressed suffix array to date, namely the FM-index. Given the technicalities involved in the description of the BWT and the FM-index, this chapter offers several running examples and illustrative figures which should ease their understanding.
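The toy sketch below illustrates two of the building blocks named above on a short string: a naive BWT obtained by sorting all rotations (real compressors derive it from suffix sorting), followed by Move-to-Front, whose small-integer output is then amenable to Run-Length Encoding and entropy coding. It only fixes the definitions and is not the chapter's engineered pipeline.

def bwt(s, terminator="\x00"):
    """Naive Burrows-Wheeler Transform: append a terminator, sort all rotations,
    and return the last column of the sorted rotation matrix."""
    s += terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s, alphabet):
    """Move-to-Front: emit the current position of each symbol in the list,
    then move that symbol to the front; runs of equal symbols become runs of zeros."""
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

text = "abracadabra"
L = bwt(text)
print(L)                                   # last column of the sorted rotations
print(move_to_front(L, sorted(set(L))))    # small integers, ready for RLE + entropy coding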
This chapter deals with a classic topic in data compression and information theory, namely the design of compressors based on the statistics of the symbols present in the text to be compressed. This topic is addressed by means of an algorithmic approach that gives much attention to the time efficiency and algorithmic properties of the discussed statistical coders, while also evaluating their space performance in terms of the empirical entropy of the input text. The chapter deals in detail with the classic Huffman coding and arithmetic coding, and also discusses their engineered versions, known as canonical Huffman coding and range coding. Its final part is dedicated to describing and commenting on the prediction by partial matching (PPM) coder, whose algorithmic structure is at the core of some of the best statistical coders to date.
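As a small illustration of the textbook Huffman construction mentioned above, the sketch below greedily merges the two least-frequent subtrees with a min-heap to obtain a prefix-free code; it is not the chapter's engineered (canonical) variant, and the tie-breaking index is just an implementation detail to keep heap entries comparable.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code from symbol frequencies: repeatedly merge the two
    least-frequent subtrees, prepending '0'/'1' to the codewords of each side."""
    heap = [(freq, idx, {sym: ""}) for idx, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                              # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    idx = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, idx, merged))
        idx += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)    # 'a' gets the shortest codeword, being the most frequent symbol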
This chapter revisits the classic sorting problem within the context of big inputs, where “Atomic” in the title refers to the fact that items occupy few memory words and are managed in their entirety by executing only comparisons. It discusses two classic sorting paradigms: the merge-based paradigm, which underlies the design of MergeSort, and the distribution-based paradigm, which underlies the design of QuickSort. It shows how to adapt them to work in a hierarchical memory setting, analyzes their I/O complexity, and finally proposes some useful algorithmic tools that allow us to speed up their execution in practice, such as the Snow-Plow technique and data compression. It also proves that these adaptations are I/O optimal in the two-level memory model by providing a sophisticated, yet very informative, lower bound. These results allow us to relate the sorting problem to the so-called permuting problem, typically neglected when dealing with sorting in the RAM model, and then to argue an interesting I/O-complexity equivalence between these two problems, which provides a mathematical ground for the ubiquitous use of sorters when designing I/O-efficient solutions for big data problems.
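The core primitive of the merge-based paradigm in external memory is the k-way merge of sorted runs, sketched below with small in-memory lists standing in for disk-resident runs; the I/O machinery, the Snow-Plow run formation, and data compression are intentionally left out of this toy example.

import heapq

def multiway_merge(runs):
    """Merge k sorted runs with a min-heap (the heart of external MergeSort):
    each item is touched O(log k) times in internal memory while every run
    is read strictly sequentially, which is what makes the approach I/O-efficient."""
    return list(heapq.merge(*runs))

# In a real external-memory sorter each run would be a file read block by block;
# here small lists stand in for the sorted runs produced in a first pass.
runs = [[1, 5, 9], [2, 3, 10], [4, 6, 7, 8]]
print(multiway_merge(runs))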
This chapter discusses the limitations incurred by the sorters of atomic items when applied to sort variable-length items (aka strings). It then introduces a simple, yet effective comparison-based lower bound, which is eventually matched by means of an elegant variant of QuickSort, named Multi-key QuickSort, properly designed to deal with strings. The structure of this string sorter will also allow us to introduce an interesting, powerful, and dynamic data structure for string indexing, the ternary search tree, which supports efficient prefix searches over a dynamic string dictionary that fits in the internal memory of a computer. The case of large string dictionaries that cannot fit in the internal memory of a computer is discussed in Chapter 9.
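A compact, non-in-place sketch of Multi-key QuickSort follows: strings are three-way partitioned on their d-th character, and only the group with an equal character advances to the next position, so each input character is inspected a constant number of times on average. Production implementations partition in place; this version only conveys the recursion structure.

def multikey_quicksort(strings, d=0):
    """Multi-key QuickSort: three-way partition by the d-th character; the
    'equal' group recurses on depth d+1, the other two groups stay at depth d."""
    if len(strings) <= 1:
        return strings
    def char(s):                      # character at depth d, with -1 as end-of-string sentinel
        return ord(s[d]) if d < len(s) else -1
    pivot = char(strings[len(strings) // 2])
    less    = [s for s in strings if char(s) < pivot]
    equal   = [s for s in strings if char(s) == pivot]
    greater = [s for s in strings if char(s) > pivot]
    if pivot >= 0:                    # strings in 'equal' share the d-th char: go deeper
        equal = multikey_quicksort(equal, d + 1)
    return multikey_quicksort(less, d) + equal + multikey_quicksort(greater, d)

print(multikey_quicksort(["banana", "band", "bandana", "apple", "ant"]))
# -> ['ant', 'apple', 'banana', 'band', 'bandana']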
The present and following chapter extend the treatment of the dictionary problem to the case of more sophisticated forms of key matching, namely prefix match and substring match between a variable-length pattern string and all strings of an input dictionary. In particular, this chapter addresses the former problem, which occurs in many real-life applications concerned, first and foremost, with key-value stores and search engines. The discussion starts with very simple array-based solutions for internal and external memory (i.e. disks), and then moves on to evaluate their time, space, and I/O complexities, which motivates the introduction of more advanced solutions for string compression (i.e. front coding and locality-preserving front coding) and data-structure design for prefix string search (i.e. compacted tries and Patricia tries). The chapter concludes with a discussion on the management of dynamic and very large string dictionaries, which leads to the description of String B-trees. As for all previous chapters, the algorithmic discussion is enriched with pseudocodes, illustrative figures, and many running examples.
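As a hint of how front coding compresses a sorted dictionary, the sketch below stores each string as the length of the prefix it shares with its predecessor plus the remaining suffix; the locality-preserving variant and the trie-based data structures discussed in the chapter are not reproduced here.

def front_code(sorted_strings):
    """Front coding of a lexicographically sorted dictionary: each string is
    encoded as (length of the prefix shared with its predecessor, remaining suffix)."""
    coded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        coded.append((lcp, s[lcp:]))
        prev = s
    return coded

def front_decode(coded):
    """Decode by copying the shared prefix from the previously decoded string."""
    out, prev = [], ""
    for lcp, suffix in coded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out

words = ["trie", "tried", "tries", "trip", "tripod"]
coded = front_code(words)
print(coded)                         # [(0, 'trie'), (4, 'd'), (4, 's'), (3, 'p'), (4, 'od')]
print(front_decode(coded) == words)  # True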
This chapter deals with the design of compressed data structures, an algorithmic field born just 30 years ago which now offers plenty of compressed solutions for most, if not all, classic data structures, such as arrays, trees, and graphs. This last chapter aims to give just an idea of these novel approaches to data structure design, by discussing the ones that we consider the most significant and fruitful from an educational point of view. A side effect of this discussion will be the introduction of the paradigm called “pointerless programming,” which waives the explicit use of pointers (and thus of integer offsets of four to eight bytes to index arbitrary items, such as strings, nodes, or edges) and instead relies on compressed data structures built upon proper binary arrays that efficiently subsume the pointers and support some interesting query operations over them efficiently, or even optimally.
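A toy sketch of the kind of machinery this paradigm relies on: a binary array endowed with a rank operation answered via precomputed per-block counters. Truly succinct rank/select structures use only o(n) extra bits and word-level popcounts; the class name and tiny block size below are illustrative assumptions, not the chapter's actual construction.

class RankBitvector:
    """Binary array with rank support: rank1(i) = number of 1s in B[0..i),
    answered by one block-counter lookup plus a short scan inside the block."""
    BLOCK = 8

    def __init__(self, bits):
        self.bits = bits
        self.block_rank = [0]                     # prefix popcounts at block boundaries
        for start in range(0, len(bits), self.BLOCK):
            self.block_rank.append(self.block_rank[-1] + sum(bits[start:start + self.BLOCK]))

    def rank1(self, i):
        b, r = divmod(i, self.BLOCK)
        return self.block_rank[b] + sum(self.bits[b * self.BLOCK : b * self.BLOCK + r])

bv = RankBitvector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
print(bv.rank1(9))   # number of 1s among the first 9 bits -> 5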
This chapter deals with the design of data structures and algorithms for the substring search problem, which occurs mainly in computational biology and textual database applications to date. Most of the chapter is devoted to describing the two main data-structure champions in this context, the suffix array and the suffix tree. Several pseudocodes and illustrative examples enrich this discussion, which is accompanied by the evaluation of time, space, and I/O complexities incurred by their construction and by the execution of some powerful query operations. In particular, the chapter deals with the efficient/optimal construction of large suffix arrays in external memory, hence describing the DC3 algorithm and the I/O-efficient scan-based algorithm proposed by Gonnet, Baeza-Yates, and Snider, and the efficient direct construction of suffix trees, via McCreight’s algorithm, or via suffix arrays and LCP arrays. It will also detail the elegant construction of this latter array in internal memory, which is fundamental for several text-mining applications, some of which are described at the end of the chapter.
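For concreteness, the naive sketch below builds a suffix array by explicitly sorting the suffixes and computes the LCP array by direct character comparison; it is quadratic in the worst case and only fixes the definitions, whereas the chapter's algorithms (DC3, the scan-based external-memory construction, and the linear-time internal-memory LCP construction) achieve the same results far more efficiently on big inputs.

def suffix_array(text):
    """Naive suffix-array construction: sort the suffix starting positions by the
    suffixes they denote (O(n^2 log n) worst case; for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """LCP[i] = length of the longest common prefix between the i-th and (i-1)-th
    lexicographically smallest suffixes, computed here by brute force
    (e.g. Kasai's algorithm does this in O(n) time)."""
    def lcp(i, j):
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

text = "banana"
sa = suffix_array(text)
print(sa)                   # [5, 3, 1, 0, 4, 2]
print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]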