Many problems can be formalized as optimization problems: among all possible solutions of a problem, find one which minimizes or maximizes a certain cost. For example, in Chapter 4 we have seen algorithms that find the shortest path between two vertices of a graph, among all possible such paths. Another example is a bipartite matching problem, in which we have to match the elements of one set with the elements of another set, assuming that each allowed match has a known cost. Among all possible ways of matching the two sets of elements, one is interested in a matching with minimum total cost.
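To make the bipartite matching example concrete, here is a minimal Python sketch (with an invented cost matrix) that computes a minimum-cost matching directly with SciPy's linear-sum-assignment routine; the reduction to network flow developed in this chapter is an alternative route to the same answer.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical costs: cost[i][j] is the cost of matching element i of the
# first set with element j of the second set.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)   # minimum-cost assignment
for i, j in zip(rows, cols):
    print(f"match {i} -- {j} at cost {cost[i, j]}")
print("total cost:", cost[rows, cols].sum())
```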
In this chapter we show that many optimization problems can be reduced (in the sense already explained in Section 2.3) to a network flow problem. This polynomially solvable problem is a powerful model, which has found a remarkable array of applications. Roughly stated, in a network flow problem one is given a transportation network, and is required to find the optimal way of sending some content through this network. Finding a shortest path is a very simple instance of a network flow problem, and, even though this is not immediately apparent, so is the bipartite matching problem.
One of the best-known network flow problems is the maximum flow problem (whose definition we recall on page 46). In this chapter we focus instead on a more general version of it, called the minimum-cost flow problem. In Section 5.1 we give an intuition of flow, and we show that, in a DAG, a flow is just a collection of paths. This basic observation will allow us to reduce the assembly problems in Chapter 15 to minimum-cost flow problems. In Sections 5.3 and 5.4 we will discuss matching and covering problems solvable in terms of minimum-cost flow, which will in turn find later applications in the book.
A pragmatic problem arising in the analysis of biological sequences is that collections of genomes and especially read sets consisting of material from many species (see Chapter 16) occupy too much space. We shall now consider how to efficiently compress such collections. Observe that we have already considered in Section 10.7.1 how to support pattern searches on top of such compressed genome collections. Another motivation for genome compression comes from the compression distance measures in Section 11.2.5.
We shall explore the powerful Lempel–Ziv compression scheme, which is especially good for repetitive genome collections, but is also competitive as a general compressor: popular compressors use variants of this scheme.
In the Lempel–Ziv parsing problem we are given a text T = t1t2 … tn and our goal is to obtain a partitioning T = T1T2 … Tp such that, for all i ∈ [1..p], Ti is the longest prefix of TiTi+1 … Tp (that is, of the part of T not yet parsed) which has an occurrence starting somewhere in T1 … Ti−1. Such strings Ti are called phrases of T. Let us denote by Li the starting position of one of those occurrences (an occurrence of Ti in T starting somewhere in T1 … Ti−1). In order to avoid the special case induced by the first occurrences of characters, we assume that the text is prefixed by a virtual string of length at most σ that contains all the characters that will appear in T. Then the string T can be encoded by storing the p pairs of integers (|T1|, L1), (|T2|, L2),…,(|Tp|, Lp). A naive encoding would use 2p log n bits of space. Decoding the string T can easily be done in time O(n), since, for every i, Li points to a previously decoded position of the string, so recovering Ti is done just by copying the |Ti| characters starting from that previous position.
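As a small illustration of the decoding step, the sketch below recovers T from the pairs (|Ti|, Li) in Python; the virtual prefix is passed explicitly, positions are taken as 0-based, and the toy text and its parse are chosen by hand.

```python
def lz_decode(prefix, phrases):
    """Recover T from its Lempel-Ziv phrases.

    prefix  -- the virtual string containing every character of T
               (prepended so that each phrase has an earlier occurrence)
    phrases -- pairs (length_i, L_i), where L_i is the 0-based starting
               position of an earlier occurrence of the i-th phrase
               inside prefix + (text decoded so far)
    """
    out = list(prefix)
    for length, src in phrases:
        # Copy character by character: the source occurrence may overlap
        # the phrase being written, which left-to-right copying handles.
        for k in range(length):
            out.append(out[src + k])
    return "".join(out[len(prefix):])   # strip the virtual prefix

# T = "ACAGACAG" over {A, C, G}, virtual prefix "ACG";
# its parse is "AC", "A", "GACAG" with source positions 0, 0, 2.
print(lz_decode("ACG", [(2, 0), (1, 0), (5, 2)]))   # -> "ACAGACAG"
```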
A full-text index for a string T = t1t2 … tn is a data structure that is built once, and that is kept in memory for answering an arbitrarily large number of queries on the position and frequency of substrings of T, with a time complexity that depends sublinearly on n. Consider for example a set R = {R1, R2, …, Rd} of reads generated by a high-throughput sequencing machine from a genome G: assuming that each Ri is an exact copy of a substring of G, a routine question in read analysis is finding the position of each Ri in G. Full-text indexes can also be used to detect common features of a set of strings, like their longest common substring: such analyses arise frequently in evolutionary studies, where substrings that occur exactly in a set of biological sequences might point to a common ancestral relationship.
The problem of matching and counting exact substrings might look artificial at first glance: for example, reads in R are not exact substrings of G in practice. In the forthcoming chapters we will present more realistic queries and analyses, which will nonetheless build upon the combinatorial properties and construction algorithms detailed here. Although this chapter focuses on the fundamental data structures and design techniques of classical indexes, the presentation will be complemented with a number of applications to immediately convey the flexibility and power of these data structures.
k-mer index
The oldest and most popular index in information retrieval for natural-language texts is the so-called inverted file. The idea is to sort in lexicographic order all the distinct words in a text T, and to precompute the set of occurrences of each word in T. A query on any word can then be answered by a binary search over the sorted list. Since biological sequences have no clear delimitation into words, we might instead use all the distinct substrings of T of a fixed length k. Such substrings are called k-mers.
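As a minimal sketch of such a k-mer index (here a dictionary from k-mers to occurrence lists, rather than the sorted-array layout of an inverted file), consider the following Python snippet; the text and the value of k are invented for the example.

```python
from collections import defaultdict

def build_kmer_index(T, k):
    """Map every k-mer of T to the sorted list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(T) - k + 1):
        index[T[i:i + k]].append(i)
    return index

T = "ACACGT"
idx = build_kmer_index(T, 2)
print(idx["AC"])                # occurrence positions of "AC": [0, 2]
print(len(idx.get("GT", [])))   # occurrence count of "GT": 1
```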
Graphs are a fundamental model for representing various relations between data. The aim of this chapter is to present some basic problems and techniques relating to graphs, mainly for finding particular paths in directed and undirected graphs. In the following chapters, we will deal with various problems in biological sequence analysis that can be reduced to one of these basic ones.
Unless stated otherwise, in this chapter we assume that the graphs have n vertices and m arcs.
Directed acyclic graphs (DAGs)
A directed graph is called acyclic if it does not contain a directed cycle; we use the shorthand DAG to denote a directed acyclic graph. DAGs are one of the most basic classes of graphs, being helpful also in many problems in bioinformatics. DAGs admit special properties that not all graphs enjoy, and some problems become simpler when restricted to DAGs.
Topological ordering
The acyclicity of the arc relation of a DAG allows us to build recursive definitions and algorithms on its vertices. In such a case, the value of a function at a vertex v of a DAG depends on the values of the function on the in-neighbors of v. In order to implement its computation, we need an ordering of the vertices of the DAG such that any vertex appears in this ordering after all its in-neighbors. Such an ordering is called a topological ordering of the DAG. See Figure 4.1 for an example.
Any DAG admits a topological ordering, but it need not be unique. Exercise 4.1 asks the reader to construct a DAG in which the number of topological orderings is exponential in the number of vertices. A consequence of the existence of a topological ordering is that every DAG has at least one source (the first vertex in a topological ordering) and at least one sink (the last vertex in a topological ordering).
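One standard linear-time way to compute a topological ordering is Kahn's algorithm, sketched below in Python on an invented adjacency-list DAG; it also detects when no ordering exists, that is, when the graph contains a cycle.

```python
from collections import deque

def topological_order(n, adj):
    """Kahn's algorithm: repeatedly output a vertex with in-degree 0.

    n   -- number of vertices, labelled 0..n-1
    adj -- adj[u] is the list of out-neighbors of u
    Returns a topological ordering, or None if the graph has a cycle.
    """
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in range(n) if indeg[u] == 0)  # current sources
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order if len(order) == n else None

# Small example DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
print(topological_order(4, [[1, 2], [3], [3], []]))  # e.g. [0, 1, 2, 3]
```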
High-throughput sequencing has recently revolutionized the field of biological sequence analysis, both by stimulating the development of fundamentally new data structures and algorithms, and by changing the routine workflow of biomedical labs. Most key analytical steps now exploit index structures based on the Burrows–Wheeler transform, which have been under active development in theoretical computer science for over ten years. The ability of these structures to scale to very large datasets quickly led to their widespread adoption by the bioinformatics community, and their flexibility continues to spur new applications in genomics, transcriptomics, and metagenomics. Despite their fast and still ongoing development, the key techniques behind these indexes are by now well understood, and they are ready to be taught in graduate-level computer science courses.
This book focuses on the rigorous description of the fundamental algorithms and data structures that power modern sequence analysis workflows, ranging from the foundations of biological sequence analysis (like alignments and hidden Markov models) and classical index structures (like k-mer indexes, suffix arrays, and suffix trees), to Burrows–Wheeler indexes and to a number of advanced omics applications built on such a basis. The topics and the computational problems are chosen to cover the actual steps of large-scale sequencing projects, including read alignment, variant calling, haplotyping, fragment assembly, alignment-free genome comparison, compression of genome collections and of read sets, transcript prediction, and analysis of metagenomic samples: see Figure 1 for a schematic summary of all the main steps and data structures covered in this book. Although strongly motivated by high-throughput sequencing, many of the algorithms and data structures described in this book are general, and can be applied to a number of other fields that require the processing of massive sets of sequences. Most of the book builds on a coherent, self-contained set of algorithmic techniques and tools, which are gradually introduced, developed, and refined from the basics to more advanced variations.
In the preceding chapters we assumed the genome sequence under study to be known. Now it is time to look at strategies for how to assemble fragments of DNA into longer contiguous blocks, and eventually into chromosomes. This chapter is partitioned into sections roughly following a plausible workflow of a de novo assembly project, namely, error correction, contig assembly, scaffolding, and gap filling. To understand the reason for splitting the problem into these realistic subproblems, we first consider the hypothetical scenario of having error-free data from a DNA fragment.
Sequencing by hybridization
Assume we have separated a single DNA strand spelling a sequence T, and managed to measure its k-mer spectrum; that is, for each k-mer W of T we have the frequency freq(W) telling us how many times it appears in T. Microarrays are a technology that provides such information. They contain a slot for each k-mer W such that fragments containing W hybridize to the several copies of the complement fragment contained in that slot. The amount of hybridization can be converted into an estimate of the frequency count freq(W) for each k-mer W. The sequencing by hybridization problem asks us to reconstruct T from this estimated k-mer spectrum.
Another way to estimate the k-mer spectrum is to use high-throughput sequencing on T; the k-mer spectrum of the reads normalized by average coverage gives such an estimate.
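A minimal sketch of this read-based estimate is given below; the reads and the average coverage value are hypothetical, and sequencing errors are ignored.

```python
from collections import Counter

def kmer_spectrum(reads, k, coverage):
    """Estimate freq(W) for every k-mer W from a set of reads.

    Raw counts are divided by the (assumed known) average coverage and
    rounded, which is only a crude estimate in the presence of errors.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {w: max(1, round(c / coverage)) for w, c in counts.items()}

reads = ["ACGTAC", "CGTACG", "GTACGT"]   # hypothetical error-free reads
print(kmer_spectrum(reads, 3, coverage=3.0))
```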
Now, assume that we have a perfect k-mer spectrum containing no errors. It turns out that one can find in linear time a sequence T′ having exactly that k-mer spectrum. If this problem has a unique solution, then of course T′ = T.
The algorithm works by solving the Eulerian path problem (see Section 4.2.1) on a de Bruijn graph (see Section 9.7) representing the k-mers.
Here we need the following expanded de Bruijn graph G = (V, E).
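Since the formal definition of the expanded de Bruijn graph is developed in the text, the following is only a hedged sketch of the underlying idea: each k-mer W contributes freq(W) parallel arcs from its (k−1)-mer prefix to its (k−1)-mer suffix, and an Eulerian path in the resulting multigraph, found here with Hierholzer's algorithm, spells a sequence with the given spectrum (assuming the spectrum is error-free and such a path exists).

```python
from collections import defaultdict

def sequence_from_spectrum(spectrum):
    """Reconstruct a string whose k-mer spectrum is `spectrum`.

    spectrum -- dict mapping each k-mer to its frequency freq(W)
    Each k-mer contributes freq(W) parallel arcs prefix -> suffix; an
    Eulerian path in this multigraph spells one valid reconstruction.
    """
    adj = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for kmer, freq in spectrum.items():
        u, v = kmer[:-1], kmer[1:]
        for _ in range(freq):
            adj[u].append(v)
        out_deg[u] += freq
        in_deg[v] += freq
    nodes = set(out_deg) | set(in_deg)
    # Start from the vertex with one more outgoing than incoming arc,
    # or from any vertex if the Eulerian path is actually a cycle.
    start = next((u for u in nodes if out_deg[u] - in_deg[u] == 1),
                 next(iter(nodes)))
    stack, path = [start], []
    while stack:                      # Hierholzer: walk unused arcs,
        u = stack[-1]                 # backtrack when stuck
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

# k = 3 spectrum of "ACGTT":
print(sequence_from_spectrum({"ACG": 1, "CGT": 1, "GTT": 1}))  # -> "ACGTT"
```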
Recall from Chapter 1 that, through transcription and alternative splicing, each gene produces different RNA transcripts. Depending on various factors, such as the tissue the cell resides in, the presence of disease, or the response to some stimuli, the RNA transcripts of a gene and the number of copies produced (their expression level) can be different.
In this chapter we assume that we have a collection of reads from all the different (copies of the) transcripts of a gene. We also assume that these reads have been aligned to the reference genome, using for example techniques from Section 10.6; in addition, Section 15.4 shows how to exploit the output of genome analysis techniques from Chapter 11 to obtain an aligner for long reads of RNA transcripts. Our final goal is to assemble the reads into the different RNA transcripts, and to estimate the expression level of each transcript. The main difficulty of this problem, which we call multi-assembly, arises from the fact that the transcripts share identical substrings.
We illustrate different scenarios, and corresponding multi-assembly formulations, stated and solved for each individual gene. In Section 15.1 we illustrate the simplest one, in which the gene's transcripts are known in advance, and we need only find their expression levels from the read alignments. In Section 15.2 we illustrate the problem of assembling the RNA reads into the different unknown transcripts, without estimating their levels of expression. In Section 15.3 we present a problem formulation for simultaneous assembly and expression estimation.
As just mentioned, in this chapter we assume that we have a reference genome, and thus that we are in a so-called genome-guided setting. De novo multi-assembly is in general a rather hard task. Thus, we prefer here to stick to genome-guided multi-assembly, which admits clean problem formulations and already illustrates the main algorithmic concepts. Nevertheless, in Insight 15.1 we briefly discuss how the least-squares method from Section 15.1 could be applied in a de novo setting.
In the previous chapter we discussed how to analyze a sequence with respect to one or more other sequences, through various alignment techniques. In this chapter we introduce a different model, one that can abstract our biological insight about the available sequences, and can provide findings about a new sequence S. For example, given the simple assumption that in prokaryote genomes the coding regions have more C and G nucleotides than A and T nucleotides, we would like to find a segmentation of S into coding and non-coding regions. We will develop this toy problem in Example 7.1.
The model introduced by this chapter is called a hidden Markov model (HMM). Even though this is a probabilistic model, which thus copes with the often empirical nature of our biological understanding, we should stress that the focus of our presentation is mainly algorithmic. We will illustrate how simple techniques such as dynamic programming, and in particular the Bellman–Ford method from Section 4.2.2, can provide elegant solutions in this probabilistic context. The exercises given at the end of this chapter derive some other probabilistic models that connect back to the problem of aligning biological sequences from the previous chapter.
Let us start by recalling that a Markov chain is a random process generating a sequence S = s1s2…sn such that the probability of generating, or emitting, each si is fixed beforehand and depends only on the previously generated symbols s1 … si−1; we will denote this probability by ℙ(si | s1 …si−1). An important subclass of Markov chains consists of those with finite history, namely those in which the probability of generating any symbol si depends only on the previous k symbols, for some constant k; that is, ℙ(si | s1 …si−1) = ℙ(si | si−k … si−1) (k is called the order of the Markov chain).
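For concreteness, here is a small Python sketch that evaluates the probability of a sequence under an order-k Markov chain; the initial distribution over the first k symbols and the transition probabilities are invented for the example.

```python
def markov_chain_probability(S, k, init_prob, trans_prob):
    """P(S) under an order-k Markov chain.

    init_prob  -- probability of the first k symbols, init_prob[S[:k]]
    trans_prob -- trans_prob[(context, symbol)] = P(symbol | last k symbols)
    """
    p = init_prob[S[:k]]
    for i in range(k, len(S)):
        p *= trans_prob[(S[i - k:i], S[i])]
    return p

# Toy order-1 chain over {A, C}; the numbers are made up:
init = {"A": 0.5, "C": 0.5}
trans = {("A", "A"): 0.7, ("A", "C"): 0.3,
         ("C", "A"): 0.4, ("C", "C"): 0.6}
print(markov_chain_probability("AACA", 1, init, trans))  # 0.5*0.7*0.3*0.4
```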
Recall the high-throughput sequencing applications from Chapter 1. For example, in whole-genome resequencing we obtain a set of reads, that is, short extracts of DNA sequences from random positions in a donor genome. Assuming that the reference genome of the species has been assembled, one can align the reads to the reference in order to obtain an approximation of the content of the donor genome. We will consider this process in Chapter 14.
We remind the reader that reads come from both strands and, in diploid organisms, from both copies of each chromosome. This means that either the read or its reverse complement should align to the reference genome, but in what follows we do not explicitly repeat this fact. Let us now focus on how to efficiently align a set of (short) reads to a large genome.
The first observation is that, if a position in the reference genome is covered by several reads, one could try to exploit multiple alignment methods from Section 6.6 to obtain an accurate alignment. This gives an enormous semi-local multiple alignment problem with the reads and the reference sequence as input, and a multiple alignment as output (where reads can have an arbitrary amount of cost-free gaps before and after). Owing to the hardness of finding an optimal multiple alignment, shown for example in Section 6.6.3, we leave this approach just as a thought experiment. However, Chapter 14 deals with an application of this idea on a smaller scale.
Let us consider an incremental approach, which consists of aligning each read independently to the reference. This approach can be justified by the fact that, if the read is long enough, then it should not align to a random position in the genome, but exactly to the unique position in the donor from which it was sequenced. This assumes that this position is at least present in the reference genome, and that it is not inside a repeat. See Insight 10.1 for a more detailed discussion of this matter.
We shall now explore how the techniques from the previous chapters can be used in studying the genome of a species. We assume here that we have available the so-called finalized genome assembly of a species. Ideally, this means the complete sequence of each chromosome in the species. However, in practice, this consists of chromosome sequences containing large unknown substrings, and some unmapped contigs/scaffolds.
To fix our mindset, we assume that the species under consideration is a diploid organism (like a human). We also assume that the assembled genome is concatenated into one long sequence T with some necessary markers added to separate its contents, and with some auxiliary bookkeeping data structures helpful for mapping back to the chromosome representation.
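A minimal sketch of this bookkeeping (with an invented separator symbol and toy chromosome sequences) is to record the starting offset of each chromosome in T, so that a position in T can be mapped back to a chromosome coordinate by binary search:

```python
import bisect

def concatenate(chromosomes, sep="#"):
    """Concatenate chromosome sequences into one text T.

    Returns T and the list of starting offsets of each chromosome in T.
    """
    offsets, parts, pos = [], [], 0
    for chrom in chromosomes:
        offsets.append(pos)
        parts.append(chrom + sep)
        pos += len(chrom) + 1
    return "".join(parts), offsets

def map_back(position, offsets):
    """Map a position in T to (chromosome index, position inside it)."""
    i = bisect.bisect_right(offsets, position) - 1
    return i, position - offsets[i]

T, offsets = concatenate(["ACGT", "GGA"])   # T = "ACGT#GGA#"
print(map_back(6, offsets))                 # -> (1, 1): second symbol of "GGA"
```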
We start with a peak detection problem in Insight 14.1 that gives a direct connection to the segmentation problem motivating our study of hidden Markov models in Chapter 7. Then we proceed to variant calling, where we mostly formulate new problems assuming read alignments as input. The results of variant calling and read alignments are then given as the input to haplotype assembly.
Insight 14.1 Peak detection and HMMs
The coverage profile of the genome is an array storing for each position the number of reads aligned so as to cover that position. In ChIP-sequencing and bisulfite sequencing only some parts of the genome should be covered by reads, so that clear peak areas should be noticeable in the coverage profile. In targeted resequencing, the areas of interest are known beforehand, so automatic detection of peak areas is not that relevant (although usually the targeting is not quite that accurate). Peak detection from a signal is a classical signal processing task, so there are many existing general-purpose peak detectors available. In order to use the machinery we developed earlier, let us consider how HMMs could be used for our peak detection task.
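As a concrete preliminary step, the coverage profile itself can be computed from the read alignments with a difference array; the sketch below uses hypothetical 0-based, half-open alignment intervals.

```python
def coverage_profile(genome_length, alignments):
    """Number of reads covering each genome position.

    alignments -- list of (start, end) intervals, 0-based, end exclusive.
    Uses a difference array: +1 at each start, -1 at each end, then a
    prefix sum gives the coverage at every position.
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    profile, running = [], 0
    for d in diff[:genome_length]:
        running += d
        profile.append(running)
    return profile

# Hypothetical read alignments on a genome of length 10:
print(coverage_profile(10, [(0, 4), (2, 6), (5, 9)]))
# -> [1, 1, 2, 2, 1, 2, 1, 1, 1, 0]
```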