Many problems can be formalized as optimization problems: among all possible solutions of a problem, find one which minimizes or maximizes a certain cost. For example, in Chapter 4 we have seen algorithms that find the shortest path between two vertices of a graph, among all possible such paths. Another example is a bipartite matching problem, in which we have to match the elements of one set with the elements of another set, assuming that each allowed match has a known cost. Among all possible ways of matching the two sets of elements, one is interested in a matching with minimum total cost.
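To make the bipartite matching example concrete, here is a minimal Python sketch (with an invented cost matrix) that computes a minimum-cost matching directly with SciPy's linear-sum-assignment routine; the reduction to network flow developed in this chapter is an alternative route to the same answer.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical costs: cost[i][j] is the cost of matching element i of the
# first set with element j of the second set.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)   # minimum-cost assignment
for i, j in zip(rows, cols):
    print(f"match {i} -- {j} at cost {cost[i, j]}")
print("total cost:", cost[rows, cols].sum())
```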
In this chapter we show that many optimization problems can be reduced (in the sense already explained in Section 2.3) to a network flow problem. This polynomially solvable problem is a powerful model, which has found a remarkable array of applications. Roughly stated, in a network flow problem one is given a transportation network, and is required to find the optimal way of sending some content through this network. Finding a shortest path is a very simple instance of a network flow problem, and, even though this is not immediately apparent, so is the bipartite matching problem.
One of the best-known network flow problems is the maximum flow problem (whose definition we recall on page 46). In this chapter we focus instead on a more general version of it, called the minimum-cost flow problem. In Section 5.1 we give an intuition of flow, and we show that, in a DAG, a flow is just a collection of paths. This basic observation will allow us to reduce the assembly problems in Chapter 15 to minimum-cost flow problems. In Sections 5.3 and 5.4 we will discuss matching and covering problems solvable in terms of minimum-cost flow, which will in turn find later applications in the book.
A pragmatic problem arising in the analysis of biological sequences is that collections of genomes and especially read sets consisting of material from many species (see Chapter 16) occupy too much space. We shall now consider how to efficiently compress such collections. Observe that we have already considered in Section 10.7.1 how to support pattern searches on top of such compressed genome collections. Another motivation for genome compression comes from the compression distance measures in Section 11.2.5.
We shall explore the powerful Lempel–Ziv compression scheme, which is especially good for repetitive genome collections, but is also competitive as a general compressor: popular compressors use variants of this scheme.
In the Lempel–Ziv parsing problem we are given a text T = t1t2 … tn and our goal is to obtain a partitioning T = T1T2 … Tp such that, for all i ∈ [1..p], Ti is the longest prefix of TiTi+1 … Tp (that is, of the part of T not yet parsed) which has an occurrence starting somewhere in T1 … Ti−1. Such strings Ti are called phrases of T. Let us denote by Li the starting position of one of those occurrences (an occurrence of Ti in T starting somewhere in T1 … Ti−1). In order to avoid the special case induced by the first occurrences of characters, we assume that the text is prefixed by a virtual string of length at most σ that contains all the characters that will appear in T. Then the string T can be encoded by storing the p pairs of integers (|T1|, L1), (|T2|, L2),…,(|Tp|, Lp). A naive encoding would use 2p log n bits of space. Decoding the string T can easily be done in time O(n), since, for every i, Li points to a previously decoded position of the string, so recovering Ti is done just by copying the |Ti| characters starting from that previous position.
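As a small illustration of the decoding step, the sketch below recovers T from the pairs (|Ti|, Li) in Python; the virtual prefix is passed explicitly, positions are taken as 0-based, and the toy text and its parse are chosen by hand.

```python
def lz_decode(prefix, phrases):
    """Recover T from its Lempel-Ziv phrases.

    prefix  -- the virtual string containing every character of T
               (prepended so that each phrase has an earlier occurrence)
    phrases -- pairs (length_i, L_i), where L_i is the 0-based starting
               position of an earlier occurrence of the i-th phrase
               inside prefix + (text decoded so far)
    """
    out = list(prefix)
    for length, src in phrases:
        # Copy character by character: the source occurrence may overlap
        # the phrase being written, which left-to-right copying handles.
        for k in range(length):
            out.append(out[src + k])
    return "".join(out[len(prefix):])   # strip the virtual prefix

# T = "ACAGACAG" over {A, C, G}, virtual prefix "ACG";
# its parse is "AC", "A", "GACAG" with source positions 0, 0, 2.
print(lz_decode("ACG", [(2, 0), (1, 0), (5, 2)]))   # -> "ACAGACAG"
```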
A full-text index for a string T = t1t2 … tn is a data structure that is built once, and that is kept in memory for answering an arbitrarily large number of queries on the position and frequency of substrings of T, with a time complexity that depends sublinearly on n. Consider for example a set R = {R1, R2, …, Rd} of reads generated by a high-throughput sequencing machine from a genome G: assuming that each Ri is an exact copy of a substring of G, a routine question in read analysis is finding the position of each Ri in G. Full-text indexes can also be used to detect common features of a set of strings, like their longest common substring: such analyses arise frequently in evolutionary studies, where substrings that occur exactly in a set of biological sequences might point to a common ancestral relationship.
The problem of matching and counting exact substrings might look artificial at first glance: for example, reads in R are not exact substrings of G in practice. In the forthcoming chapters we will present more realistic queries and analyses, which will nonetheless build upon the combinatorial properties and construction algorithms detailed here. Although this chapter focuses on the fundamental data structures and design techniques of classical indexes, the presentation will be complemented with a number of applications to immediately convey the flexibility and power of these data structures.
k-mer index
The oldest and most popular index in information retrieval for natural-language texts is the so-called inverted file. The idea is to sort in lexicographic order all the distinct words in a text T, and to precompute the set of occurrences of each word in T. A query on any word can then be answered by a binary search over the sorted list. Since biological sequences have no clear delimitation into words, we might instead use all the distinct substrings of T of a fixed length k. Such substrings are called k-mers.
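As a minimal sketch of such a k-mer index (here a dictionary from k-mers to occurrence lists, rather than the sorted-array layout of an inverted file), consider the following Python snippet; the text and the value of k are invented for the example.

```python
from collections import defaultdict

def build_kmer_index(T, k):
    """Map every k-mer of T to the sorted list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(T) - k + 1):
        index[T[i:i + k]].append(i)
    return index

T = "ACACGT"
idx = build_kmer_index(T, 2)
print(idx["AC"])                # occurrence positions of "AC": [0, 2]
print(len(idx.get("GT", [])))   # occurrence count of "GT": 1
```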
Graphs are a fundamental model for representing various relations between data. The aim of this chapter is to present some basic problems and techniques relating to graphs, mainly for finding particular paths in directed and undirected graphs. In the following chapters, we will deal with various problems in biological sequence analysis that can be reduced to one of these basic ones.
Unless stated otherwise, in this chapter we assume that the graphs have n vertices and m arcs.
Directed acyclic graphs (DAGs)
A directed graph is called acyclic if it does not contain a directed cycle; we use the shorthand DAG to denote a directed acyclic graph. DAGs are one of the most basic classes of graphs, being helpful also in many problems in bioinformatics. DAGs admit special properties that not all graphs enjoy, and some problems become simpler when restricted to DAGs.
Topological ordering
The acyclicity of the arc relation of a DAG allows us to build recursive definitions and algorithms on its vertices. In such a case, the value of a function at a vertex v of a DAG depends on the values of the function on the in-neighbors of v. In order to implement its computation, we need an ordering of the vertices of the DAG such that any vertex appears in this ordering after all its in-neighbors. Such an ordering is called a topological ordering of the DAG. See Figure 4.1 for an example.
Any DAG admits a topological ordering, but it need not be unique. Exercise 4.1 asks the reader to construct a DAG in which the number of topological orderings is exponential in the number of vertices. A consequence of the existence of a topological ordering is that every DAG has at least one source (the first vertex in a topological ordering) and at least one sink (the last vertex in a topological ordering).
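One standard linear-time way to compute a topological ordering is Kahn's algorithm, sketched below in Python on an invented adjacency-list DAG; it also detects when no ordering exists, that is, when the graph contains a cycle.

```python
from collections import deque

def topological_order(n, adj):
    """Kahn's algorithm: repeatedly output a vertex with in-degree 0.

    n   -- number of vertices, labelled 0..n-1
    adj -- adj[u] is the list of out-neighbors of u
    Returns a topological ordering, or None if the graph has a cycle.
    """
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in range(n) if indeg[u] == 0)  # current sources
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order if len(order) == n else None

# Small example DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
print(topological_order(4, [[1, 2], [3], [3], []]))  # e.g. [0, 1, 2, 3]
```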
High-throughput sequencing has recently revolutionized the field of biological sequence analysis, both by stimulating the development of fundamentally new data structures and algorithms, and by changing the routine workflow of biomedical labs. Most key analytical steps now exploit index structures based on the Burrows–Wheeler transform, which have been under active development in theoretical computer science for over ten years. The ability of these structures to scale to very large datasets quickly led to their widespread adoption by the bioinformatics community, and their flexibility continues to spur new applications in genomics, transcriptomics, and metagenomics. Despite their fast and still ongoing development, the key techniques behind these indexes are by now well understood, and they are ready to be taught in graduate-level computer science courses.
This book focuses on the rigorous description of the fundamental algorithms and data structures that power modern sequence analysis workflows, ranging from the foundations of biological sequence analysis (like alignments and hidden Markov models) and classical index structures (like k-mer indexes, suffix arrays, and suffix trees), to Burrows–Wheeler indexes and to a number of advanced omics applications built on such a basis. The topics and the computational problems are chosen to cover the actual steps of large-scale sequencing projects, including read alignment, variant calling, haplotyping, fragment assembly, alignment-free genome comparison, compression of genome collections and of read sets, transcript prediction, and analysis of metagenomic samples: see Figure 1 for a schematic summary of all the main steps and data structures covered in this book. Although strongly motivated by high-throughput sequencing, many of the algorithms and data structures described in this book are general, and can be applied to a number of other fields that require the processing of massive sets of sequences. Most of the book builds on a coherent, self-contained set of algorithmic techniques and tools, which are gradually introduced, developed, and refined from the basics to more advanced variations.
In the preceding chapters we assumed the genome sequence under study to be known. Now it is time to look at strategies for how to assemble fragments of DNA into longer contiguous blocks, and eventually into chromosomes. This chapter is partitioned into sections roughly following a plausible workflow of a de novo assembly project, namely, error correction, contig assembly, scaffolding, and gap filling. To understand the reason for splitting the problem into these realistic subproblems, we first consider the hypothetical scenario of having error-free data from a DNA fragment.
Sequencing by hybridization
Assume we have separated a single DNA strand spelling a sequence T, and managed to measure its k-mer spectrum; that is, for each k-mer W of T we have the frequency freq(W) telling us how many times it appears in T. Microarrays are a technology that provides such information. They contain a slot for each k-mer W such that fragments containing W hybridize to the several copies of the complement fragment contained in that slot. The amount of hybridization can be converted into an estimate of the frequency count freq(W) for each k-mer W. The sequencing by hybridization problem asks us to reconstruct T from this estimated k-mer spectrum.
Another way to estimate the k-mer spectrum is to use high-throughput sequencing on T; the k-mer spectrum of the reads normalized by average coverage gives such an estimate.
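A minimal sketch of this read-based estimate is given below; the reads and the average coverage value are hypothetical, and sequencing errors are ignored.

```python
from collections import Counter

def kmer_spectrum(reads, k, coverage):
    """Estimate freq(W) for every k-mer W from a set of reads.

    Raw counts are divided by the (assumed known) average coverage and
    rounded, which is only a crude estimate in the presence of errors.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {w: max(1, round(c / coverage)) for w, c in counts.items()}

reads = ["ACGTAC", "CGTACG", "GTACGT"]   # hypothetical error-free reads
print(kmer_spectrum(reads, 3, coverage=3.0))
```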
Now, assume that we have a perfect k-mer spectrum containing no errors. It turns out that one can find in linear time a sequence T′ having exactly that k-mer spectrum. If this problem has a unique solution, then of course T′ = T.
The algorithm works by solving the Eulerian path problem (see Section 4.2.1) on a de Bruijn graph (see Section 9.7) representing the k-mers.
Here we need the following expanded de Bruijn graph G = (V, E).
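Since the formal definition of the expanded de Bruijn graph is developed in the text, the following is only a hedged sketch of the underlying idea: each k-mer W contributes freq(W) parallel arcs from its (k−1)-mer prefix to its (k−1)-mer suffix, and an Eulerian path in the resulting multigraph, found here with Hierholzer's algorithm, spells a sequence with the given spectrum (assuming the spectrum is error-free and such a path exists).

```python
from collections import defaultdict

def sequence_from_spectrum(spectrum):
    """Reconstruct a string whose k-mer spectrum is `spectrum`.

    spectrum -- dict mapping each k-mer to its frequency freq(W)
    Each k-mer contributes freq(W) parallel arcs prefix -> suffix; an
    Eulerian path in this multigraph spells one valid reconstruction.
    """
    adj = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for kmer, freq in spectrum.items():
        u, v = kmer[:-1], kmer[1:]
        for _ in range(freq):
            adj[u].append(v)
        out_deg[u] += freq
        in_deg[v] += freq
    nodes = set(out_deg) | set(in_deg)
    # Start from the vertex with one more outgoing than incoming arc,
    # or from any vertex if the Eulerian path is actually a cycle.
    start = next((u for u in nodes if out_deg[u] - in_deg[u] == 1),
                 next(iter(nodes)))
    stack, path = [start], []
    while stack:                      # Hierholzer: walk unused arcs,
        u = stack[-1]                 # backtrack when stuck
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

# k = 3 spectrum of "ACGTT":
print(sequence_from_spectrum({"ACG": 1, "CGT": 1, "GTT": 1}))  # -> "ACGTT"
```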
Recall from Chapter 1 that, through transcription and alternative splicing, each gene produces different RNA transcripts. Depending on various factors, such as the tissue the cell resides in, the presence of disease, or the response to some stimuli, the RNA transcripts of a gene and the number of copies produced (their expression level) can be different.
In this chapter we assume that we have a collection of reads from all the different (copies of the) transcripts of a gene. We also assume that these reads have been aligned to the reference genome, using for example techniques from Section 10.6; in addition, Section 15.4 shows how to exploit the output of genome analysis techniques from Chapter 11 to obtain an aligner for long reads of RNA transcripts. Our final goal is to assemble the reads into the different RNA transcripts, and to estimate the expression level of each transcript. The main difficulty of this problem, which we call multi-assembly, arises from the fact that the transcripts share identical substrings.
We illustrate different scenarios, and corresponding multi-assembly formulations, stated and solved for each individual gene. In Section 15.1 we illustrate the simplest one, in which the gene's transcripts are known in advance, and we need only find their expression levels from the read alignments. In Section 15.2 we illustrate the problem of assembling the RNA reads into the different unknown transcripts, without estimating their levels of expression. In Section 15.3 we present a problem formulation for simultaneous assembly and expression estimation.
As just mentioned, in this chapter we assume that we have a reference genome, and thus that we are in a so-called genome-guided setting. De novo multi-assembly is in general a rather hard task. Thus, we prefer here to stick to genome-guided multi-assembly, which admits clean problem formulations and already illustrates the main algorithmic concepts. Nevertheless, in Insight 15.1 we briefly discuss how the least-squares method from Section 15.1 could be applied in a de novo setting.
In the previous chapter we discussed how to analyze a sequence with respect to one or more other sequences, through various alignment techniques. In this chapter we introduce a different model, one that can abstract our biological insight about the available sequences, and can provide findings about a new sequence S. For example, given the simple assumption that in prokaryote genomes the coding regions have more C and G nucleotides than A and T nucleotides, we would like to find a segmentation of S into coding and non-coding regions. We will develop this toy problem in Example 7.1.
The model introduced by this chapter is called a hidden Markov model (HMM). Even though this is a probabilistic model, which thus copes with the often empirical nature of our biological understanding, we should stress that the focus of our presentation is mainly algorithmic. We will illustrate how simple techniques such as dynamic programming, and in particular the Bellman–Ford method from Section 4.2.2, can provide elegant solutions in this probabilistic context. The exercises given at the end of this chapter derive some other probabilistic models that connect back to the problem of aligning biological sequences from the previous chapter.
Let us start by recalling that a Markov chain is a random process generating a sequence S = s1s2…sn such that the probability of generating, or emitting, each si is fixed beforehand and depends only on the previously generated symbols s1 … si−1; we will denote this probability by ℙ(si | s1 …si−1). An important subclass of Markov chains consists of those with finite history, namely those in which the probability of generating any symbol si depends only on the previous k symbols, for some constant k; that is, ℙ(si | s1 …si−1) = ℙ(si | si−k … si−1) (k is called the order of the Markov chain).
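For concreteness, here is a small Python sketch that evaluates the probability of a sequence under an order-k Markov chain; the initial distribution over the first k symbols and the transition probabilities are invented for the example.

```python
def markov_chain_probability(S, k, init_prob, trans_prob):
    """P(S) under an order-k Markov chain.

    init_prob  -- probability of the first k symbols, init_prob[S[:k]]
    trans_prob -- trans_prob[(context, symbol)] = P(symbol | last k symbols)
    """
    p = init_prob[S[:k]]
    for i in range(k, len(S)):
        p *= trans_prob[(S[i - k:i], S[i])]
    return p

# Toy order-1 chain over {A, C}; the numbers are made up:
init = {"A": 0.5, "C": 0.5}
trans = {("A", "A"): 0.7, ("A", "C"): 0.3,
         ("C", "A"): 0.4, ("C", "C"): 0.6}
print(markov_chain_probability("AACA", 1, init, trans))  # 0.5*0.7*0.3*0.4
```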
Recall the high-throughput sequencing applications from Chapter 1. For example, in whole-genome resequencing we obtain a set of reads, that is, short extracts of DNA sequences from random positions in a donor genome. Assuming that the reference genome of the species has been assembled, one can align the reads to the reference in order to obtain an approximation of the content of the donor genome. We will consider this process in Chapter 14.
We remind the reader that reads come from both strands and, in diploid organisms, from both copies of each chromosome. This means that either the read or its reverse complement should align to the reference genome, but in what follows we do not explicitly repeat this fact. Let us now focus on how to efficiently align a set of (short) reads to a large genome.
The first observation is that, if a position in the reference genome is covered by several reads, one could try to exploit multiple alignment methods from Section 6.6 to obtain an accurate alignment. This gives an enormous semi-local multiple alignment problem with the reads and the reference sequence as input, and a multiple alignment as output (where reads can have an arbitrary amount of cost-free gaps before and after). Owing to the hardness of finding an optimal multiple alignment, shown for example in Section 6.6.3, we leave this approach just as a thought experiment. However, Chapter 14 deals with an application of this idea on a smaller scale.
Let us consider an incremental approach, which consists of aligning each read independently to the reference. This approach can be justified by the fact that, if the read is long enough, then it should not align to a random position in the genome, but exactly to the unique position in the donor from which it was sequenced. This assumes that this position is at least present in the reference genome, and that it is not inside a repeat. See Insight 10.1 for a more detailed discussion of this matter.
We shall now explore how the techniques from the previous chapters can be used in studying the genome of a species. We assume here that we have available the so-called finalized genome assembly of a species. Ideally, this means the complete sequence of each chromosome in the species. However, in practice, this consists of chromosome sequences containing large unknown substrings, and some unmapped contigs/scaffolds.
To fix our mindset, we assume that the species under consideration is a diploid organism (like a human). We also assume that the assembled genome is concatenated into one long sequence T with some necessary markers added to separate its contents, and with some auxiliary bookkeeping data structures helpful for mapping back to the chromosome representation.
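A minimal sketch of this bookkeeping (with an invented separator symbol and toy chromosome sequences) is to record the starting offset of each chromosome in T, so that a position in T can be mapped back to a chromosome coordinate by binary search:

```python
import bisect

def concatenate(chromosomes, sep="#"):
    """Concatenate chromosome sequences into one text T.

    Returns T and the list of starting offsets of each chromosome in T.
    """
    offsets, parts, pos = [], [], 0
    for chrom in chromosomes:
        offsets.append(pos)
        parts.append(chrom + sep)
        pos += len(chrom) + 1
    return "".join(parts), offsets

def map_back(position, offsets):
    """Map a position in T to (chromosome index, position inside it)."""
    i = bisect.bisect_right(offsets, position) - 1
    return i, position - offsets[i]

T, offsets = concatenate(["ACGT", "GGA"])   # T = "ACGT#GGA#"
print(map_back(6, offsets))                 # -> (1, 1): second symbol of "GGA"
```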
We start with a peak detection problem in Insight 14.1 that gives a direct connection to the segmentation problem motivating our study of hidden Markov models in Chapter 7. Then we proceed to variant calling, where we mostly formulate new problems assuming read alignments as input. The results of variant calling and read alignments are then given as the input to haplotype assembly.
Insight 14.1 Peak detection and HMMs
The coverage profile of the genome is an array storing for each position the number of reads aligned so as to cover that position. In ChIP-sequencing and bisulfite sequencing only some parts of the genome should be covered by reads, so that clear peak areas should be noticeable in the coverage profile. In targeted resequencing, the areas of interest are known beforehand, so automatic detection of peak areas is not that relevant (although usually the targeting is not quite that accurate). Peak detection from a signal is a classical signal processing task, so there are many existing general-purpose peak detectors available. In order to use the machinery we developed earlier, let us consider how HMMs could be used for our peak detection task.
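As a concrete preliminary step, the coverage profile itself can be computed from the read alignments with a difference array; the sketch below uses hypothetical 0-based, half-open alignment intervals.

```python
def coverage_profile(genome_length, alignments):
    """Number of reads covering each genome position.

    alignments -- list of (start, end) intervals, 0-based, end exclusive.
    Uses a difference array: +1 at each start, -1 at each end, then a
    prefix sum gives the coverage at every position.
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    profile, running = [], 0
    for d in diff[:genome_length]:
        running += d
        profile.append(running)
    return profile

# Hypothetical read alignments on a genome of length 10:
print(coverage_profile(10, [(0, 4), (2, 6), (5, 9)]))
# -> [1, 1, 2, 2, 1, 2, 1, 1, 1, 0]
```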