Ab initio prediction: computational prediction based on first principles or using the most elementary information.
Accession number: unique number given to an entry in a biological database, which serves as a permanent identifier for the entry.
Agglomerative clustering: microarray data clustering method that begins by first clustering the two most similar data points and subsequently repeating the process to merge groups of data successively according to similarity until all groups of data are merged. This is in principle similar to the UPGMA phylogenetic approach.
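The merging procedure in this entry can be illustrated with a minimal sketch. It assumes one-dimensional expression values and average linkage (the averaging rule used by UPGMA); the gene labels, values, and function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def average_distance(values_a, values_b):
    """Mean absolute difference between all cross-cluster pairs of values."""
    total = sum(abs(x - y) for x in values_a for y in values_b)
    return total / (len(values_a) * len(values_b))


def agglomerative_cluster(points):
    """Repeatedly merge the two most similar clusters until one remains.

    `points` maps a label to a numeric value (hypothetical expression levels).
    Returns the merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    # Every data point starts as its own cluster, keyed by a tuple of labels.
    clusters = {(label,): [value] for label, value in points.items()}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = average_distance(clusters[a], clusters[b])
        merges.append((a, b, dist))
        # Replace the two clusters with their union.
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges


if __name__ == "__main__":
    example = {"g1": 1.0, "g2": 1.5, "g3": 5.0, "g4": 5.2}
    for a, b, dist in agglomerative_cluster(example):
        print(a, "+", b, "at distance", round(dist, 3))
```

With these example values, g3 and g4 merge first, then g1 and g2, and the two resulting groups are joined last. Real microarray data would use multidimensional expression profiles and a correlation or Euclidean distance in place of the absolute difference.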
Alternative splicing: mRNA splicing event that joins different exons from a single gene to form variable transcripts. This is one of the mechanisms of generating a large diversity of gene products in eukaryotes.
Bayesian analysis: statistical method using the Bayes theorem to describe conditional probabilities of an event. It makes inferences based on initial expectation and existing observations. Mathematically, it calculates the posterior probability (revised expectation) of event A given an observed event B as the product of the prior probability of A (initial expectation) and the conditional probability of B given A (observation), divided by the total probability of event B with and without A. The method has wide applications in bioinformatics, from sequence alignment and phylogenetic tree construction to microarray data analysis.
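The calculation in this entry can be written out as a short sketch. It uses the standard form of the Bayes theorem, P(A|B) = P(B|A) P(A) / P(B); the function name and the numbers in the example are hypothetical.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Posterior probability of A after observing B, by the Bayes theorem.

    prior_a          -- P(A), the initial expectation
    p_b_given_a      -- P(B|A), probability of the observation if A holds
    p_b_given_not_a  -- P(B|not A), probability of the observation otherwise
    """
    # Total probability of B, summed over the cases with and without A.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


if __name__ == "__main__":
    # Hypothetical numbers: A is rare (1%), B is seen in 90% of A cases
    # and in 5% of non-A cases.
    print(round(posterior(0.01, 0.9, 0.05), 4))
```

In this example the posterior is only about 0.15, even though the observation is far more likely under A, because the prior probability of A is low.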
Bioinformatics: discipline of storing and analyzing biological data using computational techniques. More specifically, it is the analysis of the sequence, structure, and function of the biological macromolecules – DNA, RNA, and proteins – with the aid of computational tools that include computer hardware, software, and the Internet.
A natural extension of pairwise alignment is multiple sequence alignment, which is to align multiple related sequences to achieve optimal matching of the sequences. Related sequences are identified through the database similarity searching described in Chapter 4. As the process generates multiple matching sequence pairs, it is often necessary to convert the numerous pairwise alignments into a single alignment, which arranges sequences in such a way that evolutionarily equivalent positions across all sequences are matched.
Multiple sequence alignment has a unique advantage: it reveals more biological information than many pairwise alignments can. For example, it allows the identification of conserved sequence patterns and motifs across the whole sequence family, which are difficult to detect by comparing only two sequences. Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment. Multiple sequence alignment is also an essential prerequisite for phylogenetic analysis of sequence families and for prediction of protein secondary and tertiary structures. It also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences.
It is theoretically possible to use dynamic programming to align any number of sequences, as for pairwise alignment. However, the computing time and memory required increase exponentially with the number of sequences. As a consequence, full dynamic programming is impractical for datasets of more than about ten sequences. In practice, heuristic approaches are most often used.
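The exponential growth can be made concrete with a rough count. This sketch assumes that all sequences have the same length L and that full dynamic programming fills a lattice with one dimension per sequence, which gives (L + 1)^N cells, each computed from up to 2^N - 1 neighboring cells. The function names are hypothetical.

```python
def dp_cells(seq_len, n_seqs):
    """Number of cells in the full dynamic programming lattice."""
    return (seq_len + 1) ** n_seqs


def dp_predecessors(n_seqs):
    """Number of neighboring cells each interior cell is computed from."""
    return 2 ** n_seqs - 1


if __name__ == "__main__":
    # Hypothetical sequence length of 100 residues.
    for n in (2, 3, 5, 10):
        print(n, "sequences:", dp_cells(100, n), "cells,",
              dp_predecessors(n), "predecessors per cell")
```

For two sequences of length 100 the lattice has 10,201 cells; for three it already has 1,030,301.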
To continue discussion of molecular phylogenetics from Chapter 10, this chapter introduces the theory behind various phylogenetic tree construction methods along with the strategies used for executing the tree construction.
There are currently two main categories of tree-building methods, each having advantages and limitations. The first category is based on discrete characters, which are molecular sequences from individual taxa. The basic assumption is that characters at corresponding positions in a multiple sequence alignment are homologous among the sequences involved. Therefore, the character states of the common ancestor can be traced from this dataset. Another assumption is that each character evolves independently and is therefore treated as an individual evolutionary unit. The second category of phylogenetic methods is based on distance, which is the amount of dissimilarity between pairs of sequences, computed on the basis of sequence alignment. The distance-based methods assume that all sequences involved are homologous and that tree branches are additive, meaning that the distance between two taxa equals the sum of all branch lengths connecting them. The procedures and assumptions for each type of phylogenetic method are described in more detail in this chapter.
DISTANCE-BASED METHODS
As mentioned in Chapter 10, true evolutionary distances between sequences can be calculated from observed distances after correction using a variety of evolutionary models. The computed evolutionary distances can be used to construct a matrix of distances between all individual pairs of taxa. Based on the pairwise distance scores in the matrix, a phylogenetic tree can be constructed for all the taxa involved.
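A minimal sketch of building such a distance matrix follows. It assumes aligned nucleotide sequences and uses the Jukes-Cantor model, d = -(3/4) ln(1 - 4p/3), as one example of an evolutionary correction; the source does not say which model is intended, and the taxon names, sequences, and function names here are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring gapped columns."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(aligned):
    """Corrected distances for every pair of taxa in an alignment.

    `aligned` maps a taxon name to its aligned sequence (equal lengths).
    """
    names = list(aligned)
    return {
        (x, y): jukes_cantor(p_distance(aligned[x], aligned[y]))
        for x in names
        for y in names
    }


if __name__ == "__main__":
    alignment = {  # hypothetical toy alignment
        "taxonA": "ACGTACGTAC",
        "taxonB": "ACGTACGAAC",
        "taxonC": "ACCTATGAAC",
    }
    matrix = distance_matrix(alignment)
    for (x, y), d in matrix.items():
        if x < y:
            print(x, y, round(d, 4))
```

The corrected distance is always at least as large as the observed proportion of differences, because the model accounts for multiple substitutions at the same site. The resulting matrix is the input that distance-based tree-building methods operate on.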
Note: all exercises were originally designed for use on a UNIX workstation. However, with slight modifications, they can be used on any other operating system with Internet access.
EXERCISE 1. DATABASE SEARCHES
In this exercise, you will learn how to use several biological databases to retrieve information according to certain criteria. After learning the basic search techniques, you will be given a number of problems and asked to provide answers from the databases.
Use a web browser to retrieve a protein sequence of lambda repressor from SWISS-PROT (http://us.expasy.org/sprot/). Choose “Full text search in Swiss-Prot and TrEMBL.” On the following page, enter “lambda repressor” (a space is treated as the logical operator AND) as keywords in the query window. Select “Search in Swiss-Prot only.” Click on the “submit” button. Note that the search result contains hypertext links taking you to references that are cited or to other related information. Spend a little time studying the annotations.
In the same database, search more sequences for “human MAP kinase inhibitor,” “human catalase,” “synechocystis cytochrome P450,” “coli DNA polymerase,” “HIV CCR5 receptor,” and “Cholera dehydrogenase.” Record your findings and study the annotations.
Go to the SRS server (http://srs6.ebi.ac.uk/) and find human genes that are larger than 200 kilobase pairs and also have poly-A signals. Click on the “Library Page” button. Select “EMBL” in the “Nucleotide sequence databases” section. Choose the “Extended” query form on the left of the page. On the following page, select human (“hum”) in the “Division” section. Enter “200000” in the “SeqLength >=” field. Enter “polya_signal” in the “AllText” field. Press the “Search” button. How many hits do you get?
One of the most important scientific achievements of the twentieth century was the discovery of the DNA double helical structure by Watson and Crick in 1953. Strictly speaking, the work was the result of three-dimensional modeling based partly on data obtained from x-ray diffraction of DNA and partly on chemical bonding information established in stereochemistry. It was clear at the time that the x-ray data obtained by their colleague Rosalind Franklin were not sufficient to resolve the DNA structure. Watson and Crick conducted one of the first known ab initio modeling studies of a biological macromolecule, which has subsequently been proven to be essentially correct. Their work provided great insight into the mechanism of genetic inheritance and paved the way for a revolution in modern biology. The example demonstrates that structural prediction is a powerful tool for understanding the functions of biological macromolecules at the atomic level.
We now know that the DNA structure, a double helix, is rather invariable regardless of sequence variations. Although there is little need today to determine or model DNA structures of varying sequences, there is still a real need to model protein structures individually. This is because protein structures vary depending on the sequences. Another reason is the much slower rate of structure determination by x-ray crystallography or NMR spectroscopy compared to gene sequence generation from genomic studies. Consequently, the gap between protein sequence information and protein structural information is increasing rapidly. Protein structure prediction aims to reduce this sequence–structure gap.
The field of genomics encompasses two main areas, structural genomics and functional genomics (see Chapter 17). The former mainly deals with genome structures with a focus on the study of genome mapping and assembly as well as genome annotation and comparison; the latter is largely experiment based with a focus on gene functions at the whole genome level using high throughput approaches. The emphasis here is on “high throughput,” which is simultaneous analysis of all genes in a genome. This feature is in fact what separates genomics from traditional molecular biology, which studies only one gene at a time.
The high throughput analysis of all expressed genes is also termed transcriptome analysis, which is the expression analysis of the full set of RNA molecules produced by a cell under a given set of conditions. In practice, messenger RNA (mRNA) is the only RNA species being studied. Transcriptome analysis facilitates our understanding of how sets of genes work together to form metabolic, regulatory, and signaling pathways within the cell. It reveals patterns of coexpressed and coregulated genes and allows determination of the functions of genes that were previously uncharacterized. In short, functional genomics provides insight into the biological functions of the whole genome through automated high throughput expression analysis. This chapter mainly discusses the bioinformatics aspect of the transcriptome analysis that can be conducted using either sequence- or microarray-based approaches.
Biological sequence analysis is founded on solid evolutionary principles (see Chapter 2). Similarities and divergence among related biological sequences revealed by sequence alignment often have to be rationalized and visualized in the context of phylogenetic trees. Thus, molecular phylogenetics is a fundamental aspect of bioinformatics. In this chapter, we focus on phylogenetic tree construction. Before discussing the methods of phylogenetic tree construction, some fundamental concepts and background terminology used in molecular phylogenetics need to be described. This is followed by discussion of the initial steps involved in phylogenetic tree construction.
MOLECULAR EVOLUTION AND MOLECULAR PHYLOGENETICS
To begin the phylogenetics discussion, we need to understand the basic question, “What is evolution?” Evolution can be defined in various ways under different contexts. In the biological context, evolution can be defined as the development of a biological form from other preexisting forms, or from its origin to the currently existing form, through natural selection and modification. The driving force behind evolution is natural selection, in which “unfit” forms are eliminated through changes of environmental conditions or sexual selection so that only the fittest are selected. The underlying mechanism of evolution is genetic mutations that occur spontaneously. The mutations in the genetic material provide the biological diversity within a population and hence the variation in the ability of individuals within the population to survive successfully in a given environment. Genetic diversity thus provides the raw material for natural selection to act on.
A main application of pairwise alignment is retrieving biological sequences in databases based on similarity. This process involves submission of a query sequence and performing a pairwise comparison of the query sequence with all individual sequences in a database. Thus, database similarity searching is pairwise alignment on a large scale. This type of searching is one of the most effective ways to assign putative functions to newly determined sequences. However, the dynamic programming method described in Chapter 3 is slow and impractical to use in most cases. Special search methods are needed to speed up the computational process of sequence comparison. The theory and applications of the database searching methods are discussed in this chapter.
UNIQUE REQUIREMENTS OF DATABASE SEARCHING
There are unique requirements for implementing algorithms for sequence database searching. The first criterion is sensitivity, which refers to the ability to find as many correct hits as possible. It is measured by the extent of inclusion of correctly identified sequence members of the same family. These correct hits are considered “true positives” in the database searching exercise. The second criterion is selectivity, also called specificity, which refers to the ability to exclude incorrect hits. These incorrect hits are unrelated sequences mistakenly identified in database searching and are considered “false positives.” The third criterion is speed, which is the time it takes to get results from database searches. Depending on the size of the database, speed sometimes can be a primary concern.
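The first two criteria can be quantified once the true family members in a database are known. The sketch below uses the common statistical definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); these formulas are an assumption here, since the text describes the criteria only in words, and the sequence identifiers are hypothetical.

```python
def search_metrics(hits, true_members, database):
    """Sensitivity and specificity of one database search.

    hits          -- identifiers returned by the search
    true_members  -- identifiers that really belong to the query's family
    database      -- all identifiers in the searched database
    """
    hits, true_members, database = set(hits), set(true_members), set(database)
    tp = len(hits & true_members)              # correct hits found
    fp = len(hits - true_members)              # unrelated sequences reported
    fn = len(true_members - hits)              # family members missed
    tn = len(database - hits - true_members)   # unrelated and not reported
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


if __name__ == "__main__":
    database = ["s%d" % i for i in range(1, 11)]   # hypothetical 10 entries
    family = {"s1", "s2", "s3", "s4"}
    reported = {"s1", "s2", "s3", "s7"}
    sens, spec = search_metrics(reported, family, database)
    print("sensitivity:", sens, "specificity:", round(spec, 3))
```

In this toy search, three of the four family members are found (sensitivity 0.75) and one of the six unrelated sequences is reported (specificity 5/6). Loosening the search threshold typically raises sensitivity at the cost of specificity.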