The chapter in BSA that introduces Markov chains and hidden Markov models plays a critical role in that book. The sequence comparison algorithms described in Chapter 2 could not be developed without the introduction of the theoretically justified similarity scores and statistical theory of similarity score distributions. These developments, in turn, are not feasible without rational choices of probabilistic models for DNA and protein sequences. Both Markov chains and hidden Markov models are often remarkably good candidates for the sequence models. Moreover, hidden Markov models (HMMs) are potentially a more flexible means for biological sequence analysis because they allow simultaneous modeling of observable and non-observable (hidden) states. The presence of the two types of states perfectly fits the need to model some important additional information existing beyond sequences per se, such as the functional meaning of the sequence elements, matches and mismatches of symbols in pairs of aligned sequences, evolutionarily conserved regions in multiple sequences, phylogenetic relationships, etc.
Chapter 3 of BSA introduces the fundamental algorithms of HMM theory: the Viterbi algorithm, the forward and backward algorithms, as well as the Baum–Welch algorithm. All of these algorithms are amenable to a variety of applications in biological sequence analysis. Of course, some of these HMM constructions exist in parallel with their non-probabilistic counterparts; for example, consider the Viterbi algorithm for a pair HMM and the classic dynamic programming algorithm for pairwise alignment. Both HMM and non-HMM approaches are known for finding conserved domains, building phylogenetic trees, etc.
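As an illustration of two of the algorithms named above, the following is a minimal sketch of the forward and Viterbi recursions for a small HMM. It is not taken from BSA: the function names, the dictionary-based model layout and the two-state coin model are assumptions made for the example, and the sketch omits an explicit end state and works with raw probabilities rather than logarithms, so it is only suitable for short sequences.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs), summed over all hidden paths (forward algorithm)."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit_p[s][x] * sum(f[r] * trans_p[r][s] for r in states)
             for s in states}
    return sum(f.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the single most probable hidden path."""
    v = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per position after the first
    for x in obs[1:]:
        ptr, new_v = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans_p[r][s])
            ptr[s] = best
            new_v[s] = v[best] * trans_p[best][s] * emit_p[s][x]
        back.append(ptr)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):  # trace the back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path


# Hypothetical toy model (an assumption for this sketch, not from BSA):
# a fair coin "F" and a biased coin "B" emitting heads "H" or tails "T".
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

if __name__ == "__main__":
    prob, path = viterbi("HHHH", STATES, START, TRANS, EMIT)
    print("most probable path:", "".join(path), "with probability", prob)
    print("total probability:", forward("HHHH", STATES, START, TRANS, EMIT))
```

The two functions differ only in replacing the sum over predecessor states by a maximum, which is why the forward probability can never be smaller than the Viterbi probability for the same sequence.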
In this chapter, the BSA problems focus on deriving the formulas that support probabilistic modeling and the HMM algorithm construction.
The notion of sequence similarity is perhaps the most fundamental concept in biological sequence analysis. In the same way that the similarity of morphological traits served as evidence of genetic and functional relationships between species in classic genetics and biology, biological sequence similarity could frequently indicate structural and functional conservation among evolutionarily related DNA and protein sequences. Introducing a biologically relevant quantitative measure of sequence similarity, the similarity score, is not a trivial task. No simpler is the second task, developing algorithms that find the alignment of two sequences with the best possible score under a given scoring system. Finally, the third necessary component of the computational analysis of sequence similarity is a method for evaluating the statistical significance of an alignment. Such a method, which establishes the cut-off values above which observed scores are statistically significant, works properly only once the statistical distribution of similarity scores has been determined analytically or computationally.
Chapter 2 of BSA includes twelve problems that require knowledge of the concepts and properties of the pairwise alignment algorithms. This topic is traditionally best known to biologists due to its utmost practical importance. Indeed, an initial characterization of any DNA or protein sequence starts with a BLAST analysis, which uses a highly efficient heuristic pairwise alignment algorithm to search a database for homologous sequences.
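For orientation, the exact (non-heuristic) counterpart of such a search is the dynamic programming algorithm for global pairwise alignment. The sketch below computes only the optimal score, with a linear gap penalty. The function name and the match, mismatch and gap values are illustrative assumptions, not the scoring system used in BSA, and BLAST itself uses a different, heuristic strategy.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of x and y with a linear gap penalty."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap  # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap  # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,    # x[i-1] against a gap
                          F[i][j - 1] + gap)    # y[j-1] against a gap
    return F[n][m]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
```

Recovering the alignment itself would additionally require storing, for each cell, which of the three cases gave the maximum and tracing those choices back from the final cell.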
Nine additional problems provide more information for understanding the protein evolution theory behind the log-odds scores of amino acid substitutions, as well as the models involved in the assessment of the statistical significance of the observed sequence similarity scores.
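A log-odds substitution score compares the probability of seeing two residues aligned in related sequences with the probability of seeing them together by chance. A minimal sketch, assuming the standard form s(a, b) = log(p_ab / (q_a q_b)), is given here; the function name, the choice of base 2 and the numbers in the usage line are assumptions for illustration only.

```python
import math


def log_odds_score(p_ab, q_a, q_b, base=2.0):
    """Log-odds score of aligning residues a and b.

    p_ab: joint probability of the aligned pair in related sequences.
    q_a, q_b: background frequencies of a and b in unrelated sequences.
    """
    return math.log(p_ab / (q_a * q_b), base)


if __name__ == "__main__":
    # Hypothetical values: the pair is twice as frequent as expected by chance.
    print(log_odds_score(0.02, 0.1, 0.1))
```

A positive score means the pair is seen more often in related sequences than chance predicts, and a negative score means it is seen less often.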
Establishing phylogenetic relationships between species is one of the central problems of biological science. While in Chapter 7 the reader was introduced to non-probabilistic methods of building phylogenetic trees for DNA and protein sequences, Chapter 8 continues the subject from the standpoint of consistent probabilistic methodology. The evolution of biological sequences has been largely viewed as a random process, and several probabilistic models with varying levels of complexity have been proposed. Therefore, the reconstruction of phylogenetic relationships can be formulated in probabilistic terms as well.
Several introductory BSA problems in Chapter 8 are concerned with the properties of the simplest probabilistic models of evolution, such as the Jukes–Cantor and the Kimura models.
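For the Jukes–Cantor model the substitution probabilities have a simple closed form, sketched below. The parametrization is an assumption: alpha denotes the rate of change to each specific other nucleotide, so that a nucleotide stays unchanged over time t with probability 1/4 + (3/4)exp(-4·alpha·t). The function name is hypothetical.

```python
import math


def jukes_cantor(a, b, t, alpha=1.0):
    """P(b at time t | a at time 0) under the Jukes-Cantor model.

    alpha is the assumed rate of substitution to each specific other base.
    """
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 10.0):
        print(t, jukes_cantor("A", "A", t), jukes_cantor("A", "G", t))
```

At t = 0 the matrix is the identity, and for large t every entry approaches 1/4, the uniform stationary distribution of the model.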
Given a set of sequences (associated with the leaves of a tree) and a model of the process of substitutions in a DNA or protein sequence, it is important to know how to compute the likelihood of a tree with a given topology. The Felsenstein algorithm addresses this issue using the post-order traversal. Felsenstein also developed an EM-type algorithm for finding the optimal (maximum likelihood) lengths of the tree edges. However, as the number of leaves increases, the number of tree topologies grows too quickly to be processed in a reasonable time.
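The post-order computation can be sketched for a single alignment column as follows. This is an illustrative sketch, not the BSA formulation: the tree encoding (a leaf is a nucleotide string, an internal node is a tuple of left child, left edge length, right child, right edge length), the use of Jukes–Cantor substitution probabilities with rate alpha, the uniform root distribution and all function names are assumptions.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability of base b after time t, starting from base a."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node):
    """Return {base: P(leaves below node | node carries base)}.

    Children are processed before their parent (post-order traversal).
    """
    if isinstance(node, str):  # leaf: the observed nucleotide
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    pl = partial_likelihoods(left)
    pr = partial_likelihoods(right)
    return {
        a: sum(jc_prob(a, b, t_left) * pl[b] for b in BASES)
           * sum(jc_prob(a, c, t_right) * pr[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(tree):
    """Likelihood of one column, with a uniform distribution at the root."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[a] for a in BASES)


if __name__ == "__main__":
    # Hypothetical three-leaf tree with arbitrary edge lengths.
    tree = (("A", 0.1, "A", 0.2), 0.3, "G", 0.4)
    print(site_likelihood(tree))
```

For a full alignment, and assuming independent columns, the tree likelihood is the product of the per-column values, usually accumulated as a sum of logarithms.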
Therefore, finding the optimal tree among all possible trees for a rather large number of sequences (leaves) is one of the major challenges. The mainstream approach to managing such a problem is sampling from the posterior distribution on the space of trees.
The tree HMM concept described in BSA could be used for phylogenetic tree construction using the most general models of sequence evolution.
The one-dimensional string is too simple a model to reflect fully the properties of a real biological molecule, which have, after all, been determined by its three-dimensional structure selected in the course of evolution. Physical interactions of amino acids and nucleotides in the three-dimensional folds have to be described by models that go beyond the short-range correlations which are the typical targets of the Markov chain models. The long-range correlations are more important for proteins than for DNA, which has a rather uniform double helix structure. However, the structure of another nucleic acid, RNA, commonly has a significant number of long-range interactions of a special type, which could be a target for yet another class of probabilistic models.
Chapter 9 introduces the Chomsky hierarchy of deterministic transformational grammars, the models developed originally for natural languages and then applied to computer languages. These grammars could be readily used for the description of a protein (a regular grammar could generate amino acid sequences described as the PROSITE patterns) and RNA (a context-free grammar could generate RNA sequences with a given secondary structure).
Further generalization of these deterministic grammar classes to stochastic ones increases opportunities for sequence modeling. Stochastic regular grammars could be shown to be equivalent to hidden Markov models. Stochastic context-free grammars (SCFGs) are useful for modeling RNA sequences.
Classifying biological sequences into families is one of the major challenges of bioinformatics. In fact, it is difficult to provide an exact definition of a family (for example, a protein family). Even with the introduction of the notion of an evolutionarily conserved protein domain as a structural determinant of a protein family, classifying multidomain proteins consistently is not a simple task. For practical purposes, nevertheless, it is important to develop efficient computational tools able to assign a protein translated from a newly predicted gene to one of the already established families, thus characterizing the protein based on its amino acid sequence alone.
Computational tools of protein characterization have to recognize the family-specific features in a new protein sequence. Frequently, these detectable common features are manifested as statistically significant structural conservations. The computational tools that are required to solve the classification problem should be able to (i) make use of known structural patterns specific for a given family, (ii) detect the family patterns in the new protein sequence by alignment of the new protein to the family model, and (iii) assess the statistical significance of the detected similarity in order to help correctly identify the true family members.
These three properties of the protein characterization algorithm are similar to the properties of the pairwise sequence alignment algorithm, but there are significant differences. First, the availability of several sequences makes the differential scoring of amino acid matches feasible, with matches (mismatches) in conserved positions receiving higher (lower) scores than similar events in nonconserved positions.
The typical life cycle of an aphid is cyclical parthenogenesis which involves the alternation of sexual and asexual reproduction. However, aphid life cycles, even within a species, can encompass everything on a continuum from obligate sexuality, through facultative sexuality to obligate asexuality. Loss of the sexual cycle in aphids is frequently associated with the introduction of a new pest and can occur for a number of environmental and genetic reasons. Here we investigate loss of sexual function in Sitobion aphids in Australia. Specifically, we aimed to determine whether an absence of sexual reproduction in Australian Sitobion results from genetic loss of sexual function or environmental constraints in the introduced range. We addressed our aims by performing a series of breeding experiments. We found that some lineages have genetically lost sexual function while others retain sexual function and appear environmentally constrained to asexuality. Further, in our crosses, using autosomal and X-linked microsatellite markers, we identified processes deviating from normal Mendelian segregation. We observed strong deviations in X chromosome transmission through the sexual cycle. Additionally, when progeny genotypes were examined across multiple loci simultaneously we found that some multilocus genotypes are significantly over-represented in the sample and that levels of heterozygosity were much higher than expected at almost all loci. This study demonstrates that strong biases in the transmission of X chromosomes through the sexual cycle are likely to be widespread in aphids. The mechanisms underlying these patterns are not clear. We discuss several possible alternatives, including mutation accumulation during periods of functional asexuality and genetic imprinting.
Loywyck, V., Pinard-van der Laan, M.-H., Goldringer, I. & Verrier, E. (2006). On the need for combining complementary analyses to assess the effect of a candidate gene and the evolution of its polymorphism: the example of the Major Histocompatibility Complex in chicken. Genetical Research 87, 125–131.
The published version of Table 2 omitted the column headings; the correct version is given below.
The Tol2 element of the medaka fish Oryzias latipes is a member of the hAT (hobo/Activator/Tam3) transposable element family. There is evidence for rapid expansion in the genome and throughout the species in the past but a high spontaneous transposition rate is not observed with current fish materials, suggesting that the Tol2 element and its host species have already acquired an interactive mechanism to control the transposition frequency. DNA methylation is a possible contributing factor, given its involvement with many other transposable elements. We therefore soaked embryos in 5-azacytidine, a reagent that causes reduction in the DNA methylation level, and examined amounts of PCR products reflecting the somatic excision frequency, obtaining direct evidence that exposure promotes Tol2 excision. Our results thus suggest that methylation of the genome DNA is a factor included in the putative mechanisms of control of transposition of the Tol2 element.
Weak selection is maintaining the Drosophila americana X/4 fusion chromosomal frequency cline. The gene(s) harbouring the advantageous variant(s) that is responsible for the establishment and maintenance of this chromosomal frequency gradient must be located in a region of the X and/or 4th chromosome that is genetically isolated between the X/4 fusion and non-fusion forms. The limits of these regions must thus be determined before an attempt is made to identify these genes. For this purpose, the correspondence between the D. virilis X and 4th chromosome genome scaffolds sequence and the D. americana gene order was established. Polymorphism levels and patterns at seven genes located at the base of the D. americana X chromosome, as well as three genes located at the base of the 4th chromosome, were analysed. The data suggest that the D. americana X/4 fusion is no more than 29000 years old. At the base of the X chromosome, there is suppression of recombination within X/4 fusion and non-fusion chromosomes, and little recombination between the two chromosomal forms. Apparent fixed silent and replacement differences are found in three of seven genes analysed located at the base of the X chromosome. There is no evidence for suppression of recombination between fusion and non-fusion chromosomes at the base of the 4th chromosome. The advantageous variant responsible for the establishment in frequency and maintenance of the X/4 fusion is thus inferred to be in the D. americana X centromere–inversion Xc basal breakpoint region.
Many biological processes, from cellular metabolism to population dynamics, are characterized by particular allometric scaling relationships between rate and size (power laws). A statistical model for mapping specific quantitative trait loci (QTLs) that are responsible for allometric scaling laws has been developed. We present an improved model for allometric mapping of QTLs based on a more general allometry equation. This improved model includes two steps: (1) use model II regression analysis to estimate the parameters underlying universal allometric scaling laws, and (2) substitute the estimated allometric parameters in the mixture-based mapping model to obtain the estimation of QTL position and effects. This model has been validated by a real example for a mouse F2 progeny, in which two QTLs were detected on different chromosomes that determine the allometric relationship between growth rate and body weight.
We propose a simple approach, the multiplicative background correction, to solve a perplexing problem in spotted microarray data analysis: correcting the foreground intensities for the background noise, especially for spots whose genes are weakly expressed or not expressed at all. The conventional approach, the additive background correction, directly subtracts the background intensities from foreground intensities. When the foreground intensities marginally dominate the background intensities, the additive background correction provides unreliable estimates of the differential gene expression levels and usually presents M–A plots with ‘fishtails’ or fans. Unreliable additive background correction makes it preferable to ignore the background noise, which may increase the number of false positives. Based on the more realistic multiplicative assumption instead of the conventional additive assumption, we propose to logarithmically transform the intensity readings before the background correction, with the logarithmic transformation symmetrizing the skewed intensity readings. This approach not only precludes the ‘fishtails’ and fans in the M–A plots, but also provides highly reproducible background-corrected intensities for both strongly and weakly expressed genes. The superiority of the multiplicative background correction to both the additive correction and no background correction is justified by publicly available self-hybridization datasets.
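The contrast between the two corrections can be sketched for a single two-channel spot as follows. This is one reading of the abstract, not the authors' published estimator: the function names, the red/green channel naming, the use of base-2 logarithms and the example intensities are all assumptions.

```python
import math


def additive_log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log ratio M after subtracting background on the raw intensity scale.

    Returns None when a corrected intensity is not positive, since the
    logarithm is then undefined.
    """
    r, g = r_fg - r_bg, g_fg - g_bg
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)


def multiplicative_log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log ratio M when intensities are log-transformed first and the
    background is then subtracted on the log scale (assumed reading)."""
    red = math.log2(r_fg) - math.log2(r_bg)
    green = math.log2(g_fg) - math.log2(g_bg)
    return red - green


if __name__ == "__main__":
    # Hypothetical weak spot: foreground barely above background.
    spot = dict(r_fg=105.0, r_bg=100.0, g_fg=101.0, g_bg=100.0)
    print("additive:      ", additive_log_ratio(**spot))
    print("multiplicative:", multiplicative_log_ratio(**spot))
```

On the hypothetical weak spot, the additive version turns a few intensity units of difference into a large log ratio, which is the kind of instability the abstract attributes to spots whose foreground only marginally exceeds the background.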
The sex comb on the forelegs of Drosophila males is a secondary sexual trait, and the number of teeth on these combs varies greatly within and between species. To understand the relationship between the intra- and interspecific variation, we performed quantitative trait locus (QTL) analyses of the intraspecific variation in sex-comb tooth number. We used five mapping populations derived from two inbred Drosophila simulans strains that were divergent in the number of sex-comb teeth. Although no QTLs were detected on the X chromosome, we identified four QTLs on the second chromosome and three QTLs on the third chromosome. While identification and estimated effects of the second-chromosome QTLs depend on genetic backgrounds, significant and consistent effects of the two third-chromosome QTLs were found in two genetic backgrounds. There were significant epistatic interactions between a second-chromosome QTL and a third-chromosome QTL, as well as between two second-chromosome QTLs. The third-chromosome QTLs are concordant with the locations of the QTLs responsible for the previously observed differences in sex-comb tooth number between D. simulans and D. mauritiana.