In Chapter 3 of BSA we learned that a DP algorithm for pairwise sequence alignment allows a probabilistic interpretation. Indeed, the equivalent equations appear in the logarithmic form of the Viterbi algorithm for the hidden Markov model of a gapped sequence alignment. The hidden states of such a model, called a pair HMM, correspond to the alignment match, the x-gap, and the y-gap positions. The pair HMM state diagram is topologically similar to the diagram of the finite state machine (Durbin et al. (1998), Fig. 4.1), although the pair HMM parameters have clear probabilistic meanings. The optimal finite state machine alignment found by standard DP is equivalent to the most probable path through the pair HMM determined by the Viterbi algorithm. Both global and local optimal DP alignment algorithms have Viterbi counterparts for suitably defined HMMs. Interestingly, the HMM has an advantage over the finite state machine because the HMM can compute the full probability that sequences X and Y could be generated by a given pair HMM; thus, a probabilistic measure can be introduced to help establish evolutionary relationships. This full probabilistic model also defines (i) the posterior distribution over all possible alignments given sequences X and Y and (ii) the posterior probability that a particular symbol x of sequence X is aligned to a given symbol y of sequence Y. However, real biological sequences cannot be considered to be exact realizations of probabilistic models. This explains the difficulties met by HMM-based alignment methods in similarity searches (Durbin et al. (1998), Sect. 4.5), while simpler finite state machine methods perform sufficiently well.
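The three-state structure described above (match, x-gap, y-gap) can be illustrated with a minimal sketch of the non-probabilistic side of the equivalence: a global alignment with affine gap penalties computed by dynamic programming over three tables. The function name `affine_global_score` and the numeric scores are assumptions chosen for illustration, not values taken from BSA; in the pair HMM reading, each score would be replaced by the logarithm of a transition or emission probability ratio.

```python
import math


def affine_global_score(x, y, match=1.0, mismatch=-1.0,
                        gap_open=-2.0, gap_extend=-0.5):
    """Best global alignment score using three states.

    M: x[i-1] aligned to y[j-1]; X: x[i-1] aligned to a gap;
    Y: y[j-1] aligned to a gap. Scores are illustrative assumptions.
    """
    neg = -math.inf
    n, m = len(x), len(y)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    # Boundary rows and columns: one gap run of length i (or j).
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # A match state can follow any of the three states.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # A gap state is either opened from M or extended from itself.
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Replacing each `max` by a sum over probabilities (rather than a maximum over log scores) would give the forward-type computation of the full probability mentioned in the text.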
The reader will quickly discover that the organization of this book was chosen to be parallel to the organization of Biological Sequence Analysis by Durbin et al. (1998). The first chapter of BSA contains an introduction to the fundamental notions of biological sequence analysis: sequence similarity, homology, sequence alignment, and the basic concepts of probabilistic modeling.
Finding these distinct concepts described back-to-back is surprising at first glance. However, let us recall several important bioinformatics questions. How could we construct a pairwise sequence alignment? How could we build an alignment of multiple sequences? How could we create a phylogenetic tree for several biological sequences? How could we predict an RNA secondary structure? None of these questions can be consistently addressed without use of probabilistic methods. The mathematical complexity of these methods ranges from basic theorems and formulas to sophisticated architectures of hidden Markov models and stochastic grammars able to grasp fine compositional characteristics of empirical biological sequences.
The explosive growth of biological sequence data created an excellent opportunity for the meaningful application of discrete probabilistic models. Perhaps, without much exaggeration, the implications of this new development could be compared with implications of the revolutionary use of calculus and differential equations for solving problems of classic mechanics in the eighteenth century.
The problems considered in this introductory chapter are concerned with the fundamental concepts that play an important role in biological sequence analysis: the maximum likelihood and the maximum a posteriori (Bayesian) estimation of the model parameters. These concepts are crucial for understanding statistical inference from experimental data and are impossible to introduce without notions of conditional, joint, and marginal probabilities.
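As a small illustration of the two estimation principles, the sketch below estimates nucleotide frequencies from a sequence. It assumes a multinomial model with a Dirichlet prior, for which the maximum a posteriori estimate has a closed form when every prior parameter is at least 1; the function names and the prior values are hypothetical choices for this example.

```python
from collections import Counter


def ml_estimate(seq, alphabet="ACGT"):
    """Maximum likelihood estimate: observed relative frequencies.

    Assumes seq contains at least one symbol from the alphabet.
    """
    counts = Counter(seq)
    total = sum(counts[a] for a in alphabet)
    return {a: counts[a] / total for a in alphabet}


def map_estimate(seq, alpha, alphabet="ACGT"):
    """MAP estimate under a Dirichlet prior with parameters alpha[a] >= 1.

    The posterior mode is (n_a + alpha_a - 1) / sum_b (n_b + alpha_b - 1).
    """
    counts = Counter(seq)
    total = sum(counts[a] + alpha[a] - 1 for a in alphabet)
    return {a: (counts[a] + alpha[a] - 1) / total for a in alphabet}


# Example: a short sequence in which G and T never occur.
prior = {a: 2.0 for a in "ACGT"}       # assumed prior, one extra count per base
ml = ml_estimate("AAAC")               # G and T get probability 0
posterior_mode = map_estimate("AAAC", prior)  # G and T get a small nonzero value
```

With little data the prior keeps unobserved symbols from receiving zero probability; as the sequence grows, the two estimates converge.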
Bioinformatics, an integral part of post-genomic biology, creates principles and ideas for computational analysis of biological sequences. These ideas facilitate the conversion of the flood of sequence data unleashed by the recent information explosion in biology into a continuous stream of discoveries. Not surprisingly, the new biology of the twenty-first century has attracted the interest of many talented university graduates with various backgrounds. Teaching bioinformatics to such a diverse audience presents a well-known challenge. The approach requiring students to advance their knowledge of computer programming and statistics prior to taking a comprehensive core course in bioinformatics has been accepted by many universities, including the Georgia Institute of Technology, Atlanta, USA.
In 1998, at the start of our graduate program, we selected the then recently published book Biological Sequence Analysis (BSA) by Richard Durbin, Anders Krogh, Sean R. Eddy, and Graeme Mitchison as a text for the core course in bioinformatics. Through the years, BSA, which describes the ideas of the major bioinformatic algorithms in a remarkably concise and consistent manner, has been widely adopted as a required text for bioinformatics courses at leading universities around the globe.
Many problems included in BSA as exercises for its readers have been repeatedly used for homework and tests. However, detailed solutions to these problems have not been available. The absence of such a resource was noticed by students and teachers alike. The goal of this book, Problems and Solutions in Biological Sequence Analysis, is to close this gap, extend the set of workable problems, and help its readers develop problem-solving skills that are vitally important for conducting successful research in the growing field of bioinformatics.
The theory described in Chapter 5 of BSA suggests that constructing the multiple alignment of several biological sequences should be a part of the algorithm of the profile HMM training. Such an iterative expectation maximization method is supposed to estimate parameters of the profile HMM from unaligned sequences by means of the construction of the multiple alignment in parallel with the HMM parameter estimation. The resulting alignment can be evoked at the last step of the algorithm via an optimal alignment of each individual sequence to the just built profile HMM. Nevertheless, since this impressive theoretical design meets many practical difficulties, discussed in great detail in BSA, it has not yet been implemented in its pure form as an efficient tool for multiple sequence alignment.
One of the major difficulties on the road to a universal and efficient multiple sequence alignment algorithm is establishing a gold standard that would help to distinguish a good alignment from a better one. Since both sequence and structure are evolving and the ancestral sequences and structures can be reconstructed only by theoretical means, it is impossible to verify experimentally either alignments or phylogenies. Nevertheless, a formal assignment of the alignment score immediately leads to the notion of the best alignment for a given set of sequences; however, the implications of an optimal alignment defined in this way have to be taken cautiously. There are several biologically motivated options for the score assignment. For instance, the sum-of-pairs score is computationally convenient and frequently used, but it has well-known theoretical drawbacks (Durbin et al. (1998), p. 141).
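To make the sum-of-pairs score concrete, here is a minimal sketch that scores each column of a multiple alignment by summing a pairwise score over all pairs of rows. The function name and the match, mismatch, and gap values are assumptions for illustration; a real application would use a substitution matrix, and a pair of two gap symbols is scored as zero here.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # gap paired with gap contributes nothing
            if a == "-" or b == "-":
                total += gap        # residue paired with a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

For the three-row alignment `["AC-", "ACG", "A-G"]` the columns contribute 3, -3, and -3, so the total is -3. Because every pair of rows is counted, the score grows quadratically with the number of sequences in a column.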
This paper presents a simple numerical method for forward kinematics of general Stewart–Gough platforms, which can generate a unique solution directly. This method utilizes the trivial nature of the inverse kinematics of parallel manipulators, and derives a straightforward linear relationship between the small change in joint variables (leg lengths) and the resulting small motion of the platform. The solution to the forward kinematics is then achieved through a series of small changes in joint variables. Numerical examples validate and confirm the efficiency of the method.
Stochastic transformational grammars, particularly stochastic context-free grammars, turned out to be effective modeling tools for RNA sequence analysis. Two biologically interesting problems are the prediction of RNA secondary structure and the construction of multiple alignments of RNA families. Non-stochastic algorithms for the RNA secondary structure prediction were developed more than twenty years ago (by Nussinov et al. (1978) and by Zuker and Stiegler (1981)). Notably, the Nussinov algorithm could be immediately rewritten in SCFG terms as a version of the Cocke–Younger–Kasami (CYK) algorithm. The SCFG interpretation provides an insight into the probabilistic meaning of parameters of the original Nussinov algorithm and also suggests statistical procedures for parameter estimation. A similar translation into SCFG terms is possible for the Zuker algorithm.
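The Nussinov recursion mentioned above can be sketched as a short dynamic program that maximizes the number of base pairs in a nested (pseudoknot-free) structure. This is a minimal illustration under stated assumptions: only Watson-Crick pairs are allowed, and the `min_loop` parameter (the minimum number of unpaired bases enclosed by a pair) is a hypothetical extension that defaults to 0. Only the maximal pair count is returned, without the traceback.

```python
def nussinov_max_pairs(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # g[i][j] holds the best pair count for the subsequence seq[i..j].
    g = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Leave position i or position j unpaired.
            best = max(g[i + 1][j], g[i][j - 1])
            # Pair i with j around the best structure for the interior.
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                inner = g[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Bifurcation: join two independent substructures.
            for k in range(i + 1, j):
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g[0][n - 1]
```

In the SCFG reading, each of the four cases in the recursion corresponds to a grammar production, and replacing the pair count by log probabilities turns the same table fill into a CYK-style computation.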
Interestingly, equivalence between the non-probabilistic dynamic programming sequence alignment algorithm and the Viterbi algorithm for a pair HMM is analogous to equivalence between the non-probabilistic algorithm of RNA structure prediction and the CYK algorithm for a SCFG. There is also an analogy between the use of the profile HMM for alignment of multiple DNA or protein sequences and the use of the SCFG-based RNA structure profiles, called covariance models (CMs), for constructing structurally sound alignments of multiple RNAs. Furthermore, parameters of the covariance models could be derived by the inside–outside expectation maximization algorithm (compare with the simultaneous profile HMM parameter estimation and construction of multiple sequence alignment).
This study addresses the problem of adaptive control of both nonredundant and redundant robotic manipulators with state-dependent constraints. The task of the robot is to follow, with its end-effector, a prescribed geometric path given in the task space. This task is solved on the basis of Lyapunov stability theory, which is used to derive the control scheme. A new adaptive Jacobian controller is proposed for path following by a robot with both uncertain kinematics and dynamics. Numerical simulations carried out for a planar redundant three-DOF (three degrees of freedom) manipulator, whose end-effector follows a prescribed geometric path given in a two-dimensional (2D) task space, illustrate the trajectory performance of the proposed control scheme.
The chapter in BSA that introduces Markov chains and hidden Markov models plays a critical role in that book. The sequence comparison algorithms described in Chapter 2 could not be developed without the introduction of theoretically justified similarity scores and a statistical theory of similarity score distributions. These developments, in turn, are not feasible without rational choices of probabilistic models for DNA and protein sequences. Both Markov chains and hidden Markov models are often remarkably good candidates for the sequence models. Moreover, hidden Markov models (HMMs) are potentially a more flexible means for biological sequence analysis because they allow simultaneous modeling of observable and non-observable (hidden) states. The presence of the two types of states perfectly fits the need to model important additional information existing beyond the sequences per se, such as the functional meaning of the sequence elements, matches and mismatches of symbols in pairs of aligned sequences, evolutionarily conserved regions in multiple sequences, phylogenetic relationships, etc.
Chapter 3 of BSA introduces the fundamental algorithms of HMM theory: the Viterbi algorithm, the forward and backward algorithms, as well as the Baum–Welch algorithm. All of these algorithms are amenable to a variety of applications in biological sequence analysis. Of course, some of these HMM constructions exist in parallel with their non-probabilistic counterparts; for example, consider the Viterbi algorithm for a pair HMM and the classic dynamic programming algorithm for pairwise alignment. Both HMM and non-HMM approaches are known for finding conserved domains, building phylogenetic trees, etc.
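The difference between the Viterbi and forward algorithms can be seen in a minimal sketch for a two-state HMM. The model below (a fair state F and a loaded state L emitting die rolls) and all of its probabilities are assumptions chosen for illustration. No explicit end state is modeled, and plain probabilities are used instead of logarithms, so the sketch is suitable only for short, non-empty observation sequences.

```python
def forward(obs, states, start, trans, emit):
    """Total probability of obs, summed over all hidden state paths."""
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for symbol in obs[1:]:
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())


def viterbi(obs, states, start, trans, emit):
    """Most probable hidden state path and its joint probability with obs."""
    v = {k: start[k] * emit[k][obs[0]] for k in states}
    back = []
    for symbol in obs[1:]:
        ptr, nxt = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * trans[k][l])
            ptr[l] = best
            nxt[l] = emit[l][symbol] * v[best] * trans[best][l]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers from the final state
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


# Illustrative model (assumed parameters).
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {s: 1 / 6 for s in "123456"},
        "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}
```

The two functions share the same recursion; the forward algorithm sums over predecessor states where the Viterbi algorithm takes the maximum, so the forward probability is never smaller than the Viterbi path probability.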
In this chapter, the BSA problems focus on deriving the formulas that support probabilistic modeling and the HMM algorithm construction.
The notion of sequence similarity is perhaps the most fundamental concept in biological sequence analysis. In the same way that the similarity of morphological traits served as evidence of genetic and functional relationships between species in classic genetics and biology, biological sequence similarity can frequently indicate structural and functional conservation among evolutionarily related DNA and protein sequences. Introducing a biologically relevant quantitative measure of sequence similarity, the similarity score, is not a trivial task. No simpler is the second task: developing algorithms that find the alignment of two sequences with the best possible score under a given scoring system. Finally, the third necessary component of the computational analysis of sequence similarity is a method for evaluating the statistical significance of an alignment. Such a method, which establishes the cut-off values above which observed scores are statistically significant, works properly once the statistical distribution of similarity scores has been determined analytically or computationally.
Chapter 2 of BSA includes twelve problems that require knowledge of the concepts and properties of the pairwise alignment algorithms. This topic is traditionally the best known to biologists because of its great practical importance. Indeed, an initial characterization of any DNA or protein sequence starts with a BLAST analysis, which uses a highly efficient heuristic pairwise alignment algorithm to search a database for homologous sequences.
Nine additional problems provide more information for understanding the protein evolution theory behind the log-odds scores of amino acid substitutions, as well as the models involved in the assessment of the statistical significance of the observed sequence similarity scores.
Establishing phylogenetic relationships between species is one of the central problems of biological science. While in Chapter 7 the reader was introduced to non-probabilistic methods of building phylogenetic trees for DNA and protein sequences, Chapter 8 continues the subject from the standpoint of consistent probabilistic methodology. The evolution of biological sequences has been largely viewed as a random process, and several probabilistic models with varying levels of complexity have been proposed. Therefore, the reconstruction of phylogenetic relationships can be formulated in probabilistic terms as well.
Several introductory BSA problems in Chapter 8 are concerned with the properties of the simplest probabilistic models of evolution, such as the Jukes–Cantor and the Kimura models.
Given a set of sequences (associated with the leaves of a tree) and a model of the process of substitutions in a DNA or protein sequence, it is important to know how to compute the likelihood of a tree with a given topology. The Felsenstein algorithm addresses this issue using the post-order traversal. Felsenstein also developed an EM-type algorithm for finding the optimal (maximum likelihood) lengths of the tree edges. However, as the number of leaves increases, the number of tree topologies grows too quickly to be processed in a reasonable time.
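The post-order computation of the tree likelihood can be sketched for a single alignment column. The example below assumes the Jukes–Cantor model with rate parameter `alpha`, a uniform distribution of nucleotides at the root, and a hypothetical tree encoding: a leaf is a one-letter string, and an internal node is a tuple `(left, t_left, right, t_right)` holding the two subtrees and the lengths of the edges leading to them.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability that base a is replaced by base b over time t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional_likelihoods(node, alpha=1.0):
    """Post-order pass: P(leaves below node | node carries base a), for each a."""
    if isinstance(node, str):
        # Leaf: the observed base has probability 1, all others 0.
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    pl = conditional_likelihoods(left, alpha)
    pr = conditional_likelihoods(right, alpha)
    return {
        a: sum(jc_prob(a, b, t_left, alpha) * pl[b] for b in BASES)
        * sum(jc_prob(a, c, t_right, alpha) * pr[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(tree, alpha=1.0):
    """Likelihood of one alignment column, with a uniform root distribution."""
    root = conditional_likelihoods(tree, alpha)
    return sum(0.25 * root[a] for a in BASES)
```

For independent sites, the likelihood of a full alignment would be the product of the per-column values (in practice, the sum of their logarithms). Each column costs time linear in the number of leaves, so the expensive part of tree reconstruction is the search over topologies rather than this evaluation.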
Therefore, finding the optimal tree among all possible trees for a rather large number of sequences (leaves) is one of the major challenges. The mainstream approach to managing such a problem is sampling from the posterior distribution on the space of trees.
The tree HMM concept described in BSA could be used for phylogenetic tree construction with the most general models of sequence evolution.
There are several known results asserting that undirected graphs can be partitioned in a way that satisfies various constraints imposed on the degrees. The corresponding results for directed graphs, where degrees are replaced by outdegrees, often fail, and when they do hold, they are usually much harder, and lead to fascinating open problems. In this note we list three problems of this type, and mention the undirected analogues. All graphs and digraphs considered here are simple, that is, they have no loops and no multiple edges.
For a fixed graph $H$, we define the rainbow Turán number $\ex^*(n,H)$ to be the maximum number of edges in a graph on $n$ vertices that has a proper edge-colouring with no rainbow $H$. Recall that the (ordinary) Turán number $\ex(n,H)$ is the maximum number of edges in a graph on $n$ vertices that does not contain a copy of $H$. For any non-bipartite $H$ we show that $\ex^*(n,H)=(1+o(1))\ex(n,H)$, and if $H$ is colour-critical we show that $\ex^{*}(n,H)=\ex(n,H)$. When $H$ is the complete bipartite graph $K_{s,t}$ with $s \leq t$ we show $\ex^*(n,K_{s,t}) = O(n^{2-1/s})$, which matches the known bounds for $\ex(n,K_{s,t})$ up to a constant. We also study the rainbow Turán problem for even cycles, and in particular prove the bound $\ex^*(n,C_6) = O(n^{4/3})$, which is of the correct order of magnitude.
The one-dimensional string is too simple a model to reflect fully the properties of a real biological molecule, which have, after all, been determined by its three dimensional structure selected in the course of evolution. Physical interactions of amino acids and nucleotides in the three-dimensional folds have to be described by the models that would go beyond the short range correlations which are the typical targets of the Markov chain models. The long range correlations are more important for proteins than for DNA, which has a rather uniform double helix structure. However, the structure of another nucleic acid, RNA, commonly has a significant number of long range interactions of special type, which could be a target for yet another class of probabilistic models.
Chapter 9 introduces the Chomsky hierarchy of deterministic transformational grammars, the models developed originally for natural languages and then applied to computer languages. These grammars could be readily used for the description of a protein (a regular grammar could generate amino acid sequences described as the PROSITE patterns) and RNA (a context-free grammar could generate RNA sequences with a given secondary structure).
Further generalization of these deterministic grammar classes to stochastic ones increases opportunities for sequence modeling. Stochastic regular grammars could be shown to be equivalent to hidden Markov models. Stochastic context-free grammars (SCFGs) are useful for modeling RNA sequences.
Let $P(G,t)$ and $F(G,t)$ denote the chromatic and flow polynomials of a graph $G$. G. D. Birkhoff and D. C. Lewis showed that, if $G$ is a plane near-triangulation, then the only zeros of $P(G,t)$ in $(-\infty,2]$ are 0, 1 and 2. We will extend their theorem by showing that a strengthening of the dual statement holds for both planar and non-planar graphs: if $G$ is a bridgeless graph with at most one vertex of degree other than three, then the only zeros of $F(G,t)$ in $(-\infty,\alpha]$ are 1 and 2, where $\alpha\approx 2.225\cdots$ is the real zero in $(2,3)$ of the polynomial $t^4-8t^3+22t^2-28t+17$. In addition we construct a sequence of ‘near-cubic’ graphs whose flow polynomials have zeros converging to $\alpha$ from above.