The semantics of the Prolog ‘cut’ construct is explored in the context of some desirable properties of logic programming systems, referred to as the witness properties. The witness properties concern the operational consistency of responses to queries. A generalization of Prolog with negation as failure and cut is described, and shown not to have the witness properties. A restriction of the system is then described, which preserves the choice and first-solution behaviour of cut but allows the system to have the witness properties. The notion of cut in the restricted system is more restricted than the Prolog hard cut, but retains the useful first-solution behaviour of hard cut, not retained by other proposed cuts such as the ‘soft cut’. It is argued that the restricted system achieves a good compromise between the power and utility of the Prolog cut and the need for internal consistency in logic programming systems. The restricted system is given an abstract semantics, which depends on the witness properties; this semantics suggests that the restricted system has a deeper connection to logic than simply permitting some computations which are logical. Parts of this paper appeared previously in a different form in the Proceedings of the 1995 International Logic Programming Symposium (Andrews, 1995).
Boolean functions can be used to express the groundness of, and trace grounding dependencies between, program variables in (constraint) logic programs. In this paper, a variety of issues pertaining to the efficient Prolog implementation of groundness analysis are investigated, focusing on the domain of definite Boolean functions, Def. The systematic design of the representation of an abstract domain is discussed in relation to its impact on the algorithmic complexity of the domain operations; the most frequently called operations should be the most lightweight. This methodology is applied to Def, resulting in a new representation, together with new algorithms for its domain operations utilising previously unexploited properties of Def – for instance, quadratic-time entailment checking. The iteration strategy driving the analysis is also discussed and a simple, but very effective, optimisation of induced magic is described. The analysis can be implemented straightforwardly in Prolog and the use of a non-ground representation results in an efficient, scalable tool which does not require widening to be invoked, even on the largest benchmarks. An extensive experimental evaluation is given.
In the previous chapter, we concentrated on the flow of information from DNA to RNA to protein sequence and emphasized RNA's role as a “messenger” carrying copies of DNA's protein recipes to the ribosome for production. In this function, RNA's role is similar to the paper in a photocopy machine: though perhaps not always flat, it must be flattened out and dealt with in a linear fashion when used. RNA also plays catalytic roles in which it more resembles the paper used in origami: the shape into which it is folded is what matters most.
Messenger and Catalytic RNA
RNA is sufficiently versatile as a catalyst that some scientists postulate that the origins of life lie in an “RNA world” preceding both DNA and proteins. Other scientists have invented new RNA catalysts in the laboratory by experimentation with random RNA sequences. Important categories of catalytic RNAs in higher organisms include the following.
transfer RNA (tRNA). These RNAs, typically around 85 bases long, are crucial in the translation process. tRNAs assume a shape approximating the letter “L”. In the presence of a ribosome, the three-base anticodon at the top of the L is able to bind to a single codon in the messenger RNA. The shape of the base of the L allows it to bind to a single molecule of a particular one of the 20 amino acids. During translation, tRNAs collect their respective amino acids from the cytoplasm with the assistance of enzymes called aminoacyl-tRNA synthetases. When a tRNA approaches a ribosome poised over the appropriate codon on the mRNA, the tRNA binds to the codon and the ribosome. Next, the amino acid is released from the tRNA and attaches to the growing protein. Finally, the tRNA is released from the mRNA and departs in search of another amino acid molecule.
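The codon–anticodon binding described above can be illustrated with a minimal sketch. It assumes strict Watson-Crick pairing (A with U, C with G) and ignores the looser "wobble" pairing that real tRNAs can exhibit at the third codon position; the function name `anticodon_for` is hypothetical.

```python
# Strict Watson-Crick base pairing for RNA (an assumption; wobble pairing is ignored).
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def anticodon_for(codon):
    """Return the anticodon, written 5' to 3', that pairs with a codon written 5' to 3'."""
    # The two strands pair antiparallel, so the codon is read in reverse.
    return "".join(RNA_COMPLEMENT[base] for base in reversed(codon))

print(anticodon_for("AUG"))  # start codon
```

Because the strands run antiparallel, the anticodon is the reverse complement of the codon rather than a position-by-position complement.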
In Chapters 3 and 5 we saw how to develop an alignment scoring matrix and, given such a matrix, how to find the alignment of two strings with the highest score. In Chapter 6, we learned about some of the large genomic databases available for reference. Perhaps the most commonly performed bioinformatic task is to search a large protein sequence database for entries whose similarity scores may indicate homology with some query sequence, often a newly sequenced protein or putative protein.
The Needleman–Wunsch algorithm described in Chapter 3 constructs global alignments, that is, alignments of the entireties of its two input sequences. In practice, homologous proteins are often not similar over their entire lengths, because different segments of the sequence differ in their importance to the function of the protein. A typical protein has one or more active sites that play crucial roles in the chemical reactions it catalyzes. Mutations in the midst of an active site are rarely accepted, since such mutations are likely to disrupt the protein's function. The segments intervening between the active sites help give the protein its characteristic shape but do not form strong bonds with other molecules as the protein performs its function. Mutations in these regions are more easily tolerated and thus are more common. Hemoglobin provides a good example; it easily tolerates mutations on its outer surface, but mutations affecting the active sites in its interior can destroy its ability to hold the iron-binding heme group essential to its role as oxygen carrier.
The urge to record or to reconstruct “family trees” seems to be a strong one in many different areas of human activity. Animal breeders have an obvious interest in pedigrees, linguists have grouped human languages into families descended from a common (and in many cases unattested) ancestor, and, when several manuscripts of the same text have been recovered, biblical scholars have tried to piece together which ones served as sources for which others. Even before the advent of the theory of evolution, naturalists attempted to discern the “Divine Plan” by assigning each known organism to its correct place in a system of nesting categories known as a taxonomy.
More modern biologists have tried to reconstruct the course of evolution by building trees reflecting similarities and differences in relevant features or characters of various species. Whereas early work relied upon morphological features such as shape of leaf or fruit for classification, most recent efforts focus on less subjective molecular features. In this chapter, we develop a program for constructing the evolutionary tree – or phylogeny – that best accounts for the differences observed in a multiple sequence alignment.
Parsimonious Phylogenies
Broadly speaking, there are two approaches to reconstructing the phylogeny of a group of species. One approach first reduces the similarities and differences among the n species to n(n – 1)/2 numerical scores; it then finds the phylogeny that optimizes a certain mathematical function of those scores.
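The reduction to n(n – 1)/2 pairwise scores can be sketched as follows. The score used here, the number of mismatched columns between two rows of an alignment, is only an assumed stand-in for whatever measure a given method actually uses, and the species names and sequences are invented for illustration.

```python
from itertools import combinations

def pairwise_mismatches(aligned):
    """Return {(name_a, name_b): mismatch count} for every unordered pair of species."""
    scores = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(aligned.items()), 2):
        scores[(name_a, name_b)] = sum(x != y for x, y in zip(seq_a, seq_b))
    return scores

# Hypothetical aligned sequences of equal length, for illustration only.
example = {"human": "GATTACA", "chimp": "GATTACA", "mouse": "GACTACA", "frog": "GACTTCA"}
scores = pairwise_mismatches(example)

n = len(example)
print(len(scores), n * (n - 1) // 2)  # both count the unordered pairs
for pair, score in scores.items():
    print(pair, score)
```

With four species the dictionary holds six entries, one per unordered pair, which is the input a score-based tree-building method would then optimize over.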
In Chapter 3 we considered the problem of aligning DNA sequences that had been read from the same source but with errors introduced by laboratory procedures. Rather arbitrarily, we assigned a reward of +1 for a match and penalties of -1 and -2 for mismatches and gaps. In this chapter, we will examine how to align protein sequences that differ as a result of evolution itself rather than owing to experimental error. The outcome will be a method for constructing substitution matrices for scoring alignments. Potentially, these matrices can assign a different reward or penalty for each of the 210 possible unordered pairs of amino acids that may appear in a column of an alignment.
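The figure of 210 can be checked directly: there are 20 pairs of identical amino acids plus 20 × 19 / 2 = 190 pairs of distinct ones. The short sketch below enumerates them using the standard one-letter amino acid codes; the helper `pair_key` is a hypothetical convenience for looking up a symmetric score table.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

# Every unordered pair, including pairs of identical residues.
unordered_pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical_pairs = [pair for pair in unordered_pairs if pair[0] == pair[1]]

def pair_key(a, b):
    """Canonical key for an unordered pair, so that (a, b) and (b, a) share one table entry."""
    return (a, b) if a <= b else (b, a)

print(len(unordered_pairs), len(identical_pairs))
```

Storing scores under `pair_key` means a substitution matrix needs only one entry per unordered pair rather than a full 20 × 20 table.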
The function of a protein is determined by its shape and charge distribution, not by the exact sequence of amino acids. During DNA replication, various mutations can alter the protein produced by a gene. Some types of mutations are:
point mutations, in which the machinery of replication randomly substitutes an incorrect nucleotide for the correct one;
indels, or insertions and deletions, in which extra bases are randomly inserted or bases are omitted;
translocations, in which longer pieces of DNA – possibly including one or more entire genes – are moved from one part of the chromosome to another part, or to another chromosome; and
duplications, in which long pieces of DNA are copied and integrated into a chromosome.
Some historical and current trends in reflective practice (RP), artificial intelligence (AI), and engineering design (ED) are presented and compared. Human artistry, context, and connectionist approaches to knowledge are the common threads highlighted. ED is considered to be a type of RP and AI a part of RP. This is supported by an analysis of the transformation processes involved in each. AI and systems are presented as approaches for the formalization of RP at the technical and conceptual levels, respectively. Interconnectedness in a hierarchical fashion and purposeful process loops are defined as the key ingredients of a systems approach. AI techniques that could support a range of ED categories (case-based reasoning, decomposition, and transformation) are identified, as are the wider RP approaches that subsume those categories. The ED, AI, and RP categories are identified as spanning from routine to creative, connectionist to cognitivist, and intuitive to deliberate, respectively.
Once a family of homologous proteins has been identified, it is often useful to arrange their sequences in a multiple alignment such as the one in Figure 9.1.
A multiple alignment is useful for constructing a so-called consensus sequence, which – while probably differing from every individual sequence in the family – is nonetheless a better representative of the family than any of its actual members. Multiple alignments can also form the basis of more abstract statistical models of the protein family called profiles.
By examining which elements of the consensus are present in most or all family members and which exhibit a greater degree of variability, we can also find clues to the protein's function. Highly conserved regions are likely to have been conserved because they form active sites crucial to function, while more variable regions are more likely to have merely structural roles.
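As a rough illustration of these two ideas, the sketch below derives a consensus by simple majority vote in each column and reports how strongly each column is conserved. Majority voting is an assumption made here for simplicity; gap characters are treated like any other symbol, and the four short sequences are invented.

```python
from collections import Counter

def consensus_and_conservation(alignment):
    """For each column, return the most common residue and the fraction of rows carrying it."""
    consensus = []
    conservation = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(column))
    return "".join(consensus), conservation

# Hypothetical family of aligned sequences, for illustration only.
rows = ["MKVLG", "MKILA", "MRVLA", "TKVLA"]
consensus, conservation = consensus_and_conservation(rows)
print(consensus, conservation)
print(consensus in rows)
```

In this example the consensus differs from every row, and the one fully conserved column is the kind of position that might hint at a functionally important site.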
We have already seen in Chapter 3 that the number of ways in which a mere two sequences of only moderate length can be aligned is comparable to current estimates of the number of atoms in the observable universe. The addition of more sequences only increases the number of possibilities. We need both a criterion for evaluating multiple alignments and a computational strategy that will allow us to eliminate large sets of alignments at one stroke.
To describe our evaluation criterion, we will rely on the notion of projection of a multiple alignment.
In Chapter 7, we justified the use of fast alignment heuristics like BLAST by our need to quickly align large numbers of sequences in a database with a given query sequence to determine which were most similar to the query. In this chapter, we will consider statistical aspects of the set of alignment scores we might encounter when performing such a database search. The distribution of scores obviously depends on the substitution matrix employed (PAM30, BLOSUM62, PAM250, etc.), and a proof of the general result requires rather extensive use of sophisticated mathematical notation. We will avoid this by concentrating on a specific, simple scoring matrix for DNA before outlining the general result.
Like all but the most recent versions of BLAST, we will focus on gapless alignments. The statistical theory of scores for alignments with gaps has been worked out only approximately and only for special cases. What theory exists supports the empirical observation that scores of gapped alignments behave much like scores of gapless alignments.
BLAST Scores for Random DNA
Suppose that Q is a query sequence of DNA and that D is a sequence from a database. As usual for DNA, we will score +1 for matched bases and -1 for mismatched bases in alignments of D and Q. Suppose a BLAST search discovers a local gap-free alignment with a score of 13. Does this suggest that D and Q are in some way related, or could this be better explained by chance? To answer this question, we must define precisely what “by chance” means to us, and then compute – or at least estimate – the probability that a score of 13 or higher occurs under that definition.
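One way to make the question concrete is a small simulation. The sketch below assumes that "by chance" means both sequences are drawn independently with each base equally likely, which is only one possible definition; the function names, sequence lengths, and trial count are all illustrative choices. It computes the best gap-free local alignment score exactly, then estimates how often random sequences reach a threshold.

```python
import random

def best_gapless_score(q, d, match=1, mismatch=-1):
    """Highest score of any gap-free local alignment of q and d (0 if none is positive)."""
    best = 0
    # Each offset fixes one diagonal, i.e. one way of sliding q along d without gaps.
    for offset in range(-(len(q) - 1), len(d)):
        running = 0
        for i in range(len(q)):
            j = i + offset
            if 0 <= j < len(d):
                running += match if q[i] == d[j] else mismatch
                running = max(running, 0)  # a negative prefix never helps a local alignment
                best = max(best, running)
    return best

def estimate_tail_probability(len_q, len_d, threshold, trials=200, seed=0):
    """Monte Carlo estimate of P(best gapless score >= threshold) for uniform random DNA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q = "".join(rng.choice("ACGT") for _ in range(len_q))
        d = "".join(rng.choice("ACGT") for _ in range(len_d))
        if best_gapless_score(q, d) >= threshold:
            hits += 1
    return hits / trials

print(estimate_tail_probability(30, 100, 13))
```

A simulation like this gives only a noisy estimate and becomes impractical for very rare events, which is one reason an analytical treatment of the score distribution is worth having.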
Each of us has observed physical and other similarities among members of human families. While some of these similarities are due to the common environment these families share, others are inherited, that is, passed on from parent to child as part of the reproductive process. Traits such as eye color and blood type and certain diseases such as red–green color blindness and Huntington's disease are among those known to be heritable. In humans and all other nonviral organisms, heritable traits are encoded and passed on in the form of deoxyribonucleic acid, or DNA for short. The DNA encoding a single trait is often referred to as a gene. Most human DNA encodes not traits that distinguish one human from another but rather traits we have in common with all other members of the human family. Although I do not share my adopted children's beautiful brown eyes and black hair, we do share more than 99.9% of our DNA. Speaking less sentimentally, all three of us share perhaps 95% of our DNA with the chimpanzees.
DNA consists of long chains of molecules of the modified sugar deoxyribose, to which are joined the nucleotides adenine, cytosine, guanine, and thymine. The scientific significance of these names is minimal – guanine, for example, is named after the bird guano from which it was first isolated – and we will normally refer to these nucleotides or bases by the letters A, C, G, and T. For computational purposes, a strand of DNA can be represented by a string of As, Cs, Gs, and Ts.
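The string representation is easy to work with in most programming languages. As a minimal sketch (the helper names `is_dna` and `base_counts` are hypothetical), the following checks that a string uses only the four letters and tallies each base.

```python
DNA_ALPHABET = set("ACGT")

def is_dna(strand):
    """True if the string is non-empty and contains only the letters A, C, G, and T."""
    return bool(strand) and set(strand) <= DNA_ALPHABET

def base_counts(strand):
    """Count the occurrences of each of the four bases in a strand."""
    return {base: strand.count(base) for base in "ACGT"}

print(is_dna("GATTACA"), base_counts("GATTACA"))
print(is_dna("GATUACA"))  # U belongs to RNA, not DNA
```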
A framework for a design tool based on shape grammars is presented as an effective means for supporting the early stages of design. The framework uses a shape grammar interpreter to implement parametric shape grammars, allowing the grammar to be used interactively by a designer or optimization routine. A shape grammar to design inner hood panels of vehicles is introduced as an example of a parametric engineering shape grammar, and it is used with the framework to create standard and novel designs made possible by rules that take advantage of shape emergence.
We have already learned that the process of DNA replication is not perfect and that, in fact, this is a source of mutations both beneficial and deleterious in sequences encoding amino acid chains and regulating their production. Intergenic (“junk”) DNA is also subject to mutations, and since these mutations pose no impediment to survival, they are preserved at much greater rates than mutations to coding and regulatory sequences. In such DNA, it is common to find tandem repeats consisting of several contiguous repetitions of the same short sequence.
The repetitions themselves may vary slightly as a result of point mutations. Furthermore, individuals within a population often carry different numbers of repetitions as a result of deviations from normal DNA replication. For example, the feature known as HUMHPRTB, which consists of varying numbers of repetitions of AGAT, was found to exist in nine different forms in a group of 417 humans; 314 group members carried two different forms or alleles. The sites of such variation are collectively called VNTR (variable number of tandem repeat) loci. Since mutations at VNTR loci are so frequent – up to 1% per gamete per generation – and are inherited, they form the basis of the highly publicized “DNA fingerprinting” techniques used to resolve paternity disputes and to free the wrongfully incarcerated. VNTRs can also be powerful tools for reconstructing pedigrees and phylogenies.
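Counting the repetitions at such a locus can be sketched as below. This simple version assumes exact copies of the repeat unit, so it would undercount a run interrupted by the point mutations mentioned above; the function name and both example sequences are invented for illustration.

```python
def longest_tandem_run(sequence, unit):
    """Return (start, copies) for the longest run of contiguous exact copies of unit.

    Returns (-1, 0) if the unit does not occur at all.
    """
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    best_start, best_copies = -1, 0
    k = len(unit)
    for start in range(len(sequence) - k + 1):
        copies = 0
        # Extend the run one unit at a time for as long as the next copy is present.
        while sequence.startswith(unit, start + copies * k):
            copies += 1
        if copies > best_copies:
            best_start, best_copies = start, copies
    return best_start, best_copies

# Two hypothetical alleles that differ only in the number of AGAT repeats.
allele_1 = "CC" + "AGAT" * 3 + "GG"
allele_2 = "CC" + "AGAT" * 5 + "GG"
print(longest_tandem_run(allele_1, "AGAT"), longest_tandem_run(allele_2, "AGAT"))
```

Comparing the copy counts of two alleles in this way is the essence of using a VNTR locus as a marker.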
VNTR sequences, or satellites, are commonly subdivided into two categories, which originate by distinct processes.
One of the ways biologists begin to analyze a long sequence of DNA is to develop a restriction site map. Restriction sites are the locations at which the sequence is cut by enzymes known as restriction enzymes – or, more precisely, restriction endonucleases. Restriction enzymes are found in bacteria, where they provide some protection against viral invasion by destroying viral DNA. Each bears a name such as EcoRI and HindIV, derived from the bacterium in which it was discovered. Each restriction enzyme can cut DNA at any location containing a specific short sequence. Examples of some commonly used restriction enzymes and the sequences they cut are given in Figure 16.1. The sequence is typically a palindrome of even length. In most cases, the enzyme cuts the double strand unevenly so that a small group of unpaired nucleotides – a sticky end – remains on both sides of the cut. In nature, this feature may facilitate further degradation of the invading DNA by other enzymes; in the laboratory, it is often used to “cut and paste” new sequences into DNA at known locations.
A restriction map of a sequence is simply a list of the locations at which one or more restriction enzymes are known to cut the sequence. Such a map can be used to pinpoint the origin of a gene or other subsequence within a larger sequence.
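When the sequence itself is already known, such a map can be computed by direct search. The sketch below uses the recognition sequences of EcoRI (GAATTC) and BamHI (GGATCC); the function name and the example sequence are hypothetical, and positions are reported as the 0-based start of the recognition site rather than the exact point of the cut within it.

```python
def restriction_map(sequence, enzymes):
    """Return a sorted list of (site_start, enzyme_name) for every recognition site found."""
    sites = []
    for name, site in enzymes.items():
        start = sequence.find(site)
        while start != -1:
            sites.append((start, name))
            start = sequence.find(site, start + 1)  # step by 1 so overlapping sites are found
    return sorted(sites)

ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

# Hypothetical sequence, for illustration only.
sequence = "TTGAATTCAAGGATCCTTGAATTCGG"
print(restriction_map(sequence, ENZYMES))
```

In the laboratory the sequence is usually not known in advance, so the map has to be inferred from experiments rather than read off in this way.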
Restriction maps are generally constructed by performing digestion experiments: The sequence to be mapped is replicated by PCR and then exposed to one or more restriction enzymes, together and/or separately.
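To see what such an experiment reveals, it helps to run it in reverse. The sketch below assumes a linear molecule, and assumes that a digestion reports only the lengths of the resulting fragments, not their order; given known cut positions, it produces the fragment lengths that a single or combined digestion would yield. The function name and the cut positions are hypothetical.

```python
def fragment_lengths(sequence_length, cut_positions):
    """Sorted lengths of the fragments left after cutting a linear molecule at the given positions."""
    # Ignore duplicate cuts and cuts that fall outside the interior of the molecule.
    cuts = sorted({c for c in cut_positions if 0 < c < sequence_length})
    bounds = [0] + cuts + [sequence_length]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))

# Hypothetical cut positions for two enzymes on a molecule of length 26.
enzyme_a_cuts = [3, 19]
enzyme_b_cuts = [11]
print(fragment_lengths(26, enzyme_a_cuts))                   # enzyme A alone
print(fragment_lengths(26, enzyme_b_cuts))                   # enzyme B alone
print(fragment_lengths(26, enzyme_a_cuts + enzyme_b_cuts))   # both enzymes together
```

The mapping problem is the inverse: recover the cut positions from fragment-length lists like these, which is harder because the lengths carry no information about order.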