This appendix describes a simple yet very efficient Perl solution to a problem known as the disjoint sets problem, the dynamic equivalence relation problem, or the union-find problem. This problem appears in applications with the following scenario.
Each one of a finite set of keys is assigned to exactly one of a number of classes. These classes are the “disjoint sets”, or the equivalence classes of an equivalence relation. Often, the set of keys is known in advance, but this is not necessary to use our Perl package.
Initially, each key is in a class by itself.
As the application progresses, classes are joined together to form larger classes; classes are never divided into smaller classes. (The operation of joining classes together is called union or merge.)
At any moment, it must be possible to determine whether two keys are in the same class or in different classes.
To solve this problem, we create a package named UnionFind. Objects of type UnionFind represent an entire collection of keys and classes. The three methods of this package are:
$uf = UnionFind->new(), which creates a new collection of disjoint sets, each of which has only one element;
$uf->inSameSet($key1,$key2), which returns true if its two arguments are elements of the same disjoint set or false if not;
$uf->union($key1,$key2), which combines the sets to which its two arguments belong into a single set.
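To make this interface concrete, here is a minimal sketch of such a package. The method names come from the description above, but every internal detail (a hash of parent pointers, union by size, and path compression) is one plausible implementation chosen for illustration, not necessarily the book's own code.

package UnionFind;
use strict;
use warnings;

sub new {
    my ($class) = @_;
    # each key maps to its parent; a key absent from the hash is in a class by itself
    return bless { parent => {}, size => {} }, $class;
}

sub _find {                      # representative (root) of $key's class
    my ($self, $key) = @_;
    my $p = $self->{parent};
    $p->{$key} = $key unless defined $p->{$key};
    $p->{$key} = $self->_find($p->{$key}) if $p->{$key} ne $key;   # path compression
    return $p->{$key};
}

sub union {
    my ($self, $key1, $key2) = @_;
    my ($r1, $r2) = ($self->_find($key1), $self->_find($key2));
    return if $r1 eq $r2;        # already in the same class
    my $s = $self->{size};
    $s->{$r1} ||= 1;
    $s->{$r2} ||= 1;
    ($r1, $r2) = ($r2, $r1) if $s->{$r1} < $s->{$r2};              # union by size
    $self->{parent}{$r2} = $r1;
    $s->{$r1} += $s->{$r2};
}

sub inSameSet {
    my ($self, $key1, $key2) = @_;
    return $self->_find($key1) eq $self->_find($key2);
}

package main;
my $uf = UnionFind->new;
$uf->union('cat', 'dog');
print $uf->inSameSet('cat', 'dog') ? "same\n" : "different\n";   # same
print $uf->inSameSet('cat', 'eel') ? "same\n" : "different\n";   # different

With this sketch, a key that has never been mentioned is treated as a class of its own, so the collection of keys need not be known in advance.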
This book is designed to be a concrete, digestible introduction to the area that has come to be known as “bioinformatics” or “computational molecular biology”. My own teaching in this area has been directed toward a mixture of graduate and advanced undergraduate students in computer science and graduate students from the biological sciences, including biomathematics, genetics, forestry, and entomology. Although a number of books on this subject have appeared in the recent past – and one or two are quite well written – I have found none to be especially suitable for the widely varying backgrounds of this audience.
My experience with this audience has led me to conclude that its needs can be met effectively by a book with the following features.
To meet the needs of computer scientists, the book must teach basic aspects of the structure of DNA, RNA, and proteins, and it must also explain the salient features of the laboratory procedures that give rise to the sorts of data processed by the algorithms selected for the book.
To meet the needs of biologists, the book must (to some degree) teach programming and include working programs rather than abstract, high-level descriptions of algorithms – yet computer scientists must not become bored with material more appropriate for a basic course in computer programming.
Justice to the field demands that its statistical aspects be addressed, but the background of the audience demands that these aspects be addressed in a concrete and relatively elementary fashion.
Engineering conceptual design can be defined as that phase of the product development process during which the designer takes a specification for a product to be designed and generates many broad solutions to it. This paper presents a constraint-based approach to supporting interactive conceptual design. The approach is based on an expressive and general technique for modeling: the design knowledge that a designer can exploit during a design project; the life-cycle environment that the final product faces; the design specification that defines the set of requirements the product must satisfy; and the structure of the various schemes that are developed by the designer. A computational reasoning environment based on constraint filtering is proposed as the basis of an interactive design support tool. Using such a tool, human designers can be assisted in interactively developing and evaluating a set of schemes that satisfy the various constraints imposed on the design.
Once DNA fragments have been sequenced and assembled, the results must be properly identified and labeled for storage so that their origins will not be the subject of confusion later. As the sequences are studied further, annotations of various sorts will be added. Eventually, it will be appropriate to make the sequences available to a larger audience. Generally speaking, a sequence will begin in a database available only to workers in a particular lab, then move into a database used primarily by workers on a particular organism, then finally arrive in a large publicly accessible database. By far, the most important public database of biological sequences is one maintained jointly by three organizations:
the National Center for Biotechnology Information (NCBI), a constituent of the U.S. National Institutes of Health;
the European Molecular Biology Laboratory (EMBL); and
the DNA Databank of Japan (DDBJ).
At present, the three organizations distribute the same information; many other organizations maintain copies at internal sites, either for faster search or for avoidance of legal “disclosure” of potentially patent-worthy sequence data by submission for search on publicly accessible servers. Although their overall formats are not identical, these organizations have collaborated since 1987 on a common annotation format. We will focus on NCBI's GenBank database as a representative of the group.
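As a small illustration of the flat-file annotation format (not code from the text), the sketch below pulls a few standard header fields and the sequence out of a single GenBank record. The field names LOCUS, DEFINITION, ACCESSION, and ORIGIN are standard, but real records contain many more fields, continuation lines, and feature tables than this handles.

use strict;
use warnings;

my %entry = (locus => '', definition => '', accession => '', sequence => '');
my $in_seq = 0;
while (my $line = <>) {
    if    ($line =~ /^LOCUS\s+(\S+)/)     { $entry{locus}      = $1 }
    elsif ($line =~ /^DEFINITION\s+(.*)/) { $entry{definition} = $1 }
    elsif ($line =~ /^ACCESSION\s+(\S+)/) { $entry{accession}  = $1 }
    elsif ($line =~ /^ORIGIN/)            { $in_seq = 1 }
    elsif ($line =~ m{^//})               { $in_seq = 0 }
    elsif ($in_seq) {
        (my $bases = $line) =~ s/[\s\d]//g;   # strip coordinates and spaces
        $entry{sequence} .= uc $bases;
    }
}
print "$entry{accession}: $entry{definition}\n";
print "sequence length: ", length $entry{sequence}, "\n";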
Product development involves multiple phases. Design review (DR) is an essential activity formally conducted to ensure a smooth transition from one phase to another. Such a formal DR is usually a multicriteria decision problem, involving multiple disciplines. This paper proposes a systematic framework for DR using fuzzy set theory. This fuzzy approach to DR is considered particularly relevant for several reasons. First, information available at early design phases is often incomplete and imprecise. Second, the relationships between the product design parameters and the review criteria cannot usually be exactly expressed by mathematical functions due to the enormous complexity. Third, DR is frequently carried out using subjective expert judgments with some degree of uncertainty. The DR is defined as the reverse mapping between the design parameter domain and design requirement (review criterion) domain, as compared with Suh's theory of axiomatic design. Fuzzy sets are extensively introduced in the definitions of the domains and the mapping process to deal with imprecision, uncertainty, and incompleteness. A simple case study is used to demonstrate the resulting fuzzy set theory of axiomatic DR.
Once a new segment of DNA is sequenced and assembled, researchers are usually most interested in knowing what proteins, if any, it encodes. We have learned that much DNA does not encode proteins: some encodes catalytic RNAs, some regulates the rate of production of proteins by varying the ease with which transcriptases or ribosomes bind to coding sequences, and much has no known function. If study of proteins is the goal, how can their sequences be extracted from the DNA? This question is the main focus of gene finding or gene prediction.
One approach is to look for open reading frames (ORFs). An open reading frame is simply a sequence of codons beginning with ATG for methionine and ending with one of the stop codons TAA, TGA, or TAG. To gain confidence that an ORF really encodes a gene, we can translate it and search for homologous proteins in a protein database. However, there are several difficulties with this method.
It is ineffective in eukaryotic DNA, in which coding sequences for a single gene are interrupted by introns.
It is ineffective when the coding sequence extends beyond either end of the available sequence.
Random DNA contains many short ORFs that don't code for proteins. This is because one of every 64 random codons codes for M and three of every 64 are stop codons.
The proteins it detects will probably not be that interesting since they will be very similar to proteins with known functions.
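As a concrete illustration of the naive ORF scan, the following sketch finds non-overlapping ATG-to-stop stretches in one strand of DNA. It is a simplification for illustration: nested ORFs, the other two reading frames of this strand, and the reverse strand are all ignored.

use strict;
use warnings;

# Report ATG...stop stretches read in frame from each ATG.
sub find_orfs {
    my ($dna) = @_;
    my @orfs;
    while ($dna =~ /ATG(?:[ACGT]{3})*?(?:TAA|TGA|TAG)/g) {
        push @orfs, $&;
    }
    return @orfs;
}

print "$_\n" for find_orfs("CCATGTCGTATTAAGGATGAAATAGTT");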
The human genome comprises approximately three billion (3 × 10⁹) base pairs distributed among 23 pairs of chromosomes. The Human Genome Project commenced in the 1990s with the primary goal of determining the sequence of this DNA. The task was finished ahead of schedule, a few months before this book was completed.
Scientists have been anxious to remind the public that the completion of the Human Genome Project's massive DNA sequencing effort will hardly mark the end of the Project itself; we can expect analysis, interpretation, and medical application of the human DNA sequence data to provide opportunities for human intellectual endeavor for the foreseeable future.
It is equally true, though less well understood, that this milestone in the Human Genome Project will not mark the end of massive sequencing. Homo sapiens is just one species of interest to Homo economicus, and major advances in agricultural productivity will result from ongoing and new sequencing projects for rice, corn, pine, and other crops and their pests. Because of the large differences in size, shape, and personality of different breeds, the Dog Genome Project promises to bring many insights into the relative influences of nature and nurture. Even recreational sequencing by amateur plant breeders may not lie far in the future.
Suppose we are given a strand of DNA and asked to determine whether it comes from corn (Zea mays) or from fruit flies (Drosophila melanogaster). One very simple way to attack this problem is to analyze the relative frequencies of the nucleotides in the strand. Even before the double-helix structure of DNA was determined, researchers had observed that, while the numbers of Gs and Cs in a DNA were roughly equal (and likewise for As and Ts), the relative numbers of G + C and A + T differed from species to species. This relationship is usually expressed as percent GC, and species are said to be GC-rich or GC-poor. Corn is slightly GC-poor, with 49% GC. Fruit fly is GC-rich, with 55% GC.
We examine the first ten bases of our DNA and see: GATGTCGTAT. Is this DNA from corn or fruit fly?
First of all, it should be clear that we cannot get a definitive answer to the question by observing bases, especially just a few. Corn's protein-coding sequences are distinctly GC-rich, while its noncoding DNA is GC-poor. Such variations in GC content within a single genome are sometimes exploited to find the starting point of genes in the genome (near so-called CpG islands). In the absence of additional information, the best we can hope for is to learn whether it's “more likely” that we have corn or fly DNA, and how much more likely.
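One way to quantify “more likely” is a log-odds score under two independent-base models that differ only in GC content. The sketch below uses the 49% and 55% figures given above and assumes, purely for illustration, that G and C (and likewise A and T) are equally frequent within each genome.

use strict;
use warnings;

my %corn = (G => 0.245, C => 0.245, A => 0.255, T => 0.255);   # 49% GC
my %fly  = (G => 0.275, C => 0.275, A => 0.225, T => 0.225);   # 55% GC

sub log_likelihood {
    my ($dna, $freq) = @_;
    my $ll = 0;
    $ll += log $freq->{$_} for split //, $dna;
    return $ll;
}

my $read = "GATGTCGTAT";
my $lod  = log_likelihood($read, \%corn) - log_likelihood($read, \%fly);
printf "log-odds (corn vs. fly) = %.3f; corn is %.2f times more likely\n",
       $lod, exp($lod);

For these ten bases the score comes out slightly positive, so under these crude models the read is a bit more likely to be corn than fruit fly, though far from decisively so.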
In Chapter 3, we introduced the sequence assembly problem and discussed how to determine whether two fragments of a DNA sequence “overlap” when each may contain errors. In this chapter, we will consider how to use overlap information for a set of fragments to reassemble the original sequence. Of course, there is an infinite number of sequences that could give rise to a particular set of fragments. Our goal is to produce a reconstruction of the unknown source sequence that is reasonably likely to approximate the actual source well.
In Section 13.1 we will consider a model known as the shortest common superstring problem. This simple model assumes that our fragments contain no errors and that a short source sequence is more likely than a longer one. If the input set reflects the source exactly, then only exact overlaps between fragments are relevant to reconstruction. This approach's focus on exact matching allows us to use once more that extremely versatile data structure from Chapter 12, the suffix tree.
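To give a feel for the superstring model, a greedy strategy repeatedly merges the pair of fragments with the largest exact overlap. This is only a rough sketch of the model's flavor: the chapter's actual method relies on suffix trees rather than the quadratic overlap scan shown here.

use strict;
use warnings;

sub overlap {                 # longest suffix of $s that is a prefix of $t
    my ($s, $t) = @_;
    my $max = length($s) < length($t) ? length($s) : length($t);
    for my $k (reverse 1 .. $max) {
        return $k if substr($s, -$k) eq substr($t, 0, $k);
    }
    return 0;
}

sub greedy_superstring {
    my @frags = @_;
    while (@frags > 1) {
        my ($bi, $bj, $best) = (0, 1, -1);
        for my $i (0 .. $#frags) {
            for my $j (0 .. $#frags) {
                next if $i == $j;
                my $k = overlap($frags[$i], $frags[$j]);
                ($bi, $bj, $best) = ($i, $j, $k) if $k > $best;
            }
        }
        my $merged = $frags[$bi] . substr($frags[$bj], $best);
        @frags = map { $frags[$_] } grep { $_ != $bi && $_ != $bj } 0 .. $#frags;
        push @frags, $merged;
    }
    return $frags[0];
}

print greedy_superstring("GTACGT", "ACGTAC", "CGTACG"), "\n";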
Beginning in Section 13.2, we consider another solution, a simplified version of the commonly used PHRAP program. PHRAP assumes that errors are present in the sequence files and furthermore that each base of each sequence is annotated with a quality value describing the level of confidence the sequencing machinery had in its determination of the base. It will not be possible to present all of the issues and options dealt with by PHRAP, but in Sections 13.3–13.7 we will present several important aspects of its approach in a Perl “mini-PHRAP”.
In the previous chapter we saw that finding the very best multiple alignment of a large number of sequences is difficult for two different reasons. First, although the direct method is general, it seems to require a different program text for each different number of sequences. Second, if there are K sequences of roughly equal length, then the number of entries in the dynamic programming table increases as fast as the Kth power of the length, and the time required to fill each entry increases as the Kth power of 2.
In this chapter, we will address both of these difficulties to some degree. It is, in fact, possible to create a single program text that works for any number of strings. And, although some inputs seem to require that nearly all of the table entries be filled in, careful analysis of the inputs supplied on a particular run will often help us to avoid filling large sections of the table that have no influence on the final alignment. Part of this analysis can be performed by the heuristic methods described in the previous chapter, since better approximate alignments help our method search more quickly for the best alignment.
Pushing through the Matrix by Layers
As a comfortable context for learning two of the techniques used in our program, we will begin by modifying Chapter 3's subroutine similarity for computing the similarity of two sequences.
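A minimal version of such a two-sequence similarity computation looks roughly like the sketch below. The scoring scheme of +1 for a match, -1 for a mismatch, and -2 per gap is an assumption made here for illustration, not necessarily the scheme used by Chapter 3's subroutine.

use strict;
use warnings;

sub similarity {
    my ($s, $t) = @_;
    my ($m, $n) = (length $s, length $t);
    my @M;                                     # dynamic programming table
    $M[$_][0] = -2 * $_ for 0 .. $m;           # gaps down the first column
    $M[0][$_] = -2 * $_ for 0 .. $n;           # gaps along the first row
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $p = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 1 : -1;
            my $best = $M[$i - 1][$j - 1] + $p;
            $best = $M[$i - 1][$j] - 2 if $M[$i - 1][$j] - 2 > $best;
            $best = $M[$i][$j - 1] - 2 if $M[$i][$j - 1] - 2 > $best;
            $M[$i][$j] = $best;
        }
    }
    return $M[$m][$n];
}

print similarity("GATTACA", "GACTATA"), "\n";

For two sequences the table has (m+1)(n+1) entries; with K sequences the corresponding table grows as the Kth power of the sequence length, which is exactly the blow-up this chapter sets out to tame.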
This study aims to compare various computerized bilingual dictionaries (henceforth CBDs) for their relative effectiveness in helping Japanese college students at several language proficiency levels to access new English target vocabulary. Its rationale was based on several observations and research claims (see Atkins & Knowles, 1990; Bejoint & Moulin, 1987; Laufer & Hadar, 1997) that bilingual and bilingualized dictionaries in general, as well as electronic dictionaries in particular, appear to be much more rapid and effective than monolingual book dictionaries for the acquisition of new L2 vocabulary by language learners. The author has been testing and analyzing various CBDs in four major categories for the past two years. These include (i) portable electronic dictionaries (PEDs); (ii) software CBDs; (iii) online dictionary websites; and (iv) optical character recognition/translation (OCR/OCT) devices, both portable handheld ‘Reading Pens’ (e.g. Quickionary/Quicklink) and flatbed OCR scanners (Logo Vista) bundled with translation programs. His research in this area, however, started over ten years ago, culminating in a dissertation entitled ‘Developing and testing vocabulary training methods and materials for Japanese college students studying English as a foreign language’ (Loucky, 1996; or summary thereof, Loucky, 1997). This dissertation studied the pre- and post-test vocabulary, comprehension, listening and total reading levels of over 1,000 Japanese college students at six institutions. Since then the author has devised a simple yet practical Vocabulary Knowledge Scale (VKS), helping to more clearly define and test the differences between passive or receptive understanding vocabulary and active or productive use vocabulary. Computerized technology has now made possible multimedia programming with the benefits of interactive processing and immediate feedback. Modern CAI/CAELL, along with well-made CBDs, either online or off, can already scan, pronounce and translate for us in any direction of the four language skills. This study examined Japanese college students’ use of four kinds of CBDs for more rapid accessing and archiving of new L2 terms, recommending integration of their use into a more systematic taxonomy of vocabulary learning strategies.
This paper summarizes the research described in a PhD thesis (Pujolà, 2000) which presents a description of how learners use the help facilities of a web-based multimedia CALL program, called ImPRESSions, designed to foster second language learners’ reading and listening skills and language learning strategies. The study investigates the variation of strategy use in a CALL environment: twenty-two Spanish adult students of English worked with the program in four sessions, and their on-screen actions were captured with digital video screen recording. Together with direct observation and retrospective questions, this yielded a detailed picture of learners’ deployment of strategies. As the emphasis was on the process rather than the product, the description and analysis of the data focus on the language learning strategies learners deployed when using the help facilities provided: Dictionary, Cultural Notes, Transcript, Subtitles and Play Controls, Feedback, and an Experts module specifically designed to provide the language learner training component of the program. The qualitative analysis of the data indicates that many variables influence the amount and quality of the use of the help provided by the program, from the learners’ individual differences to the fact that the CALL environment may prompt learners to behave or work differently than in a more conventional type of learning. The results of the study provide information for future CALL material design, and this type of research opens new possibilities for CALL research methods.
The objective of this study is to investigate how low-proficiency English for Specific Purposes (ESP) students benefit from a hypermedia-enhanced learning environment, specifically in terms of incidental acquisition of lexical items in the target language. Additionally, this study aims to find out how frequently the comprehension tools available from within the application were used by the learners. The hypermedia-assisted learning environment used in this study provides a rich environment in which learners gain exposure to foreign language texts by listening and reading in the target language. Learners are invited to explore Chemistry-related video segments through listening comprehension questions in combination with access to various comprehension tools, for example, L1 translations of questions and answers, the L2 video manuscript, translations of manuscript sentences, specific video segments containing answers to questions, and video control tools. A total of 40 subjects with a low-intermediate proficiency level were exposed to the treatment. An achievement test was administered before the treatment as a pre-test, and two weeks following the completion of the treatment as a post-test, to evaluate retention. The courseware was set up by the researcher to record every user action and compile these data into logs of user behaviour, which were subsequently submitted to analysis. The findings of this experimental study strongly suggest that hypermedia-based instruction can provide an effective learning environment for building vocabulary among ESP students. It can be claimed that access to a variety of comprehension tools results in deeper processing of the lexical items and can therefore favour retention.
There are many CALL resources available today, but often they need to be adapted for level or culture. CALL practitioners would like to reuse currently existing material rather than reinvent the wheel, but often this is not possible. Thus, they end up building CALL material, both language content and software, from scratch. This is inefficient in terms of CALL practitioners’ time and, as Felix (1999) points out, there is no point ‘doing badly’ what has already been done well. Why can’t we reuse what already exists? Often the language content is hard-wired in the software and cannot be modified, or the CALL material comes as an executable which is hard or impossible to change. Authoring tools can provide a degree of flexibility, but often focus on particular parts of the language learning process (e.g. interactive exercises), and the associated language content cannot, in general, easily be exported into other formats and presentation software. One solution to this problem is provided by XML technologies. They provide a strict separation between data and processing. Thus, three types of reuse are possible: reuse of the data processing engine (i.e. the XSL processing files), reuse of the language content structure (i.e. the XML data files) and reuse of the linguistic resources (i.e. the language content). In this paper, an example is given of a CALL template that has been developed using XML technologies. The template provides a structure into which the language content can be slotted and a processing engine to act upon the data to create CALL material. The template was developed for the production of CALL materials for Endangered Languages (ELs), but could be used for MCTLs (Most Commonly Taught Languages) and LCTLs (Less Commonly Taught Languages) as well. It has been used to develop a language learning course for Nawat (a language of El Salvador), Akan (a Ghanaian language) and Irish, demonstrating the reusability of the language content structure as well as that of the processing engine provided by the template. A further reusable feature is the ability to create courseware in different media (Internet, CD and print) from the same source language content.