Be nice to the whites, they need you to rediscover their humanity.
(Archbishop Desmond Tutu)
If we have access to the genome sequence of an individual, many questions may be posed regarding that genome. For instance, what genetic markers are present that are of predictive value in medical terms? What trait and disease alleles are present? What genetic properties make the individual sensitive or resistant to a certain drug treatment? We may also want to know about the relationship of the individual to other individuals and whether there are markers characteristic of a certain human population. These are all questions that may be addressed using bioinformatics tools. In the previous chapter we examined SNPs and used them to get an idea of the genetic differences between individuals in general. Here, we will again use SNP data to analyse genomes, but we will see how we may identify SNPs that are shared by a group of individuals. We will also illustrate how SNP data may be mapped to information regarding exons, thus identifying SNPs that are likely to be in coding regions. In this way we are able to learn about the consequences of different SNPs at the level of protein products. The genomes that we are to examine are those of a few South African individuals. These genomes also raise interesting questions regarding the early history of man.
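To give a flavour of what such an analysis can look like in practice, here is a minimal Perl sketch (not one of the scripts used later in the chapter) that flags SNPs falling within annotated exons. The file names snps.txt and exons.txt and their tab-delimited column layouts are assumptions made purely for illustration.

    #!/usr/bin/perl
    # Minimal sketch: flag SNPs that fall within annotated exons.
    # Assumes two tab-delimited files (hypothetical formats):
    #   snps.txt  : chromosome, position, snp_id
    #   exons.txt : chromosome, start, end, gene_id
    use strict;
    use warnings;

    # Read exon intervals, grouped by chromosome
    my %exons;
    open( my $efh, '<', 'exons.txt' ) or die "Cannot open exons.txt: $!";
    while (<$efh>) {
        chomp;
        my ( $chr, $start, $end, $gene ) = split /\t/;
        push @{ $exons{$chr} }, [ $start, $end, $gene ];
    }
    close $efh;

    # For each SNP, report any exon that contains it
    open( my $sfh, '<', 'snps.txt' ) or die "Cannot open snps.txt: $!";
    while (<$sfh>) {
        chomp;
        my ( $chr, $pos, $id ) = split /\t/;
        next unless exists $exons{$chr};
        foreach my $exon ( @{ $exons{$chr} } ) {
            my ( $start, $end, $gene ) = @$exon;
            if ( $pos >= $start && $pos <= $end ) {
                print "$id\t$chr\t$pos\tcoding candidate in $gene\n";
            }
        }
    }
    close $sfh;

A plain linear scan over the exon list is enough for modest data sets; for genome-scale data the exons would typically be sorted by position or placed in an indexed structure first.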
Although the material in this appendix is not used directly in the rest of the book, it has a profound importance for understanding the very essence of randomness and complexity, which are fundamental to probabilities and statistics. Algorithmic information or complexity theory is founded on the theory of recursive functions, whose roots go back to the logicians Gödel, Kleene, Church and, above all, Turing, who described a universal computer, the Turing machine, whose computing capability is no less than that of the latest supercomputers. The field of recursiveness has grown into an extensive branch of mathematics, of which we give just the very basics that are relevant to statistics. For a comprehensive treatment we refer the reader to [26].
The field is somewhat peculiar in that the derivations and proofs of the basic results can be carried out in two very different manners. First, since the partial recursive functions can be axiomatized, as is common in other branches of mathematics, the proofs are of the familiar kind. But since the same set is also defined in terms of a computer, the Turing machine, a proof in this second manner has to be a program. Now, the programs of that machine, written as binary strings, are very primitive, like the machine-language programs in modern computers, and they tend to be long and hard both to read and to understand. To shorten them and make them more comprehensible, an appeal is often made to intuition, and the details are left to the reader to fill in.
We currently see a vast amount of information being generated as a result of
experimental work in biomedicine. Particularly impressive is the development in
DNA sequencing. As a result, we are entering a new era of genomics in which many different species, as well as many individual humans, are being analysed. Many important biological questions are being addressed in such
genome-sequencing projects, including questions of medical relevance. A critical
technical part of all these projects is computational analysis. With the large
amount of sequence information generated, computational analysis is often a
bottleneck in the pipeline of a genomics project. Therefore, there is great
demand for individuals with the appropriate computational competence. Ideally,
such individuals should not only be proficient in the relevant mathematical and
computer scientific tools, but should also be able to fully understand the
different biological problems that are posed. This book was partly motivated by
the urgent need for bioinformatics competence due to recent developments in
genomics.
A student or scientist may enter into bioinformatics from different disciplines.
This book is written mainly for the biologist who wants to be introduced to
computational and programming tools. There are certainly books out there already
for that type of audience. However, I was attracted by the idea of assembling a
book that would cover a large number of relevant biological topics and, at the
same time, illustrate how these topics may be studied using relatively simple
programming tools. Therefore, an important principle of the book is that it will
attempt to convince the reader that relatively simple programming is sufficient
for many bioinformatics tasks and that you need not be a programming expert to
be effective. Another important principle of the book is that I wanted the
bioinformatics examples to be very practical and explicit. Thus, the reader
should be able to follow all the details in a procedure all the way from a
biological problem to the results obtained through a technical approach. As one
demonstration of this principle, all files and scripts mentioned in this book
are available for download at www.cambridge.org/samuelsson. This means
the reader is able to try it all out on his/her own computer. I also
wanted this book to illustrate the interdisciplinary nature of
bioinformatics.
This chapter will introduce molecular phylogeny, a science in which DNA, RNA or protein sequences are used to deduce relationships between organisms. Such relationships are typically shown in the form of a tree. The entities under study are often species, but phylogenetic methods may also be used to examine other types of evolutionary relationships, such as how individuals in a population are related or how different members of a protein family are related by orthology and paralogy. In the example in this chapter, a phylogenetic tree will show the relationship between different HIV isolates. But we will start out with a somewhat ghastly criminal story.
A fatal injection
This story takes place in Lafayette, Louisiana. Maria Jones, a 20-year-old married nurse, met a gastroenterologist named Robert White. Robert, aged 34, was also married and had three children. As Robert and Maria entered into a relationship, Maria divorced her husband. In return Robert promised to divorce his wife, but never followed through with it. Still, the relationship between Maria and Robert continued. Over five years Maria became pregnant three times, but each time Robert convinced her to have an abortion. Maria did give birth to a child; Robert was the father. Robert eventually became exceedingly jealous and controlling. When Maria saw other men, Robert would sometimes threaten to kill them. He also threatened Maria. After ten turbulent years, Maria finally decided to leave Robert in July 1994.
Now that we know how to estimate real-valued parameters optimally, the question arises of how confident we can be in the estimated result, which, being real-valued, would require infinite-precision numbers to state exactly. It is clear that if we repeat the estimation on a new set of data, generated by the same physical machinery, the result will not be the same. It seems that the model class is too rich for the amount of data we have. After all, if we fit Bernoulli models to a binary string of length n, there cannot be more than 2^n properties in the data that we can learn, even if no two strings have a common property. And yet the model class has a continuum of parameter values, each representing a property.
One way to balance the learnable information in the data against the richness of the model class is to restrict the precision of the estimated parameters. However, we should not pick an arbitrary fixed precision, for example each parameter quantized to two decimals, because that fails to take into account the fact that the sensitivity of the models to changes in the parameters itself depends on the parameters. The problem is related to statistical robustness, which, while perfectly meaningful, is based on practical considerations such as a model's sensitivity to outliers rather than on any reasonably comprehensive theory. If we stick to the model classes of interest in this book, the parameter precision amounts to interval estimation.
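As a rough illustration of why the precision should depend on the amount of data (standard reasoning, not taken from this appendix): for the Bernoulli class the maximum likelihood estimate of the parameter, based on a string of length n with m ones, is

\[
\hat{\theta} = \frac{m}{n}, \qquad
\operatorname{sd}(\hat{\theta}) = \sqrt{\frac{\theta(1-\theta)}{n}} = O\!\left(\frac{1}{\sqrt{n}}\right).
\]

Since the estimate itself fluctuates on the scale 1/sqrt(n), stating it with much finer precision conveys no learnable information, while a fixed precision such as two decimals is too coarse for large n and needlessly fine for small n; the natural interval width narrows as more data accumulate.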
This chapter describes approaches used for video processing, in particular, for face and gesture detection and recognition. The role of video processing, as described in this chapter, is to extract all the information necessary for higher-level algorithms from the raw video data. The target high-level algorithms include tasks such as video indexing, knowledge extraction, and human activity detection. The main focus of video processing in the context of meetings is to extract information about presence, location, motion, and activities of humans along with gaze and facial expressions to enable higher-level processing to understand the semantics of the meetings.
Object and face detection
The object and face detection methods used in this chapter include pre-processing through skin color detection, object detection through visual similarity using machine learning and classification, gaze detection, and facial expression detection.
Skin color detection
For skin color detection, color segmentation is usually used to detect pixels with a color similar to that of skin (Hradiš and Juranek, 2006). The segmentation is done in several steps. First, the image is converted from color into gray scale using a skin color model, so that each pixel value corresponds to a skin color likelihood. The gray-scale image is then binarized by thresholding. The binary image is filtered by a sequence of morphological operations to suppress noise. Finally, the connected components of the binary image can be labeled and processed in order to recognize the type of object.
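These steps can be sketched in a few lines of code. The following Perl fragment (Perl is the scripting language used for the programming examples in this collection; the 6 x 8 likelihood map, the threshold of 0.5 and the simple neighbour-count filter standing in for a full morphological opening are all illustrative assumptions) binarizes a skin-likelihood map, suppresses isolated noise pixels, and labels the remaining connected regions.

    #!/usr/bin/perl
    # Minimal sketch of the skin-segmentation steps described above,
    # operating on a small hypothetical skin-likelihood map (values in 0..1).
    # A real system would compute the likelihoods from a colour model first.
    use strict;
    use warnings;

    # Hypothetical 6x8 skin-likelihood image (rows of pixel likelihoods)
    my @likelihood = (
        [ 0.1, 0.1, 0.2, 0.1, 0.0, 0.1, 0.0, 0.1 ],
        [ 0.1, 0.8, 0.9, 0.7, 0.1, 0.0, 0.9, 0.1 ],
        [ 0.2, 0.9, 0.9, 0.8, 0.1, 0.1, 0.1, 0.0 ],
        [ 0.1, 0.8, 0.9, 0.9, 0.2, 0.1, 0.0, 0.1 ],
        [ 0.0, 0.1, 0.2, 0.1, 0.1, 0.0, 0.1, 0.0 ],
        [ 0.1, 0.0, 0.1, 0.0, 0.1, 0.1, 0.0, 0.1 ],
    );

    my $threshold = 0.5;
    my $rows = scalar @likelihood;
    my $cols = scalar @{ $likelihood[0] };

    # Step 1: binarize by thresholding
    my @binary;
    for my $r ( 0 .. $rows - 1 ) {
        for my $c ( 0 .. $cols - 1 ) {
            $binary[$r][$c] = $likelihood[$r][$c] >= $threshold ? 1 : 0;
        }
    }

    # Step 2: crude noise filter in place of a full morphological opening:
    # keep a foreground pixel only if it has at least one foreground neighbour
    my @filtered;
    for my $r ( 0 .. $rows - 1 ) {
        for my $c ( 0 .. $cols - 1 ) {
            my $neighbours = 0;
            for my $dr ( -1 .. 1 ) {
                for my $dc ( -1 .. 1 ) {
                    next if $dr == 0 && $dc == 0;
                    my ( $nr, $nc ) = ( $r + $dr, $c + $dc );
                    next if $nr < 0 || $nc < 0 || $nr >= $rows || $nc >= $cols;
                    $neighbours++ if $binary[$nr][$nc];
                }
            }
            $filtered[$r][$c] = ( $binary[$r][$c] && $neighbours >= 1 ) ? 1 : 0;
        }
    }

    # Step 3: label connected components (4-connectivity) by flood fill
    my @label;
    my $current = 0;
    for my $r ( 0 .. $rows - 1 ) {
        for my $c ( 0 .. $cols - 1 ) {
            next if !$filtered[$r][$c] || $label[$r][$c];
            $current++;
            my @queue = ( [ $r, $c ] );
            $label[$r][$c] = $current;
            while ( my $px = shift @queue ) {
                my ( $pr, $pc ) = @$px;
                for my $step ( [ -1, 0 ], [ 1, 0 ], [ 0, -1 ], [ 0, 1 ] ) {
                    my ( $nr, $nc ) = ( $pr + $step->[0], $pc + $step->[1] );
                    next if $nr < 0 || $nc < 0 || $nr >= $rows || $nc >= $cols;
                    next if !$filtered[$nr][$nc] || $label[$nr][$nc];
                    $label[$nr][$nc] = $current;
                    push @queue, [ $nr, $nc ];
                }
            }
        }
    }

    print "Found $current candidate skin region(s)\n";

In a real system the likelihood map would come from a trained skin colour model and the noise filtering would use proper erosion and dilation, but the overall flow is the same.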
[Jim Kent] embarked on a four-week programming marathon, icing his wrists at night to prevent them from seizing up as he churned out computer code by day. His deadline was June 26, when the completion of the rough draft was to be announced.
(DNA: The Secret of Life, describing Jim Kent's efforts in the human genome sequencing project in 2000; Watson and Berry, 2003)
It was a somewhat historic event when President Bill Clinton announced, on 26 June 2000, the completion of the first survey of the entire human genome. We were able for the first time to read all three billion letters of the human genetic make-up. This information was the ground-breaking result of the Human Genome Project. The success of this project relied on advanced technology, such as a number of experimental molecular biology methods. However, it also required a significant contribution from more theoretical disciplines such as computer science. Thus, in the final phase of the project, numerous pieces of information, like those of a giant jigsaw puzzle, needed to be appropriately combined. This step was critically dependent on programming efforts. Adding further tension to the programming exercises was the fact that a private company, Celera, was competing with the academic Human Genome Project. This competition was sometimes referred to as ‘the Genome War’ (Shreeve, 2004). While computationally talented people like Jim Kent ‘churned out computer code’, other gifted bioinformaticians, such as Gene Myers at Celera, worked on related jigsaw-puzzle problems. Ideally, scientists should not war against each other; nevertheless, one important conclusion emerged from these projects, in which a wealth of genetic information was generated: computing is an essential part of biological research.
It consists essentially in a spasmodic action of all the voluntary muscles of
the system, of involuntary and more or less irregular motions of the
extremities, face and trunk…. The first indications of its appearance
are spasmodic twitching of the extremities, generally of the fingers which
gradually extend and involve all the involuntary muscles. This derangement
of muscular action is by no means uniform; in some cases it exists to a
greater, in others to a lesser, extent, but in all cases gradually induces a
state of more or less perfect dementia.
(Charles Oscar Waters, 1841, describing what is now known as
Huntington's disease)
After a discussion of gene technology methods in the preceding chapters, we now
turn to a few topics of more immediate medical interest. We will be dealing with
different kinds of human diseases that have a genetic component and see how some
aspects of these diseases may be examined using bioinformatics tools such as
Perl.
Inherited disease and changes in DNA
In Chapter 1 we saw how genetic information flows from DNA to proteins. The
sequence of bases in DNA determines the sequence of amino acids in a protein, and that sequence in turn determines the biological function of the protein. The
relationship between genetic information in the form of DNA sequences and
biological function in proteins has been demonstrated by numerous experiments
carried out in molecular biology laboratories. At the same time it is intriguing
to note that this relationship is also elegantly demonstrated in nature. Thus,
we know of numerous examples of naturally occurring changes in DNA that have
marked effects on the function of the corresponding protein. Such changes often
give rise to disease. In discussing this further we need to clarify what types
of alterations are observed in DNA. We may distinguish two major categories.
First, there are highly local changes, such as point mutations (single nucleotide changes) and additions or deletions of a small number of nucleotides.
Second, there are large-scale rearrangements of DNA sequences
that result from DNA recombination events.
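To make the first category concrete, the following small Perl illustration (not taken from the text; the minimal codon table is an assumption for the example) shows how a single nucleotide substitution can change the encoded amino acid, using the well-known sickle-cell change of the codon GAG (glutamic acid) into GTG (valine).

    #!/usr/bin/perl
    # Illustrative sketch: a single nucleotide change in a codon may change
    # the encoded amino acid, here the sickle-cell substitution GAG -> GTG.
    use strict;
    use warnings;

    # A few entries of the standard genetic code, enough for this example
    my %codon_table = (
        GAG => 'Glu',
        GTG => 'Val',
        GAA => 'Glu',
        GCG => 'Ala',
    );

    my $codon  = 'GAG';
    my $mutant = $codon;
    substr( $mutant, 1, 1 ) = 'T';    # point mutation at the second position

    printf "%s (%s) -> %s (%s)\n",
        $codon,  $codon_table{$codon},
        $mutant, $codon_table{$mutant};
    # prints: GAG (Glu) -> GTG (Val)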
Whenever you use a computer you interact with it through an operating system (OS), a vital interface between the hardware and the user. The operating system does a number of different things. For example, multiple programs are often run at the same time, and in this situation the operating system allocates resources to the different programs and may interrupt them when appropriate. Another common feature of an operating system is a graphical user interface (GUI), originally developed for personal computers. Examples of popular operating systems are Microsoft Windows, Mac OS X and Linux.
Linux is an example of a Unix (or ‘Unix-like’) operating system. Unix was originally developed in 1969 at Bell Laboratories in the United States. Many different flavours of the Unix OS have been developed, such as Solaris, HP-UX and AIX, and there are a number of freely available Unix-like systems, notably GNU/Linux, which is available in distributions such as Red Hat Enterprise Linux, Fedora, SUSE Linux Enterprise, openSUSE and Ubuntu.
Analyzing the behaviors of people in smart environments using multimodal sensors requires answering a set of typical questions: who are the people? Where are they? What activities are they doing? When? With whom are they interacting? And how are they interacting? In this view, locating people or their faces and characterizing them (e.g. extracting their body or head orientation) allows us to address the first two questions (who and where), and is usually one of the first steps before applying higher-level multimodal scene analysis algorithms that address the other questions. In the last ten years, tracking algorithms have made considerable progress, particularly in indoor environments or for specific applications, where they have reached a maturity allowing their deployment in real systems and applications. Nevertheless, there are still several issues that can make tracking difficult: background clutter and potentially small object size; complex shape, appearance, and motion, and their changes over time or across camera views; inaccurate or rough scene calibration, or inconsistent camera calibration between views for 3D tracking; and real-time processing requirements. In what follows, we discuss some important aspects of tracking algorithms and introduce the remaining chapter content.
Scenarios and Set-ups. Scenarios and application needs strongly influence the physical environment considered, and therefore the set-up (where, how many, and what type of sensors are used) and the choice of tracking method. A first set of scenarios commonly involves the tracking of people in so-called smart spaces (Singh et al., 2006).
This textbook, for second- or third-year students of computer science, presents insights, notations, and analogies to help them describe and think about algorithms like an expert, without grinding through lots of formal proof. Solutions to many problems are provided to let students check their progress, while class-tested PowerPoint slides are on the web for anyone running the course. By looking at both the big picture and easy step-by-step methods for developing algorithms, the author guides students around the common pitfalls. He stresses paradigms such as loop invariants and recursion to unify a huge range of algorithms into a few meta-algorithms. The book fosters a deeper understanding of how and why each algorithm works. These insights are presented in a careful and clear way, helping students to think abstractly and preparing them for creating their own innovative ways to solve problems.
The acceleration of globalization and the growth of emerging economies present significant opportunities for business expansion. One of the quickest ways to achieve effective international expansion is by leveraging the web, which allows for technological connectivity of global markets and opportunities to compete on a global basis. To systematically engage and thrive in this networked global economy, professionals and students need a new skill set: one that can help them develop, manage, assess and optimize efforts to successfully launch websites for tapping global markets. This book provides a comprehensive, non-technical guide to leveraging website localization strategies for global e-commerce success. It contains a wealth of information and advice, including strategic insights into how international business needs to evolve and adapt in light of the rapid proliferation of the 'Global Internet Economy'. It also features step-by-step guidelines to developing, managing and optimizing international-multilingual websites and insights into cutting-edge web localization strategies.
This book addresses the issues raised by the rapid advance of information technology (IT). IT is singularly pervasive: its applications affect people in all walks of life in a way that few other technologies do. The author's thesis is that it would be wise to become well informed about the capabilities and limitations of IT in order to make rational decisions on its use. The book gives a sufficient, non-technical, description of IT for non-specialist readers to appraise its potential and to evaluate critically proposals for new uses. The impact of IT in particular areas is examined and the influence on people and communities is soberly assessed. The book ends with an agenda for all concerned. Murray Laver is a well-known and respected commentator on topics concerning computers. He provides a realistic overview of IT, steering a middle course between rosy utopias and bleak apocalyptic nightmares.
This book provides an innovative hands-on introduction to techniques for specifying the behaviour of software components. It is primarily intended for use as a textbook for a course in the second or third year of Computer Science and Computer Engineering programs, but it is also suitable for self-study. Using this book will help the reader improve programming skills and gain a sound foundation and motivation for subsequent courses in advanced algorithms and data structures, software design, formal methods, compilers, programming languages, and theory. The presentation is based on numerous examples and case studies appropriate to the level of programming expertise of the intended readership. The main topics covered are techniques for using programmer-friendly assertional notations to specify, develop, and verify small but non-trivial algorithms and data representations, and the use of state diagrams, grammars, and regular expressions to specify and develop recognizers for formal languages.
The interaction between computers and mathematics is becoming more and more important at all levels as computers become more sophisticated. This book shows how simple programs can be used to do significant mathematics. Its purpose is to give those with some mathematical background a wealth of material with which to appreciate both the power of the microcomputer and its relevance to the study of mathematics. The authors cover topics such as number theory, approximate solutions, differential equations and iterative processes, with each chapter self-contained. Many exercises and projects are included, giving ready-made material for demonstrating mathematical ideas. Only a fundamental knowledge of mathematics is assumed, and programming is restricted to 'basic BASIC', which will be understood by any microcomputer. The book may be used as a textbook for algorithmic mathematics at several levels, with all the topics covered appearing in any undergraduate mathematics course.
In spite of the rapid growth of interest in the computer analysis of language, this book provides an integrated introduction to the field. Inevitably, when many different approaches are still being considered, a straightforward work of synthesis would be neither possible nor practicable. Nevertheless, Ralph Grishman provides a valuable survey of various approaches to the problems of syntax analysis, semantic analysis, text analysis and natural language generation, while considering in greater detail those that seem to him most productive. The book is written for readers with some background in computer science and finite mathematics, but advanced knowledge of programming languages or compilers is not necessary, nor is a background in linguistics. The exposition is always clear, and students will find the exercises and extensive bibliography supporting the text particularly helpful.
Extensively class-tested, this textbook takes an innovative approach to software testing: it defines testing as the process of applying a few well-defined, general-purpose test criteria to a structure or model of the software. It incorporates the latest innovations in testing, including techniques to test modern types of software such as OO, web applications, and embedded software. The book contains numerous examples throughout. An instructor's solution manual, PowerPoint slides, sample syllabi, additional examples and updates, testing tools for students, and example software programs in Java are available on an extensive website.
The study of spatial processes and their applications is an important topic in statistics and finds wide application, particularly in computer vision and image processing. This book is devoted to statistical inference for spatial processes and is intended for specialists needing an introduction to the subject and to its applications. One of the themes of the book is the demonstration of how these techniques give new insights into classical procedures (including new examples in likelihood theory) and newer statistical paradigms such as Monte Carlo inference and pseudo-likelihood. Professor Ripley also stresses the importance of edge effects and of the lack of a unique asymptotic setting in spatial problems. Throughout, the author discusses the foundational issues posed and the difficulties, both computational and philosophical, which arise. The final chapters consider image restoration and segmentation methods and the averaging and summarising of images. The book will therefore appeal widely to researchers in computer vision and image processing, to those applying microscopy in biology, geology and materials science, and to statisticians interested in the foundations of their discipline.