The basic modeling problem begins with a set of observed data y^n = {y_t : t = 1, 2, …, n}, generated by some physical machinery, where the elements y_t may be of any kind. Since, no matter what they are, they can be encoded as numbers, we take them as such: natural numbers, with or without order, if the data come from finite or countable sets, and real numbers otherwise. Often each number y_t is observed together with others x_{1,t}, x_{2,t}, …, called explanatory data, written collectively as a K × n matrix X = {x_{i,j}}, and the data are then written as y^n | X. It is convenient to use the terminology “variables” for the source of these data. Hence, we say that the data {y_t} come from the variable Y, and the explanatory data are generated by variables X_1, X_2, and so on.
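As a concrete (and purely illustrative) rendering of this layout, here is a minimal sketch in Python with NumPy; all names and numbers below are hypothetical:

```python
# Sketch of the data layout described above (illustrative only).
# y holds the n observations y_1, ..., y_n; X holds the explanatory
# data as a K x n matrix, with row i the observations of variable X_i.
import numpy as np

n, K = 5, 2
y = np.array([1.2, 0.7, 1.9, 1.4, 2.1])       # observed data y^n
X = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],       # explanatory variable X_1
              [1.0, 0.9, 1.1, 1.0, 1.2]])      # explanatory variable X_2
assert y.shape == (n,) and X.shape == (K, n)   # the pair is written y^n | X
```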
In physics the explanatory data often determine the data y^n of interest, in which case the relationship is called a “law,” but not so in statistical problems. By taking sufficiently many explanatory data we may fit a function to the given set of observed data, but this is not a “law,” since if the same machinery were to generate additional data y_{n+1}, x_{1,n+1}, x_{2,n+1}, …, the function would not give y_{n+1}. This is why the objective is to learn the statistical properties of the data y^n, possibly in the context of the explanatory data.
All science is either physics or stamp collecting.
(Ernest Rutherford)
The 1918 flu pandemic, also referred to as the Spanish flu, was a devastating
infectious disease. It is estimated that 50 million people, about 3% of the
world's population at the time, died of the disease. About 500 million people
were infected. The causative agent was an influenza virus. In this chapter we
will learn more about these viruses. We will make use of important
molecular biology databases and bioinformatics tools. These are useful not only
for learning about influenza viruses, but are also widely used to explore just
about any topic in biology.
Short history of sequence databases
A vast amount of information is collected by projects around the world designed
to characterize genomes, genes, and proteins. Progress in DNA sequencing has
been particularly remarkable. One important task in bioinformatics is
to store all of this information in databases and, importantly, to make it
available to the scientific community for downloading and analysis. Numerous
dedicated individuals working on database projects are the unsung heroes of
bioinformatics and molecular biology (see also the quotation on stamp collecting
above).
Many kinds of information technology can be used to make meetings more productive, some of which are related to what happens before and after meetings, while others are intended to be used during a meeting. Document repositories, presentation software, and even intelligent lighting can all play their part. However, the following discussion of user requirements will be restricted to systems that draw on the multimodal signal processing techniques described in the earlier chapters of this book to capture and analyze meetings. Such systems might help people understand something about a past meeting that has been stored in an archive, or they might aid meeting participants in some way during the meeting itself. For instance, they might help users understand what has been said at a meeting, or even convey an idea of who was present, who spoke, and what the interaction was like. We will refer to all such systems, regardless of their purpose or when they are used, as “meeting support technology.”
This chapter reviews the main methods and studies that have elicited and analyzed user needs for meeting support technology over the past decade. The chapter starts by arguing that what is required is an iterative software process that, through interaction between developers and potential users, gradually narrows and refines sets of requirements for individual applications. It then both illustrates this approach and lays out specific user requirements by discussing the major user studies that have been conducted for meeting support technology.
Professor Miomir Vukobratović passed away on March 11, 2012 at the age of 81. However, his achievements in robotics will remain with us always.
Miomir Vukobratović was born in 1931. He graduated in 1957 from the Faculty of Mechanical Engineering, University of Belgrade, where he also obtained his first PhD in 1964. In January 1958 he joined the Aeronautical Institute in Belgrade. At the beginning of 1965 he moved to the Mihajlo Pupin Institute in Belgrade, where he became director of the Robotics Laboratory. In 1980, he became professor at the Production Engineering Department of the Faculty of Mechanical Engineering, University of Belgrade.
Meetings are a rich resource of information that, in practice, is mostly untouched by any form of information processing. Even now it is rare for meetings to be recorded, and fewer still are annotated for access purposes. Examples of the latter are largely limited to meetings held in parliaments, courts, hospitals, banks, and similar institutions, where a record is required for decision tracking or legal obligations. In these cases a labor-intensive manual transcription of the spoken words is produced. Giving much wider access to this rich content is the main aim of the AMI consortium projects, and there are now many signs of interest in that access, through the release of commercial hardware and software services. Especially with the advent of high-quality telephone and videoconferencing systems, the opportunity to record, process, recognize, and categorize the interactions in meetings is recognized even by skeptics of speech and language processing technology.
Of course meetings are an audio-visual experience by nature, and humans make extensive use of visual and other sensory information. Illustrating this rich landscape of information is the purpose of this book, and many applications can be implemented without even looking at the spoken word. However, it is still verbal communication that forms the backbone of most meetings and accounts for the bulk of the information transferred between participants. Hence automatic speech recognition (ASR) is the key to accessing the information exchanged and is the most important component required for most higher-level processing.
SMS language exhibits special phenomena and important deviations from natural language. Every day, an impressive number of chat messages, SMS messages, and e-mails are sent all over the world. This widespread use makes it important to develop systems that normalize SMS language into natural language. However, typical machine translation approaches are difficult to adapt to SMS language because of the many irregularities this kind of language exhibits. This paper presents a new approach to SMS normalization that combines lexical and phonological translation techniques with disambiguation algorithms at two different levels: lexical and semantic. The proposed method does not depend on a large annotated corpus, which is difficult to build, and it is applied in two different domains, showing its ease of adaptation across different languages and domains. The results obtained by the system outperform some of the existing methods of SMS normalization, despite the fact that the Spanish language and the corpus created have some features that complicate the normalization task.
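To make the general idea concrete, here is a purely illustrative sketch of a lexical normalization layer; this is not the paper's method, and the toy variant lexicon is invented:

```python
# Toy sketch of dictionary-based lexical normalization of SMS text
# (illustrative only; not the system described in the abstract).
# Real systems add phonological models and lexical/semantic
# disambiguation on top of such a lookup.
VARIANTS = {"u": "you", "gr8": "great", "2nite": "tonight"}  # invented lexicon

def normalize(text: str) -> str:
    """Replace known SMS variants with their standard forms."""
    return " ".join(VARIANTS.get(tok.lower(), tok) for tok in text.split())

print(normalize("u look gr8 2nite"))  # -> "you look great tonight"
```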
I fell in love with RNA in one of my first jobs as an undergraduate.
(Joan Steitz, quoted by Sedwick, 2011)
Methods of gene prediction
We saw in the previous chapter how prediction of CpG islands may be used to
identify transcription start sites of protein-coding genes. However, there are
many other elements and statistical properties of such genes that we may exploit
for gene finding.
What are the methods available for computational gene finding? In general, one
may distinguish between two major categories: de novo or
ab initio methods and homology-based
methods. The de novo methods make use of statistical signals in
DNA sequences that are characteristic of protein-coding genes; the
homology-based methods rely on the identification of exons by matching known
mRNA or protein sequences or even profile HMMs to a genomic sequence. The
homology-based methods are powerful but they require that mRNA or protein
sequence information is available. Here we will focus on the de
novo type of gene finding. What are the signals characteristic of
protein-coding genes?
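One of the simplest such signals is a long open reading frame (ORF). The sketch below is illustrative only (single strand, start-to-stop scan); real de novo gene finders combine many signals such as codon usage, splice-site models, and HMMs:

```python
# Minimal ORF scan as an example of a de novo signal (illustrative).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Yield (start, end) of ATG..stop ORFs with at least min_codons codons."""
    seq = seq.upper()
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3)
                start = None                    # close and keep scanning

# Toy usage (threshold lowered so the example fires):
print(list(find_orfs("ATGAAATTTTAA", min_codons=3)))  # -> [(0, 12)]
```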
At some point a particularly remarkable molecule was formed by accident. We will call it the Replicator. It may not have been the biggest or the most complex molecule around, but it had the extraordinary property of being able to create copies of itself.
(Richard Dawkins, 1989)
The RNA world
So far this book has focused on proteins and the genes that encode them. The human genome encodes some 21,000 different proteins, and the vast majority of them are important. On the other hand, there is a whole range of RNAs transcribed from the human genome that do not code for proteins but have other functions. We refer to these RNAs as non-coding RNAs (ncRNAs). In fact, a major portion of the human genome is transcribed, although only about 1.5% of it corresponds to coding regions. We still do not know the function of many of these RNAs, but a large number of ncRNA families have been characterized. Classic examples are tRNAs and ribosomal RNAs, which are part of the translation machinery. A set of U RNAs are involved in splicing (Chapter 16), and there are catalytically important RNA molecules in the RNA-processing enzymes RNases P and MRP. A vital and highly populated class of ncRNAs comprises those involved in gene silencing, as described in Chapter 3.
The evaluation of meeting support technology can broadly be divided into three categories, which this chapter discusses in sequence in terms of goals, methods, and outcomes, following a brief introduction on methodology and on undertakings prior to the AMI Consortium (Section 13.1). Evaluation efforts can be technology-centric, focused on determining how specific systems or interfaces perform in the tasks for which they were designed (Section 13.2). Evaluations can also adopt a task-centric view, defining common reference tasks such as fact finding or verification, which directly support cross-comparisons of different systems and interfaces (Section 13.3). Finally, the user-centric approach evaluates meeting support technology in its real context of use, measuring the increase in efficiency and user satisfaction that it brings (Section 13.4).
These aspects of evaluation differ from the component evaluation that accompanies each of the underlying technologies described in Chapters 3 to 10, which is often a black-box evaluation based on reference data and distance metrics (although task-centric approaches have been adopted for summarization evaluation, as shown in Chapter 10). Rather, the evaluation of meeting support technology is a stage in a complex software development process, for which the helix model was proposed in Chapter 11. We revisit this process in the light of the evaluation work, especially for meeting browsers, at the end of this chapter (Section 13.5).
Approaches to evaluation: methods, experiments, campaigns
The evaluation of meeting browsers, as pieces of software, should be related (at least in theory) to a precise view of the specifications they are intended to satisfy.
This book has two parts: the first summarizes the facts of coding and information theory that are needed to understand the essence of estimation and statistics, and the second describes a new theory of estimation, which also covers a good part of statistics. After all, both estimation and statistics are about extracting information from often chaotic-looking data in order to learn what it is that makes the data behave the way they do. The first part, together with an outline of algorithmic information in Appendix A, is meant for the statistician who wants to understand his or her discipline rather than just learn a bag of tricks, with programs to apply them to various data; such tricks are not based on any theory and do not stand up to critical examination, although some of them can be quite useful and provide solutions for important statistical problems.
The word information has many meanings, two of which have been formalized by Shannon. The first is fundamental in communication: it is simply the number of messages, strings of symbols, either to be stored or to be sent over some communication channel, the practical question being the size of the storage device needed or the time it takes to send them. The second meaning is a measure of the strength of the statistical property a string has; it is fundamental in statistics, and very different from the meaning in communication.
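To make the two meanings concrete, here is a standard textbook formalization (a sketch; the book's own notation may differ):

```latex
% First meaning: a set of M messages can be indexed by codewords of
% about \log_2 M bits; the m^n strings of length n over an m-symbol
% alphabet thus require n \log_2 m bits to store or transmit.
% Second meaning: for symbol probabilities p_1, \dots, p_m, the entropy
\[
  H(p) = -\sum_{i=1}^{m} p_i \log_2 p_i \;\le\; \log_2 m ,
\]
% with equality only for the uniform distribution; the gap
% \log_2 m - H(p) quantifies the strength of the string's
% statistical regularity.
```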
Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum.
(Augustus De Morgan, 1806–1871)
In this chapter, as well as in Chapters 12 and 13, we turn to the important
genomics and bioinformatics problem of identifying biological function based on
nucleotide and amino acid sequences.
Assigning function based on sequence similarity
A common problem in molecular biology is that you are faced with a gene or a gene
product and you have no clue from experimental studies as to its function. In
this context a critical contribution of bioinformatics is to attribute a
function to the sequence of a gene or a gene product. As one example, a genome
sequencing project may give rise to tens of thousands of predicted protein
sequences. In such a case we want to assign a biological function to as many of
these as possible using computational tools. In this manner we avoid many
laborious wet-lab experiments. In addition to genome sequencing projects, there
are other more specialized situations where we want to find functions of genes.
For instance, we could identify genes as being related to a specific genetic
trait or disease, or a set of genes as being expressed under certain
conditions.
A number of computational tools are available to predict a biological function
associated with a protein sequence. In this chapter we will see an example in
which we assign a function to a protein based on sequence similarity. Consider
the human gene encoding the protein BRCA1, originally sequenced in 1994 (Miki
et al., 1994). It was found to be related in sequence to a
yeast protein RAD9. This yeast protein is involved in cell cycle control. This
observation gave scientists a hint about possible roles of the BRCA1 gene. We
see here an example of inferring a function based on a homology relationship to
a protein that has already been functionally characterized. We will see yet
another example of this situation in this chapter, where we will make use of
BLAST to identify a homology relationship. We already encountered BLAST in the
context of the BCR–ABL fusion protein in Chapter 7.
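As a minimal illustration of such a similarity search (a sketch, not the book's own workflow; the query file name is hypothetical), a remote BLASTP search against the NCBI nr database can be run with Biopython:

```python
# Sketch of a protein similarity search with Biopython (assumes
# Biopython is installed and NCBI is reachable; "brca1.fasta" is a
# hypothetical query file, and parameters are illustrative).
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

record = SeqIO.read("brca1.fasta", "fasta")               # load the query
handle = NCBIWWW.qblast("blastp", "nr", str(record.seq))  # remote BLASTP

blast_record = NCBIXML.read(handle)
for alignment in blast_record.alignments[:5]:             # top five hits
    hsp = alignment.hsps[0]
    print(alignment.title[:60], "E =", hsp.expect)        # title and E-value
```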
From 1878 to 1896, 3482 Tiger skins were despatched from [a tannery] to
London where they were made into waistcoats.
(Norman Laird, article in The Mercury, 7 October 1968;
cited in Owen, 2003)
This chapter will deal further with phylogenetic analysis. We will introduce
methods in addition to neighbour-joining, and we will use a Perl script
to examine taxonomy data. For these topics we will take a closer look at an
extinct animal, the Tasmanian tiger.
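Although this book's scripts are in Perl, the following Python sketch may help fix ideas about what neighbour-joining consumes and produces (Biopython is assumed; the alignment file name is hypothetical):

```python
# Sketch of neighbour-joining tree construction with Biopython
# (illustrative; "thylacine.phy" is a hypothetical alignment file).
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import (DistanceCalculator,
                                        DistanceTreeConstructor)

alignment = AlignIO.read("thylacine.phy", "phylip")          # multiple alignment
dm = DistanceCalculator("identity").get_distance(alignment)  # distance matrix
tree = DistanceTreeConstructor().nj(dm)                      # neighbour-joining
Phylo.draw_ascii(tree)                                       # quick text rendering
```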
Extinction
The Tasmanian tiger was not, in fact, a tiger; it was a dog-like marsupial
animal. Thylacine is the more appropriate scientific name. By the
early twentieth century it existed only in Tasmania, and even there it was very
scarce. A farmer named Wilf Batty lived in the Mawbanna district of northwestern
Tasmania. On 13 May 1930 he spotted a thylacine attempting to break into his
chicken coop. Batty had observed the thylacine around his house for weeks, and
this day he took his rifle and shot the animal. As it happened, this was the
last wild thylacine to be killed. Another specimen, most likely a female, was
captured in 1933 and kept at the Hobart Zoo in Tasmania. She died on 7 September
1936, apparently as a result of neglect: the animal was kept outdoors and was
not allowed access to her den, despite extreme temperatures. Ironically, her
death took place only two months after the thylacine species had been given full
legal protection by the Tasmanian government. Sightings of the thylacine have
been reported after 1936, but none of these are well documented, and we
unfortunately must regard the thylacine as extinct.