Data are big, data are hard, data are experimental; data are everywhere.
Converting data into (useful) information is a challenge, but a necessary step to not only justify data collection but also appraise processes and arrive at conclusions. This chapter strives to reach such an ambitious goal. Beyond the provision of analytical methods, it aims to highlight potential pitfalls and tips for success.
We shall start by contemplating the very essence of data analysis, discussing the theoretical concepts and relevant approaches required for a particular analysis. Given the diversity in complexity and nature, we will consider whether it is possible, or desirable, to separate a part from the whole: does it make sense to look only at a subset of data to draw a conclusion? The same question applies when separating one part of an analysis from another, for example when treating graphical representation in isolation from statistical analysis. A figure can only illustrate a result, not quantify its significance. Very importantly, when designing an experiment and collecting data one should always remember the words of R.A. Fisher, the eminent statistician and geneticist: ‘To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.’
Then we will get into the practicalities of data handling, particularly the actual manipulation of data into the format required for further analysis. It may not be the most glamorous subject, but it is essential for the appropriate and efficient use of the data collected. Just like a well-designed experiment, well-organised data can make collection, visualisation and analysis a smooth and enjoyable experience. Data reformatting is prone to mishaps, including inadvertent loss, erroneous characterisation, mix-ups and even outright corruption. Unfortunately, such misfortunes happen all too often; being prepared will not only minimise their impact, but also help catch any problem earlier, making it easier to fix. We will also emphasise that data handling should be directly linked to graphical representation and the expected analysis on the platform of choice. We will therefore discuss the optimal use of spreadsheets and of the R and Python languages (see also Chapter 18).
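A common instance of the reformatting described above is converting a ‘wide’ spreadsheet export (one column per measurement) into the ‘long’ format that most statistics and plotting tools expect. The following sketch illustrates this with the Python standard library only; the column names and values are invented for illustration, not taken from any dataset in this book.

```python
import csv
import io

# Hypothetical wide-format spreadsheet export:
# one row per sample, one column per replicate measurement.
wide = """sample,rep1,rep2,rep3
A,1.2,1.4,1.1
B,2.3,2.1,2.6
"""

# Reshape to long format: one row per (sample, replicate, value).
# Long format makes grouping, plotting and modelling far easier.
long_rows = []
for row in csv.DictReader(io.StringIO(wide)):
    for rep in ("rep1", "rep2", "rep3"):
        long_rows.append({"sample": row["sample"],
                          "replicate": rep,
                          "value": float(row[rep])})

for r in long_rows:
    print(r["sample"], r["replicate"], r["value"])
```

In practice the same reshaping is a one-liner in R (`pivot_longer`) or in Python with pandas (`melt`); the point is that a deliberate, scripted conversion leaves an auditable trail, whereas manual cut-and-paste in a spreadsheet is where losses and mix-ups tend to occur.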
Advanced nucleic acid sequencing and bioinformatic technologies allow the investigation of genomes and transcriptomes, and thus provide useful tools to investigate the molecular biology, biochemistry and physiology of organisms.
This chapter describes genome-sequencing methodologies, approaches and algorithms used for genome assembly and annotation. Compared with the long-established biochemical methodologies, the sequencing and drafting of genomes is constantly evolving as new bioinformatics tools become available. Despite technological advances, there have been considerable challenges in the sequencing, annotation and analyses of previously uncharacterised eukaryotic genomes. In this chapter, we will therefore discuss some bioinformatic workflows that have proven to be efficient for the assembly and annotation of complex eukaryotic genomes, and give a perspective on future research toward improving genomic and transcriptomic methodologies.
A genome project starts with the assembly and annotation of a draft genome from experimentally determined DNA sequences referred to as reads (nucleotide sequence regions). The success of an assembly depends on the sequencing technology and the assembly algorithms used, as well as the quality of the data. Typically, short-read shotgun assemblies do not lead to complete chromosomal assemblies of eukaryote genomes, mainly because of challenges in resolving repeat regions and a lack of uniform read coverage linked to suboptimal genomic DNA quality or poor library construction. Clearly, the quality of a genomic assembly is critical for subsequent gene predictions, and the quality of gene prediction is crucial to achieve an acceptable annotation.
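To see concretely why repeats frustrate short-read assembly, consider a toy de Bruijn-style sketch (the sequence and the value of k below are invented for illustration). Assemblers of this family link each (k−1)-mer to the (k−1)-mers that follow it; a segment repeated in two different contexts creates a node with more than one outgoing edge, so no unique path, and hence no unique assembly, exists through that region.

```python
from collections import defaultdict

def debruijn_successors(sequence, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    succ = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        succ[kmer[:-1]].add(kmer[1:])
    return succ

# Toy 'genome' in which the segment ACGT occurs in two contexts.
genome = "TTACGTGGACGTCC"
succ = debruijn_succ = debruijn_successors(genome, k=5)

# Nodes with more than one successor are branch points: the path
# through the graph is ambiguous there, mirroring the repeat problem.
branches = {node: outs for node, outs in succ.items() if len(outs) > 1}
print(branches)
```

Real assemblers resolve such branches using paired-end information, longer reads or coverage statistics; when they cannot, the assembly fragments into contigs at exactly these points.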
The annotation of a genome is typically divided into structural and functional phases. For structural annotation, genomic features, such as genes, RNAs and repeats, are predicted, and their composition and location in the genome inferred. For gene prediction, results of both ab initio and evidence-based predictions (from mRNA, cDNA and/or proteomic data) are often combined. Functional annotation (also called functional prediction) assigns a potential function to a gene or genome element. In general, functions are predicted using similarity searches, structural comparisons, phylogenetic approaches, genetic interaction networks and machine-learning approaches. The following sections cover commonly used algorithms and recent methods for genome assembly, the prediction of protein-encoding genes and the functional annotation of such genes and their products.
Catalytic reactions in biological processes are facilitated by two types of catalysts, enzymes and ribozymes. Whereas enzymes are proteins, ribozymes consist of ribonucleic acid (RNA). Most enzymes are much larger than the substrates they process, but this is not a requirement (for example, restriction enzymes that cleave DNA). The catalytic features of enzymes arise through a particular three-dimensional arrangement of functional groups in a small number of amino acids (see Section 23.4.2) in the active site of the enzyme. The geometrical arrangement of such groups enables productive interactions with the bound substrate and leads to formation of a transition state for which the energy barrier (activation energy) is significantly reduced as compared to the non-catalysed reaction (see Figure 23.1). Consequently, the reaction rate is increased by several orders of magnitude relative to the non-catalysed reaction. Importantly, enzymes do not alter the position of equilibrium of the reversible reactions they catalyse, rather they accelerate establishment of the position of equilibrium for the reaction.
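The connection between a lowered activation energy and a faster rate can be made quantitative with the Arrhenius relation, k ∝ exp(−Ea/RT): lowering the barrier by ΔEa multiplies the rate by exp(ΔEa/RT), all else being equal. The 30 kJ mol⁻¹ reduction used below is a hypothetical illustrative figure, not a value from this chapter.

```python
import math

R = 8.314   # gas constant, J mol^-1 K^-1
T = 298.0   # temperature, K (25 degrees C)

def rate_enhancement(delta_ea_kj_per_mol):
    """Fold increase in rate when the activation energy is lowered
    by delta_ea, assuming the Arrhenius pre-exponential factor
    is unchanged by the catalyst."""
    return math.exp(delta_ea_kj_per_mol * 1000.0 / (R * T))

# A hypothetical 30 kJ/mol reduction of the barrier at 25 degrees C
# already yields an enhancement of roughly five orders of magnitude.
print(f"{rate_enhancement(30.0):.3g}")
```

This simple exponential dependence is why even a modest stabilisation of the transition state by the active site translates into the enormous rate accelerations observed for enzymes.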
Many enzymes are the key players in metabolic or signalling pathways. In these coordinated pathways, enzymes are collectively responsible for maintaining the metabolic needs of cells under varying physiological conditions (Section 23.5). Through their individual catalytic activities, they control the rate of a particular metabolic or signalling pathway. A range of regulatory mechanisms operates to allow short-, medium- and long-term changes in activity (Section 23.5.1). Therefore, the over- or under-expression of an enzyme can lead to cell dysfunction, which may manifest itself as a particular disease. Hence, enzymes have been among the most important targets in the development of therapeutics, and typical drugs for the treatment of pathological conditions act as inhibitors of particular enzymes (Section 23.5.3). Other clinically relevant applications include monitoring of enzyme levels in assessment of disease states. For example, damage to the heart muscle as a result of oxygen deprivation following a heart attack results in the release of cellular enzymes into extracellular fluids and eventually into the blood. Such release can be monitored to aid diagnosis of the organ damage and to make a prognosis for the patient's future recovery (Section 10.3).
Over the last decade, we have witnessed a massive expansion in demand for and access to low-cost high-throughput sequencing of nucleic acids, which can be predominantly attributed to the advent and establishment of so-called next-generation sequencing technologies (NGS; see Section 20.2.2). The availability and progressively decreasing costs of such technologies have been accompanied by an ever-increasing number of nucleic acid and protein sequences being deposited in public repositories and, in turn, by the need to draw biologically meaningful information or interpretations from these data. As a consequence, the discipline of bioinformatics has become instrumental in many areas of biology and, in particular, molecular biology.
Bioinformatics can be defined as a ‘fusion’ of biology and informatics, which includes applied mathematics, computer sciences, information technology and statistics. This multi-disciplinary field of research includes two major components, one aimed at developing computational tools and algorithms to facilitate storage, analysis and manipulation of sequence data, and one aimed at applying such tools to the discovery of new biological insights on the organism(s) under consideration. Researchers involved in the field of bioinformatics comprise both algorithm- or software-developers and end-users. While the main interest of the first group lies in writing sequence analysis programs and tools (programming, often called ‘coding’), the second group wishes to apply these tools to answer questions of biological relevance. Within this latter group, experienced end-users often download and maintain programs on their personal computers or servers, analyse a large number of sequences (thousands to millions) simultaneously, have a working knowledge of programming languages and are therefore skilled in the use of command-line-based software. On the other hand, occasional users mainly deal with a limited number of sequences and thus prefer the use of ‘user-friendly’, web-server-based tools, which often offer a reduced set of options and a limited capacity when compared with the corresponding downloadable software packages. This chapter intends to provide an overview of the basic methods and bioinformatics resources available for the analysis of nucleic acid and protein sequences, and is primarily addressed to occasional users. For details on bioinformatic analyses of large-scale sequence datasets, such as those generated by NGS technologies, the reader is referred to Chapter 20, while programmatic or script access to programs will require more advanced programming skills, for example using the Python language (see Chapter 18).
Mass spectrometry (MS) is an extremely valuable analytical technique in which the molecules in a test sample are converted to gaseous ions that are subsequently separated in a mass analyser according to their mass-to-charge (m/z) ratio and then detected. The mass spectrum is a plot of the relative abundances of the ions at each m/z ratio. Note that it is the mass-to-charge ratio (m/z), and not the actual mass, that is measured. For example, if a biomolecule is ionised in positive ion mode, the instrument measures the m/z after the addition of one proton (i.e. 1.007276 Da for exact mass or 1.0078 Da for average mass). Similarly, for a biomolecule ionised in negative ion mode, an m/z after the loss of one proton is measured.
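The arithmetic behind these m/z values is straightforward and can be sketched directly. The neutral monoisotopic mass used below is a hypothetical example value; the extension to multiply protonated ions (common in electrospray ionisation) follows the same logic and is included here as an illustration, not as something stated in the text above.

```python
PROTON = 1.007276  # monoisotopic mass of a proton, Da

def mz(neutral_mass, charge=1, mode="positive"):
    """m/z of an ion formed by gaining (positive mode) or losing
    (negative mode) `charge` protons from a neutral molecule."""
    if mode == "positive":
        return (neutral_mass + charge * PROTON) / charge
    return (neutral_mass - charge * PROTON) / charge

# Hypothetical biomolecule of neutral monoisotopic mass 1000.000 Da:
print(mz(1000.000, 1))              # [M+H]+  singly protonated
print(mz(1000.000, 2))              # [M+2H]2+ doubly protonated
print(mz(1000.000, 1, "negative"))  # [M-H]-  deprotonated
```

Note how the doubly charged ion appears at roughly half the m/z of the singly charged one; this is precisely why m/z, rather than mass, is the quantity on the x-axis of a spectrum.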
The essential features of all mass spectrometers are:
• Generation of ions in the gas phase
• Separation of ions in a mass analyser
• Detection of each species of a particular m/z ratio.
Several techniques exist to generate ions and are discussed below. Of these, the development of electrospray ionisation (ESI; Section 15.2.4) and matrix-assisted laser desorption ionisation (MALDI; Section 15.2.5) has effectively expanded the detectable mass range, enabling the measurement of almost any biomolecule. Mass analysers separate ions by use of either a magnetic or an electric field (Section 15.3); detectors produce a measurable signal in the form of either a voltage or a current that can be transformed by a computer into data to be analysed (Section 15.4). The symbol Mr is used to designate relative molecular mass. As a relative measure, Mr has no units.
The treatment of mass spectrometry in this chapter will be rather non-mathematical and non-technical. Mass spectrometry has a wide array of applications, including drug discovery and the sciences of proteomics (Chapter 21) and metabolomics (Chapter 22).
This chapter will focus on the fundamental principles of mass spectrometry. The intention is to give an overview of the different types of instrumentation available, and to discuss their applications, complementary techniques and the advantages and disadvantages of each system. Sample preparation and data analysis will also be covered. A further reading list is provided, covering more technical and mathematical aspects of mass spectrometry.