An oligonucleotide probe is a short piece of single-stranded DNA complementary to the target gene whose expression is measured on the microarray by that probe. In most microarray applications, oligonucleotide probes are between 20 and 60 bases long. The probes are either spotted onto the array or synthesised in situ, depending on the microarray platform (Chapter 1).
Usually, oligonucleotide probes for microarrays are designed within several hundred bases of the 3′ end of the target gene sequence. So for a fixed oligonucleotide length, there are several hundred potential oligonucleotides, one for each possible starting base. Some of these oligonucleotides work better than others as probes on a microarray. This chapter describes methods for the computer selection of good oligonucleotide probes.
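The enumeration of candidates described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `candidate_probes`, the default probe length of 25 bases, the 300-base window and the strand convention (probe taken as the reverse complement of the 5′→3′ target sequence) are all assumptions made for the example, not values prescribed by the text.

```python
def candidate_probes(target, probe_length=25, window=300):
    """Enumerate candidate probes near the 3' end of a target sequence.

    target is assumed to be written 5'->3' using the letters A, C, G, T.
    Returns a list of (start, probe) pairs, one per possible starting base,
    where each probe is the reverse complement of the target window.
    """
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Restrict the search to the last `window` bases of the target.
    region_start = max(0, len(target) - window)
    probes = []
    for start in range(region_start, len(target) - probe_length + 1):
        site = target[start:start + probe_length]
        probe = "".join(complement[base] for base in reversed(site))
        probes.append((start, probe))
    return probes


# Hypothetical 400-base target used purely for demonstration.
example_target = "ACGT" * 100
candidates = candidate_probes(example_target)
print(len(candidates), "candidate probes; first:", candidates[0])
```

With a 300-base window and 25-base probes there are 276 starting positions, which matches the "several hundred" candidates mentioned above. Selecting among them is the subject of the rest of the chapter.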
What Makes a Good Oligonucleotide Probe?
Good oligonucleotide probes have three properties: they are sensitive, specific and isothermal.
A sensitive probe is one that returns a strong signal when the complementary target is present in the sample. There are two factors that determine the sensitivity of a probe:
▪ The probe does not have internal secondary structure or bind to other identical probes on the array.
▪ The probe is able to access its complementary sequence in the target, which could potentially be unavailable as a result of secondary structure in the target.
A specific probe is one that returns a weak signal when the complementary target is absent from the sample; i.e., it does not cross-hybridise. There are two factors that determine the specificity of a probe:
▪ Cross-hybridization to other targets as a result of Watson–Crick base-pairing
▪ Non-specific binding to the probe; e.g., as a result of G-quartets
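Two of the factors listed above lend themselves to very simple sequence screens. The sketch below is a crude illustration, not a method from this chapter: the helper names, the choice of four consecutive Gs as a flag for possible G-quartet formation, and the use of the longest self-complementary stretch as a proxy for secondary structure or probe-to-probe binding are all assumptions made for the example. Real probe design would normally rely on thermodynamic models and database searches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (letters A, C, G, T)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def has_g_run(probe, run_length=4):
    """Flag probes containing a run of Gs (assumed threshold: four)."""
    return "G" * run_length in probe


def longest_self_complementary(probe):
    """Length of the longest stretch whose reverse complement also occurs in the probe.

    A long stretch suggests the probe could fold on itself or pair with an
    identical neighbouring probe. This is a heuristic only.
    """
    best = 0
    n = len(probe)
    for k in range(1, n + 1):
        found = any(
            reverse_complement(probe[i:i + k]) in probe
            for i in range(n - k + 1)
        )
        if not found:
            break  # no stretch of length k, so none of length k + 1 either
        best = k
    return best


# Hypothetical probe sequences for demonstration.
for probe in ["ACGGGGTACCATGA", "AAAACCCCTTTT"]:
    print(probe, has_g_run(probe), longest_self_complementary(probe))
```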
DNA array technology is almost fifteen years old, and still rapidly evolving. It is one of very few platforms capable of matching the scale of sequence data produced by genome sequencing. Applications range from analysing single base changes (SNPs) to detecting deletion or amplification of large segments of the genome (CGH). At present, its most widespread use is in the analysis of gene expression levels. When carried out globally on all the genes of an organism, this analysis exposes its molecular anatomy with unprecedented clarity. In basic research, it reveals gene activities associated with biological processes and groups genes into networks of interconnected activities. There have been practical outcomes, too. Most notably, large-scale expression analysis has revealed genes associated with disease states, such as cancer, informed the design of new methods of diagnosis, and provided molecular targets for drug development.
At face value, the method is appealingly simple. An array is no more than a set of DNA reagents for measuring the amount of their sequence counterparts among the mRNAs of a sample. However, the quality of the result is affected by several factors, including the quality of the array and the sample, the uniformity of the hybridisation process, and the method of reading signals. Errors, inevitable at each stage, must be taken into account in the design of the experiment and in the interpretation of results. It is here that the scientist needs the help of advanced statistical tools.
Dr. Stekel is a mathematician with several years of experience in the microarray field. He has used his expertise in a company setting, developing advanced methods for probe design and for the analysis of large, complex data sets.
The image of the microarray generated by the scanner (Section 1.3) is the raw data of your experiment. Computer algorithms, known as feature extraction software, convert the image into the numerical information that quantifies gene expression; this is the first step of data analysis. The image processing involved in feature extraction has a major impact on the quality of your data and the interpretation you can place on it.
In Chapter 1 we discussed three technologies by which microarrays are manufactured: in-situ synthesis with the Affymetrix platform, inkjet in-situ synthesised arrays (Rosetta, Agilent and Oxford Gene Technology) and pin-spotted microarrays. This chapter focusses on pin-spotted arrays. Affymetrix has integrated its image processing algorithms into the GeneChip experimental process and there are no decisions for the end-user to make. Inkjet arrays are of much higher quality than pin-spotted arrays and do not suffer from many of the image-processing difficulties of spotted arrays; also, Agilent provides image-processing software tailor-made for its platform, so there are no decisions for the end-user either.
Pin-spotted arrays, on the other hand, provide the user with a wide range of choices of how to process the image. These choices have an impact on the data, and so this chapter describes the fundamentals of these computational methods to give a better understanding as to how they impact the data.
FEATURE EXTRACTION
The first step in the computational analysis of microarray data is to convert the digital TIFF images of hybridisation intensity generated by the scanner into numerical measures of the hybridisation intensity of each channel on each feature. This process is known as feature extraction.
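As a rough illustration of what feature extraction involves, the sketch below summarises a single feature in a single channel. It assumes the image has already been read into a list of rows of pixel intensities, that the feature centre and radius are known, and that a fixed circle separates feature pixels from local background. The function name, the `margin` parameter and the use of medians are illustrative choices for this example rather than the method of any particular feature extraction package.

```python
from statistics import median


def extract_feature(image, centre_row, centre_col, radius, margin=2):
    """Return (foreground, background) median intensities for one feature.

    image is a list of rows of pixel intensities for one channel.
    Pixels within `radius` of the centre count as foreground; the remaining
    pixels in the surrounding square (radius + margin) count as background.
    """
    n_rows, n_cols = len(image), len(image[0])
    reach = radius + margin
    foreground, background = [], []
    for r in range(max(0, centre_row - reach), min(n_rows, centre_row + reach + 1)):
        for c in range(max(0, centre_col - reach), min(n_cols, centre_col + reach + 1)):
            inside = (r - centre_row) ** 2 + (c - centre_col) ** 2 <= radius ** 2
            (foreground if inside else background).append(image[r][c])
    return median(foreground), median(background)


# Synthetic 9 x 9 image: background of 10 with a bright circular spot of 100.
image = [[10] * 9 for _ in range(9)]
for r in range(9):
    for c in range(9):
        if (r - 4) ** 2 + (c - 4) ** 2 <= 4:
            image[r][c] = 100

fg, bg = extract_feature(image, 4, 4, radius=2)
print("foreground:", fg, "background:", bg, "corrected:", fg - bg)
```

Running the same function on each channel's image gives one intensity per channel per feature, which is the numerical output the text describes.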
Chapter 1 introduced microarray technologies and discussed the use of microarrays in the laboratory. The remainder of the book is dedicated to microarray bioinformatics. This chapter, together with the next chapter, discusses the bioinformatics required to design a DNA microarray. In this chapter, we look at the sequence databases that are used to select and annotate the genes that the microarray detects and, thus, the sequences that will appear on the array. Chapter 3 looks at the computer design of oligonucleotide probes for oligonucleotide arrays.
There are two broad questions and one more specific consideration that this chapter seeks to address:
1. What resources could I use to design my own custom array?
If you are designing a custom microarray to study a particular disease, tissue or organism, you will need to identify the genes that might be expressed in your samples and identify the sequences of those genes. One of the aims of this chapter is to give an understanding of which databases you could use to select such genes.
2. How can I find more information about the sequences of the genes on my array?
DNA microarrays contain sequences that are derived from DNA sequence databases. The output file containing the numerical results of the microarray experiment that you will analyse also contains a number of fields that relate these sequences to the databases from which they derive. This chapter describes the meanings of these fields and the nature of the databases.
One of the most exciting areas of microarray research is the use of microarrays to find groups of genes that can be used diagnostically to determine the disease that an individual is suffering from, or prognostically to predict the success of a course of therapy or results of an experiment.
In these studies, samples are taken from several groups of individuals with known pathologies, outcomes or phenotypes and hybridised to microarrays. The aim is to find a small number of genes that can predict to which group each individual belongs. These genes can then be used in the future as part of a molecular test on further individuals, either using a focussed microarray, or a simpler method such as quantitative polymerase chain reaction (PCR).
EXAMPLE 9.1 DATA SET 9A
Bone marrow samples are taken from 27 patients suffering from acute lymphoblastic leukaemia (ALL) and 11 patients suffering from acute myeloid leukaemia (AML) and hybridised to Affymetrix arrays. We want to be able to diagnose the leukaemia in future patients using either Affymetrix technology or more focussed arrays with a small number of genes. How do we choose a set of rules to classify these samples?
The development of such predictive models depends on statistical and computational techniques, many of which are still the subject of active research. There are essentially three parts to developing a predictive model, and so the chapter is arranged into three further sections:
Section 9.2: Methods of Classification, looks at a number of commonly used methods for distinguishing between groups of individuals based on a given set of measurements. There are several well-established methods for doing this, many of which have been shown to work well with microarray data.
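To make the idea of a classification rule concrete, here is a minimal sketch of one very simple approach, a nearest-centroid classifier. It is chosen here only for illustration and is not necessarily among the methods the chapter covers. The expression values are invented toy numbers for two hypothetical genes, not measurements from Data Set 9A.

```python
def centroid(profiles):
    """Mean expression profile of a group of samples (one list per sample)."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def classify(sample, centroids):
    """Return the label of the centroid closest to the sample (squared Euclidean)."""
    def squared_distance(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))


# Toy training data: two hypothetical genes measured in each sample.
training = {
    "ALL": [[5.0, 1.0], [6.0, 2.0]],
    "AML": [[1.0, 5.0], [2.0, 6.0]],
}
centroids = {label: centroid(profiles) for label, profiles in training.items()}

print(classify([5.0, 2.0], centroids))
print(classify([1.0, 6.0], centroids))
```

The rule learned from the training samples is just the pair of centroids. A new patient's sample is assigned to whichever group centroid it lies closest to.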
The design of experiments is one of the most important areas of microarray bioinformatics and is a long-standing topic in classical statistics. The reason for good experimental design is that it allows you to obtain maximum information from an experiment for minimum effort – which translates into time and money. The alternative to good experimental design is to perform microarray experiments which produce data that cannot be analysed.
You might ask why it is that this topic appears at this point in the book, after data analysis rather than earlier in the book, alongside the material on the design of microarrays themselves. There are two reasons for this. The first is that the topics in this section use concepts from some of the earlier chapters, most importantly the ideas of hypothesis tests and p-values introduced in Chapter 7. But there is also a more philosophical reason why I have chosen to place the material on experimental design after the material on data analysis. In my view, it is absolutely critical to understand the scientific questions you are trying to answer, or even the scientific hypotheses you are seeking to generate, before you design your experiment. To this end, you should have a clear idea of the structure of the data you are seeking to produce and the types of data analysis you intend to employ before you design an experiment.
This chapter considers three areas of experimental design:
Section 10.2: Blocking, Randomisation and Blinding, looks at the statistical problems of confounding and bias, and the methods that are used to resolve these issues.
Microarrays are a genomic technology. Genomics is different from genetics in that it looks not at genes in isolation, but at how many genes work together to produce phenotypic effects. In Chapter 7 we saw how microarrays can be used to study genes in isolation. But much of the real power of microarrays is their ability to be used to study the relationships between genes and to identify genes or samples that behave in a similar or coordinated manner. This chapter looks at a number of analysis techniques to find and verify such relationships.
We will use two example data sets to examine the ideas of this chapter.
EXAMPLE 8.1 YEAST SPORULATION DATA (DATA SET 8A)
Budding yeast can reproduce sexually by producing haploid cells through a process called sporulation. Yeast was placed in a sporulating medium, and samples were taken at six time points following the start of sporulation and hybridised to microarrays. We want to identify groups of genes that behave in a coordinated manner in this time series.
EXAMPLE 8.2 DIFFUSE B-CELL LYMPHOMA SUBTYPES (DATA SET 8B)
Samples were taken from 39 patients suffering from diffuse large B-cell lymphomas and hybridized to microarrays. We want to identify genes that are co-regulated in this disease. We are also interested in whether there are groups of patients with similar gene expression profiles.
This chapter discusses methods that can be used to answer such questions; it is organised into the following five sections:
Section 8.2: Similarity of Gene or Sample Profiles, looks at different methods for quantifying the similarity or dissimilarity of gene expression profiles. We show how the different methods can give different results and, hence, the need to think carefully about choosing the method you use.
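The point that different measures can give different answers is easy to demonstrate. The sketch below compares two common choices, Euclidean distance and Pearson correlation, on three invented six-point profiles. Both the choice of measures and the numbers are assumptions made for this illustration and are not taken from Data Set 8A.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented profiles over six time points.
gene_a = [1, 2, 3, 4, 5, 6]
gene_b = [11, 12, 13, 14, 15, 16]  # same shape as gene_a, higher level
gene_c = [1, 2, 3, 3, 2, 1]        # similar level to gene_a, different shape

for name, other in [("b", gene_b), ("c", gene_c)]:
    print(
        "a vs", name,
        "distance:", round(euclidean_distance(gene_a, other), 2),
        "correlation:", round(pearson_correlation(gene_a, other), 2),
    )
```

By Euclidean distance, gene c is the closer match to gene a. By correlation, gene b is a perfect match and gene c is unrelated. Which answer is appropriate depends on whether you care about expression level or about the shape of the profile.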
In this book we have described many different types of microarray experiment. All these experiments generate large volumes of complex data. As scientists, we need to be able to communicate the results of our experiments with other scientists. There are many reasons why scientists seek to share data, including the following:
▪ To verify the results of a published microarray experiment. It is accepted that for scientific results to be published in a refereed journal, it is necessary to provide sufficient information so that others can reproduce the experiment.
▪ To perform further experimental work based on the results. Microarray data frequently generate hypotheses that require further experimental investigation; often, microarray experiments are performed precisely to generate such hypotheses.
▪ To undertake further data analysis of the results. Sometimes it is possible to perform further data analysis beyond the analyses carried out by the researchers in their original paper, which requires full access to the data.
▪ To compare the results with other functional genomics data. It is valuable to make comparisons either between different microarray experiments, or between microarray data and data from other sources (e.g., proteomics).
▪ To develop novel data analysis methods. Bioinformatics researchers developing novel data analysis methods need data sets for testing their methods.
Data sharing is not simply a function of scientists. A microarray laboratory will typically run a number of different computer applications to capture, store, publish and analyse microarray data (Figure 11.1). In order for the laboratory to operate successfully, each of these computer applications needs to be able to exchange data with the others.
Normalisation is a general term for a collection of methods that are directed at resolving the systematic errors and bias introduced by the microarray experimental platform. Normalisation methods stand in contrast with the data analysis methods described in Chapters 7, 8 and 9 that are used to answer the scientific questions for which the microarray experiment has been performed. The aim of this chapter is to give an understanding of why we need to normalise microarray data, and the methods for normalisation that are most commonly used. The chapter is arranged into three further sections:
Section 5.2: Data Cleaning and Transformation, looks at the first steps in cleaning and transforming the data generated by the feature extraction software before any further analysis can take place.
Section 5.3: Within-Array Normalisation, describes methods that allow for the comparison of the Cy3 and Cy5 channels of a two-colour microarray. This section is only relevant for two-colour arrays.
Section 5.4: Between-Array Normalisation, describes methods that allow for the comparison of measurements on different arrays. This section is applicable both to two-colour and single channel arrays, including Affymetrix arrays.
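As a rough illustration of the kind of calculation involved in within-array normalisation, the sketch below converts Cy5 and Cy3 intensities to log ratios and then centres them on their median, so that a typical feature has a log ratio of zero. Median centring is only one simple possibility, chosen here as an assumption for the example. The function name and the intensity values are likewise invented.

```python
from math import log2
from statistics import median


def centred_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) for each feature, shifted so the median is zero."""
    log_ratios = [log2(red / green) for red, green in zip(cy5, cy3)]
    centre = median(log_ratios)
    return [value - centre for value in log_ratios]


# Invented intensities for three features on one two-colour array.
cy5 = [200.0, 400.0, 800.0]
cy3 = [100.0, 100.0, 100.0]
print(centred_log_ratios(cy5, cy3))
```

Before centring, every feature appears up-regulated in the Cy5 channel, which is more plausibly a systematic dye or scanner effect than biology. After centring, the log ratios are balanced around zero.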
SECTION 5.2 DATA CLEANING AND TRANSFORMATION
The microarray data generated by the feature extraction software is typically in the form of one or more text files (Table 4.2). Before you use the data to answer scientific questions, there are a number of steps that are commonly taken to ensure that the data is of high quality and suitable for analysis. This section describes three stages of data cleaning and transformation:
DNA microarrays are devices that measure the expression of many thousands of genes in parallel. They have revolutionised molecular biology, and in the past five years their use has grown rapidly in academia, medicine, and the pharmaceutical, biotechnology, agrochemical and food industries.
One of the principal features of microarrays is the volume of quantitative data that they generate. As a result, the major challenge in the field is how to handle, interpret and make use of this data. The field of bioinformatics has come to mean the applications of mathematics, statistics and information technology in the biological sciences, and the bioinformatics of microarrays is the answer to that challenge.
This book is a comprehensive guide to all of the bioinformatics you will need to successfully operate DNA microarray experiments. It is written for researchers, clinicians, laboratory heads and managers, from both biology and bioinformatics backgrounds, who work with or who intend to work with microarrays. The book covers all aspects of microarray bioinformatics, giving you the tools to design arrays and experiments, to analyze your data, and to share your results with your organisation or with the international community. It has been inspired by the Microarray Bioinformatics professional course at Oxford University, and thus would also be suitable for teaching the subject at postgraduate or professional level.
The book assumes a minimum knowledge of molecular biology, computer use and statistics. On the biology front, readers will find it helpful if they have an understanding of the basic principles of molecular biology, i.e., DNA, RNA, transcription and translation, as well as the notions of genome sequencing and the existence of sequence databases.
Age The time period elapsed since an identifiable point in the life cycle of an organism. (If a developmental stage is specified, the identifiable point would be the beginning of that stage. Otherwise the identifiable point must be specified such as planting) [MGED Ontology Definition]
Amount of nucleic acid labeled The amount of nucleic acid labeled
Amplification method The method used to amplify the nucleic acid extracted
Array design The layout or conceptual description of an array that can be implemented as one or more physical arrays. The array design specification consists of the description of the common features of the array as a whole, and the description of each array design element (e.g., each spot). MIAME distinguishes between three levels of array design elements: feature (the location on the array), reporter (the nucleotide sequence present in a particular location on the array), and composite sequence (a set of reporters used collectively to measure the expression of a particular gene)
Array design name Given name for the array design, that helps to identify a design between others (e.g., EMBL yeast 12K ver1.1)
Array dimensions The physical dimension of the array support (e.g., of slide)
Array related information Description of the array as the whole
Attachment How the element (reporter) sequences are physically attached to the array (e.g., covalent, ionic)
Author, laboratory, and contact Person(s) and organization(s) names and details (address, phone, FAX, email, URL)
Biomaterial manipulation Information on the treatment applied to the biomaterial
Data analysis is seen as the largest and possibly the most important area of microarray bioinformatics. Reflecting this, there are three chapters in this book describing data analysis methods, which themselves answer three sets of scientific questions that are asked of microarray data:
Which genes are differentially expressed in one set of samples relative to another?
What are the relationships between the genes or samples being measured?
Is it possible to classify samples based on gene expression measurements?
In this chapter, we describe the methods for the first of these questions: the search for up- or down-regulated genes; Chapters 8 and 9 answer the other two questions. This chapter covers a variety of techniques, drawn from both classical statistics and more modern theory, to give a detailed account of how to analyze DNA microarray data for differentially expressed genes. We start the chapter with three examples to illustrate what we mean by the identification of differentially expressed genes.
EXAMPLE 7.1 DATA SET 7A
Samples are taken from 20 breast cancer patients, before and after a 16-week course of doxorubicin chemotherapy, and analyzed using microarrays. We wish to identify genes that are up- or down-regulated in breast cancer following that treatment.
EXAMPLE 7.2 DATA SET 7B
Bone marrow samples are taken from 27 patients suffering from acute lymphoblastic leukemia (ALL) and 11 patients suffering from acute myeloid leukemia (AML) and analyzed using Affymetrix arrays. We wish to identify the genes that are up- or down-regulated in ALL relative to AML.
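For a paired design such as Example 7.1, where each patient is measured before and after treatment, one classical starting point is the paired t-statistic computed gene by gene. The sketch below shows the calculation for a single gene. The choice of this test is an assumption made for illustration, and the log expression values are invented for four hypothetical patients rather than taken from Data Set 7A.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Paired t-statistic for one gene measured in the same patients twice.

    before and after are equal-length lists of (log) expression values,
    ordered so that position i refers to the same patient in both lists.
    """
    differences = [a - b for b, a in zip(before, after)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error


# Invented log expression values for one gene in four patients.
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 8.0]
print(round(paired_t_statistic(before, after), 3))
```

A large positive value suggests up-regulation after treatment and a large negative value suggests down-regulation. Turning the statistic into a p-value requires the t distribution with n − 1 degrees of freedom, and testing thousands of genes at once raises multiple-testing issues that a single-gene calculation like this does not address.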
A DNA microarray consists of a solid surface, usually a microscope slide, onto which DNA molecules have been chemically bonded. The purpose of a microarray is to detect the presence and abundance of labelled nucleic acids in a biological sample, which will hybridise to the DNA on the array via Watson-Crick duplex formation, and which can be detected via the label. In the majority of microarray experiments, the labelled nucleic acids are derived from the mRNA of a sample or tissue, and so the microarray measures gene expression. The power of a microarray is that there may be many thousands of different DNA molecules bonded to an array, and so it is possible to measure the expression of many thousands of genes simultaneously.
This book is about the bioinformatics of DNA microarrays: the mathematics, statistics and computing you will need to design microarray experiments; to acquire, analyse and store your data; and to share your results with other scientists. One of the features of microarray technology is the level of bioinformatics required: it is not possible to perform a meaningful microarray experiment without bioinformatics involvement at every stage.
However, this chapter is different from the remainder of the book. While the other chapters discuss bioinformatics, the aim of this chapter is to set out the basics of the chemistry and biology of microarray technology. It is hoped that someone new to the technology will be able to read this chapter and gain an understanding of the laboratory process and how it impacts the quality of the data.
Chapter 5 described a number of methods to correct for unwanted systematic variability either within an array or between different arrays. In this chapter, we describe methods to measure and quantify the random variabilities introduced by the microarray experiment. The common sources of variability (Figure 6.1) are
▪ The variability between replicate features on the same array
▪ The variability between two separately labelled samples hybridised to the same array
▪ The variability between samples hybridised to different arrays
▪ The variability between different individuals in a population hybridised to different arrays
Estimates of these variabilities are essential to gaining an understanding of how well the microarray platform you are using is performing. They are also important parameters for determining the number of replicates required for a microarray experiment – a topic that is discussed in full in Chapter 10.
The first two levels of variability – between replicate features or samples hybridised to the same array – are meaningful only for two-colour arrays. However, the second two levels of variability – between hybridisations to different arrays and between individuals in a population – are meaningful both for two-colour arrays and Affymetrix arrays.
MEASURING AND QUANTIFYING MICROARRAY VARIABILITY
The variabilities between different features on an array, between two samples hybridized to the same array or between samples hybridized to different arrays are all introduced by the microarray experimental process. In contrast, the variation between individuals in the population is independent of the microarray process itself. Experimental variability is measured with calibration experiments; population variability is measured with pilot studies.
Calibration Experiments
The aim of a calibration experiment is to identify and quantify the sources of variability in your microarray experimental platform.
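One simple way to put a number on the first source of variability, between replicate features on the same array, is the coefficient of variation (standard deviation divided by mean) of the replicate intensities. The sketch below assumes this summary purely for illustration. The function name and the replicate values are invented, and the chapter itself may quantify variability differently.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """Sample standard deviation of replicate measurements divided by their mean."""
    return stdev(replicates) / mean(replicates)


# Invented intensities for three replicate features of one gene on one array.
replicate_intensities = [90.0, 100.0, 110.0]
cv = coefficient_of_variation(replicate_intensities)
print(f"CV between replicate features: {cv:.1%}")
```

The same calculation applied to replicate hybridisations on different arrays, or to different individuals in a pilot study, gives comparable figures for the other levels of variability.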