What determines the sex of an individual organism? It turns out there are different answers for different species. In many vertebrates, including mammals and birds, sex is determined by chromosomes. In other groups, sex is not determined genetically but instead by environmental conditions. For example, in alligators the temperature of the egg during a several-week period determines the sex of the resulting hatchling. In other species, such as clownfish, the sex of an individual may change during the course of its lifetime.
A fundamental question in evolutionary biology is how such different sex-determination systems arose. In this part of the book we address one specific piece of this very big question: The origin of the avian and mammalian sex-determination systems. While both systems are chromosomal, they use different chromosomes. Did the two sex-determination systems evolve independently or arise from a single common ancestral system? Surprisingly, this question can be addressed using a computational analysis of the genes of mammals and birds. In this chapter, we’ll develop a number of widely applicable biological and computational techniques. In the homework problems you’ll then use these to explore the origins of the mammalian and avian sex-determination systems.
Sixty thousand years ago, southwestern France was in the midst of an ice age. The climate was much cooler than it is today, and the region had different animals. These included many large, cold-adapted animals such as woolly mammoths, Irish elk, and cave bears.
There were also people living in the region. Artifacts from this time are abundant, and include spear points and scrapers of the kind that would be used in killing and butchering big game. Human remains reveal people with a short, stocky build, and distinctive facial features including a prominent nose and swept back cheeks. These people were Neanderthals, and by this time they had been occupying Europe for many millennia (Figure 9.1).
Since that time, Europe has come to be occupied by other humans: modern humans like ourselves. A key issue in human evolution is the connection between later humans and Neanderthals. Are modern humans descendants of Neanderthals? Or did they come from elsewhere and replace them? Computational approaches have made key contributions toward answering this fascinating question.
One of the things that distinguishes anatomically modern humans from other human groups in the fossil record is the diversity and quality of the artifacts they produced. Figure 10.1 shows three bone needles that were discovered in Liaoning province in northeastern China. These date to between 20,000 and 30,000 years ago, and were almost certainly produced by anatomically modern humans. Liaoning province is cold today, and was even colder tens of thousands of years ago. Needles like these were used to produce individually tailored sewn clothing, which is very warm.
The Liaoning site shows that anatomically modern humans were living at the far eastern edge of Asia by 20,000 to 30,000 years ago. A very important question is how they got there. Amazingly enough, this question can be addressed through the analysis of sequences taken from living humans.
How is this possible? The solution involves building a phylogenetic tree from the sequences, and then making inferences based on this tree. We’ll illustrate this with an example.
This unit contains three chapters on three different topics: RNA folding, gene regulation, and genetic algorithms. Each chapter presents a problem that leads to a programming project. Each chapter also introduces some major new ideas that are transferable to other areas of computer science and computational biology. For example, Chapter 12 examines the use of more sophisticated recursion in the context of RNA folding, Chapter 13 introduces the method of maximum likelihood in the context of inferring gene regulatory networks, and Chapter 14 introduces the concepts of algorithm efficiency, NP-hardness, and genetic algorithms.
Nucleic acids are the chief information-bearing molecules in cells. Recall that they consist of polymers of repeating units called nucleotides that contain a sugar, a phosphate group, and a nitrogenous structure called a base. The base is the part that is variable and thus carries information. In deoxyribonucleic acid (DNA) there are four types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T). The sequence of a nucleic acid polymer is defined by the order of these bases, which we can represent with a string of A's, C's, G's, and T's.
Large strands of DNA, such as are found in bacterial chromosomes, can have millions of nucleotide units. Though you might expect that the four bases would occur in roughly equal numbers in such sequences, this is often not the case. The percentage of nucleotides that are G or C, called the GC content, varies considerably among organisms and can be used to categorize and compare them. For example, Salmonella enterica typhi, the pathogenic species of Salmonella that causes typhoid fever, has a GC content of approximately 52%. The GC content of other bacteria ranges from about 25% to 75%.
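As a minimal sketch of the definition above, GC content can be computed directly from a DNA string. The function name `gc_content` and the short test sequence are illustrative assumptions, not taken from the book.

```python
def gc_content(dna):
    """Return the fraction of bases in dna that are G or C (0.0 for an empty string)."""
    if not dna:
        return 0.0
    dna = dna.upper()  # accept lowercase input as well
    gc_count = dna.count("G") + dna.count("C")
    return gc_count / len(dna)


# Hypothetical six-base sequence: 4 of its 6 bases are G or C.
print(gc_content("ATGCGC"))
```

Multiplying the result by 100 gives the percentage form used in the text, such as the roughly 52% quoted for Salmonella enterica typhi.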
We now return to the question that we posed at the beginning of this part: Are the human X and the chicken Z chromosomes homologous? That is, did they descend from a common ancestor? To answer this question, we’ll compare genes on a number of mammalian and avian chromosomes including the X and the Z. We’ll measure the similarity between pairs of mammalian and avian genes and use our results to identify orthologous pairs. Finally, having found pairs of orthologs, we’ll be able to assess the relationship between entire mammalian and avian chromosomes.
This approach requires a biologically reasonable way of scoring the similarity between pairs of genes. In Chapter 5, we developed the differences function and in Chapters 6 and 7 we developed the longest common subsequence (LCS) method. While both of these approaches provide some measure of similarity, they fail to capture some important biological processes. Thus, our first task is to develop a more realistic scoring method. With that new method in hand, we’ll return to exploring the homology of sex chromosomes.
One way that you can spot a computer scientist is that they begin counting from 0 rather than from 1. So this is Chapter 0. But it’s also Chapter 0 to signify that it’s a warm-up chapter to get you on the path to feeling comfortable with Python, the programming language that we’ll be using in this book. Every subsequent chapter will begin with an application in biology followed by the computer science ideas that we’ll need to solve that problem.
Python is a programming language that, according to its designers, aims to combine “remarkable power with very clear syntax.” Indeed, Python programs tend to be relatively short and easy to read. Perhaps for this reason, Python is growing rapidly in popularity among computer scientists, biologists, and others.
The best way to learn to program is to experiment! Therefore, we strongly urge you to pause frequently as you read this book and try some of the things that we’re doing here (and experiment with variations) in Python. It will make the reading more fun and meaningful.
In the previous chapter, we saw how computational methods can be used to find pathogenicity islands in DNA. But now we want more! Specifically, we want to find actual genes in these pathogenicity islands so that we can analyze them and determine their function.
In this chapter, you’ll write a program that finds candidate genes in the genome. We say “candidate” because some things that look like genes turn out not to be. In the next and final chapter of this unit, you’ll take the last step to write a program that determines which of the candidate genes are likely to be real. You’ll then use your gene-finding program to identify the genes in a pathogenicity island of Salmonella typhi. Finally, you’ll use a web-based search tool called BLAST to compare these genes to known genes in the GenBank database. This final comparison will allow you to infer the function of some of the genes you’ve identified.
Open Reading Frames and the Central Dogma
Genes carry the instructions for proteins, which are the main molecules that “do things” in cells. To be able to find genes we must know something about how they operate. As you may recall, the process of constructing a protein proceeds according to the central dogma of molecular biology: The sequence of nucleotides in the DNA of a gene is transcribed into the sequence of nucleotides in a messenger RNA. This messenger RNA, in turn, is read off in units of three nucleotides, called codons. Each codon specifies a particular amino acid, and the sequence of codons in the messenger RNA determines a particular sequence of amino acids in the protein.
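The flow from DNA to messenger RNA to protein can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand of the DNA (so transcription amounts to replacing T with U), the codon table lists only the four codons used in the example rather than all 64, and the function names are hypothetical.

```python
# Partial codon table: only the codons used in this example.
# The full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # methionine
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna):
    """Return the mRNA for a DNA coding strand by replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def codons(mrna):
    """Split mrna into consecutive three-letter codons, dropping any leftover bases."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


def translate(mrna):
    """Translate codon by codon, stopping at a stop codon or the end of the sequence."""
    protein = ""
    for codon in codons(mrna):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein


mrna = transcribe("ATGTTTGGCTAA")
print(codons(mrna))     # ['AUG', 'UUU', 'GGC', 'UAA']
print(translate(mrna))  # MFG
```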
This chapter is about the birds and the bees, but it’s probably not what you’re thinking! Let’s start with the backstory.
Imagine that you’re a salesperson who needs to travel to a set of cities to show your products to potential customers. The good news is that there’s a direct flight between every pair of cities and, for each pair, you’re given the cost of flying between those two cities. Your objective is to start in your home city, visit each city exactly once, and return back home at lowest total cost. This is called the Traveling Salesperson Problem and it’s one of the most famous problems in computer science.
For example, consider the set of cities and flights shown in Figure 14.1 and imagine that your start city is Aville.
A tempting way to solve the Traveling Salesperson Problem is to always take the cheapest available flight. Starting at our home city, Aville, fly on the cheapest flight. That’s the flight of cost 1 to Beesburg. (This is not yet the part about the bees.) From Beesburg, we could fly on the least expensive flight to a city that we have not yet visited, in this case Ceefield. From Ceefield, we would then fly on the cheapest flight to a city that we have not yet visited. (Remember, the problem stipulates that you only fly to a city once, presumably because you’re busy and you don’t want to fly to any city more than once, even if it might be cheaper to do so.) So now, we fly from Ceefield to Deesdale and from there to Eetown.
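The cheapest-next-flight rule described above can be sketched as follows. Figure 14.1 is not reproduced here, so every cost in `FLIGHTS` except the Aville to Beesburg flight of cost 1 is a made-up value, chosen only so that the rule visits the cities in the order the text describes. The function names are likewise hypothetical.

```python
# Flight costs between pairs of cities. Only the Aville-Beesburg cost of 1
# comes from the text; all other costs are invented for illustration.
FLIGHTS = {
    ("Aville", "Beesburg"): 1,
    ("Aville", "Ceefield"): 3,
    ("Aville", "Deesdale"): 4,
    ("Aville", "Eetown"): 10,
    ("Beesburg", "Ceefield"): 1,
    ("Beesburg", "Deesdale"): 3,
    ("Beesburg", "Eetown"): 4,
    ("Ceefield", "Deesdale"): 1,
    ("Ceefield", "Eetown"): 3,
    ("Deesdale", "Eetown"): 1,
}


def build_costs(flights):
    """Turn the pair-to-cost table into a symmetric city -> city -> cost lookup."""
    costs = {}
    for (a, b), cost in flights.items():
        costs.setdefault(a, {})[b] = cost
        costs.setdefault(b, {})[a] = cost
    return costs


def greedy_tour(costs, start):
    """Return (tour, total_cost), always flying to the cheapest unvisited city."""
    tour = [start]
    total = 0
    current = start
    unvisited = set(costs) - {start}
    while unvisited:
        # Pick the cheapest flight to a city we have not visited yet.
        next_city = min(unvisited, key=lambda city: costs[current][city])
        total += costs[current][next_city]
        tour.append(next_city)
        unvisited.remove(next_city)
        current = next_city
    # The final flight home is forced, whatever it costs.
    total += costs[current][start]
    tour.append(start)
    return tour, total


tour, total = greedy_tour(build_costs(FLIGHTS), "Aville")
print(tour)   # ['Aville', 'Beesburg', 'Ceefield', 'Deesdale', 'Eetown', 'Aville']
print(total)  # 14
```

With these invented costs, each of the first four flights costs 1, but the forced return from Eetown to Aville costs 10. Choosing the cheapest flight at each step says nothing about the cost of the last leg.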
Certain regions of the Salmonella genome contain genes that are directly involved in causing disease. These so-called pathogenicity islands have genes with functions related to invading and living inside a host organism. As you might expect, medical researchers are very interested in identifying and studying such regions. One characteristic that is useful in locating pathogenicity islands is GC content. The GC content inside a pathogenicity island often differs significantly from what is found in the rest of the genome. For this reason, it’s useful to find sections of the genome with unusual GC content. This allows us to zoom in on parts of the genome that are candidate pathogenicity islands.
In this chapter, you’ll write a program that computes and reports the GC content of different regions of the genome. In the following chapters, we’ll refine our search even further to find individual genes in the pathogenicity islands.
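One plausible shape for such a program is to slide along the genome in fixed-size windows and report the GC fraction of each. The sketch below assumes non-overlapping windows and drops any partial window at the end. The function name, window size, and sequence are illustrative choices, not the book's.

```python
def windowed_gc(dna, window_size):
    """Return a list of (start, gc_fraction) pairs for consecutive,
    non-overlapping windows; a trailing partial window is ignored."""
    dna = dna.upper()
    results = []
    for start in range(0, len(dna) - window_size + 1, window_size):
        window = dna[start:start + window_size]
        gc_count = window.count("G") + window.count("C")
        results.append((start, gc_count / window_size))
    return results


# Hypothetical sequence: an AT-only window followed by a GC-only window.
print(windowed_gc("ATATGCGC", 4))  # [(0, 0.0), (4, 1.0)]
```

Windows whose GC fraction differs markedly from the rest of the genome would then be the candidate regions to examine more closely.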
How Salmonella Enters Host Cells
Below is an image showing Salmonella bacteria invading cultured human cells. The bacteria will physically enter and reside inside these human cells.
Biologists are frequently interested in the relationships between species. To represent such relationships they use a diagram called a phylogenetic tree. The phylogenetic tree in Figure 9.2 consists of a set of nodes connected by branches. In this example, we have marked the nodes with dots. The ones on the right are called leaf nodes or simply leaves. They represent the species whose relationships we want to understand. In this example, the leaves are five currently living primate species, including great apes and humans. There are also internal nodes, which represent hypothesized ancestral species. For example, the node marked in red represents the most recent common ancestor of chimpanzees and humans, an animal that lived between 5 and 8 million years ago.
The tree provides information about the evolutionary relationships between species. We can use the internal nodes to define groups of closely related species called clades. All the species descended from a particular internal node form a clade, and are more closely related to each other than they are to other species. Thus the red dot in Figure 9.2 defines a clade that we might call the human-chimpanzee clade. This clade includes three living species: human, the common chimpanzee, and the pygmy chimpanzee. The fact that they are in a clade together tells us, for example, that chimpanzees are more closely related to humans than they are to gorillas.
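The idea of a clade can be illustrated with a small tree written as nested Python tuples, where a string is a leaf and a pair is an internal node with two children. Figure 9.2 is not reproduced here, so the tree below is an assumption: the text names human, the two chimpanzee species, and gorilla, while the fifth species (orangutan) and the exact branching are supplied only for illustration. The function names are hypothetical.

```python
# An internal node is a (left, right) tuple; a leaf is a species name.
# The orangutan leaf and the overall shape are assumptions, not from the text.
HUMAN_CHIMP = ("Human", ("Common chimpanzee", "Pygmy chimpanzee"))
TREE = ((HUMAN_CHIMP, "Gorilla"), "Orangutan")


def leaves(tree):
    """Return the species at the leaves of tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def smallest_clade(tree, species):
    """Return the leaves of the smallest subtree containing every name in
    species, or None if tree does not contain all of them."""
    if not set(species) <= set(leaves(tree)):
        return None
    if not isinstance(tree, str):
        # If one child already contains all the species, the clade is inside it.
        for child in tree:
            result = smallest_clade(child, species)
            if result is not None:
                return result
    return leaves(tree)


print(smallest_clade(TREE, ["Human", "Common chimpanzee"]))
# ['Human', 'Common chimpanzee', 'Pygmy chimpanzee']
print(smallest_clade(TREE, ["Common chimpanzee", "Gorilla"]))
# ['Human', 'Common chimpanzee', 'Pygmy chimpanzee', 'Gorilla']
```

The smallest clade containing humans and chimpanzees excludes gorillas, while any clade containing chimpanzees and gorillas must also contain humans. This is the sense in which chimpanzees are more closely related to humans than to gorillas.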
So far, everything we’ve done with recursion we could have also done with for loops. The recursive functions that we’ve seen were conceptually somewhat different from the for loop versions, but recursion hasn’t yet given us powers to compute things that we couldn’t compute before. But now it will!
Peptide Fragments
Imagine that we have a set of fragments of proteins, where each fragment has a given mass. For example, the masses of five protein fragments might be [2, 3, 8, 10, 12] in some unit of mass. In addition, we know the mass of the original protein, say 25 units for the sake of example. The question is whether or not there is a subset of our list of fragment masses that adds up to the mass of the protein. In this example, the answer is “yes”: fragments with masses 3, 10, and 12 add up to 25. On the other hand, if the fragments had masses [2, 15, 17, 20], no subset would add up to 25. For now, we’re assuming that each mass in the list can be used at most once. This problem arises in the study of protein structure and has been the focus of a recent study.
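One way to answer this question recursively is to consider the first fragment and try both possibilities: either it is part of the subset or it is not. The sketch below is an illustration of that idea rather than the book's own solution, and the function name `subset_sums_to` is hypothetical.

```python
def subset_sums_to(masses, target):
    """Return True if some subset of masses (each used at most once) sums to target."""
    if target == 0:
        return True   # the empty subset adds up to 0
    if not masses:
        return False  # nothing left to use, and target is not 0
    first, rest = masses[0], masses[1:]
    # Either use the first fragment (reducing the target) or skip it.
    return subset_sums_to(rest, target - first) or subset_sums_to(rest, target)


print(subset_sums_to([2, 3, 8, 10, 12], 25))  # True (3 + 10 + 12)
print(subset_sums_to([2, 15, 17, 20], 25))    # False
```

Each call branches into two smaller calls, so the function explores every possible subset in the worst case. That pattern of branching is difficult to express with a fixed number of nested for loops.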
The computer is the most powerful general-purpose tool available to biologists.
In part, this is due to the continuing rapid growth of biological data. For example, at the time of writing, the GenBank database had over 100 million genetic sequences with over 100 billion DNA characters. Among the contents of that database are genes from many organisms, annotated with what’s known about their function.
Imagine that you’re studying a bacterium and wish to understand what causes it to be infectious. One promising approach is to identify genes in the bacterium and compare these to known genes in GenBank. If you’re able to find similar genes whose function is known, it will tell you a great deal about the role of the genes in your bacterium. This approach represents a computational challenge, and is, in fact, the topic of Part I of this book.
But searching enormous databases is not the only reason that computers are so useful to biologists. Many biological problems have a large number of different possible solutions and only a computer – programmed with carefully designed computational recipes or “algorithms” – has any chance of finding the right one. For example, biological molecules such as proteins and RNA fold into complex shapes that strongly impact their function. Computational techniques have been developed to predict how these molecules fold. Such techniques help us understand how proteins and RNA work and can even help us design new molecules to treat disease.
Acquired Immune Deficiency Syndrome or AIDS was first recognized in the early 1980s. Within a few years of its discovery, scientists had identified the virus that causes it, called the Human Immunodeficiency Virus or HIV. Through the 1980s and early 1990s, the HIV/AIDS epidemic grew steadily, with the number of affected people and the number of deaths increasing every year. By the late 1990s it was causing millions of deaths per year worldwide, and had a prevalence of more than 20% in the adult population of some countries.
An epidemic of this magnitude requires multiple types of response. One approach has been to limit the spread of the disease through education, for example, by encouraging the use of condoms and discouraging drug users from sharing needles. A second approach has been to carry out research into the basic biology of HIV in an effort to develop better treatments or even a vaccine.
The programming problems for this section are connected with basic research into HIV. They involve predicting the RNA secondary structure in a single gene from the HIV genome. We’ll talk about secondary structure and its prediction shortly. But first, let’s briefly discuss HIV’s genome and life cycle.