This chapter kicks off with an example of a simple Python program. Through this example, we will delve right away into the syntax of this popular programming language. The example contains quite a few technical details, which will probably be new to readers with no prior Python experience. The rest of the chapter will then walk the reader briefly yet methodically through basic Python syntax. By the end of this chapter, you will be able to write simple programs in Python to solve various problems of a computational nature.
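For a taste of what such a program looks like, here is a small illustrative snippet of our own (not the chapter's actual opening example), which counts how often each nucleotide appears in a DNA string:

```python
# A small illustrative Python program (not the chapter's actual example):
# count how many times each nucleotide appears in a DNA sequence.

dna = "ATGCGATACGCTTGA"   # a made-up sequence for demonstration

counts = {}               # dictionary mapping nucleotide -> number of occurrences
for base in dna:
    counts[base] = counts.get(base, 0) + 1

for base in sorted(counts):
    print(base, counts[base])
```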
A graph is defined by two sets. One set, the nodes, describes a collection of objects, for example biological species, such as human, whale, mountain lion, crow, pigeon, sea gull, Arabian surgeonfish, zebrafish, and anemonefish. The second set, the edges, comprises pairs of nodes, such as {human, mountain lion}, {pigeon, sea gull}, {Arabian surgeonfish, anemonefish}. Each pair represents a meaningful connection between two objects. For example, two species are related to one another if they belong to the same taxonomic family, live on the same continent, or are active in the same part of the day, and so on. Each such feature creates a different graph, because different properties create different sets of pairs (edges).
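As a sketch, the two sets above can be written directly in Python (the species and pairings simply mirror the example in the text):

```python
# One possible way to encode the species graph described above as two sets.

nodes = {"human", "whale", "mountain lion", "crow", "pigeon", "sea gull",
         "Arabian surgeonfish", "zebrafish", "anemonefish"}

# Each edge is an unordered pair of nodes; frozenset makes the pair hashable
# so it can be stored inside a set.
edges = {
    frozenset({"human", "mountain lion"}),
    frozenset({"pigeon", "sea gull"}),
    frozenset({"Arabian surgeonfish", "anemonefish"}),
}

# Example query: are two species connected in this graph?
print(frozenset({"human", "mountain lion"}) in edges)   # True
```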
Computers and computer programs are harnessed these days to solve an extremely diverse range of problems. These include weather forecasting, scheduling, designing and controlling satellite trajectories, recognizing individual human faces, playing chess, trading in the stock market, diagnosing cancer cells, assembling automobiles and airplanes, and driving autonomous cars (obviously a very partial list). In many cases, computers outperform human experts working on the same problem. In the coming decades, we may expect computers to gradually replace medical doctors (completely or partially), express emotions, compose high-quality music, and so on. Given these amazing feats, one wonders if computers are omnipotent: can computers solve every problem, provided it is well defined?
In this chapter, we introduce the notion of graph algorithms, which are simply algorithms that operate on graphs. There are many such algorithms, aimed at solving a wide range of problems. We will focus on one such problem – the shortest paths problem. Several algorithms solve this problem under different constraints. We will present the well-known breadth-first search (BFS) algorithm, which solves a simple version of the problem. This algorithm will be explained in detail and implemented in Python. We will conclude the chapter by mentioning additional common problems in graph theory.
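As a rough preview (not necessarily the chapter's own implementation), the following is a minimal sketch of BFS computing shortest path lengths, assuming the graph is stored as an adjacency dictionary:

```python
# A minimal sketch of breadth-first search (BFS) computing shortest path
# lengths (in number of edges) from a source node, for a graph given as an
# adjacency dictionary: node -> list of neighboring nodes.

from collections import deque

def bfs_distances(graph, source):
    distances = {source: 0}          # node -> distance from source
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:      # first time we reach neighbor
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Example usage on a small undirected graph.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
print(bfs_distances(graph, "a"))     # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```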
Much of the initial work in bioinformatics was centered on biological sequences of medium length, such as genes and protein sequences. Typical problems were finding common genomic motifs, e.g., conserved sequence motifs in the promoter regions of a gene family, or looking for sequence similarity among proteins of different species. These tasks, and many others, have significant biological consequences. Yet the key to their solution lies in the area of so-called stringology, which is computer science jargon for the study of string algorithms and string properties.
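To give a flavor of such string tasks, here is a toy example of our own: a naive scan that finds all positions where a short motif occurs in a longer sequence (real motif-finding tools are, of course, far more sophisticated).

```python
# A toy string-algorithm example: find all positions where a short motif
# occurs in a longer sequence, by naively checking every starting position.

def find_motif(sequence, motif):
    positions = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i + len(motif)] == motif:
            positions.append(i)
    return positions

promoter = "TATAATGCTATAAGGCTATAAT"   # made-up sequence for demonstration
print(find_motif(promoter, "TATAA"))  # [0, 8, 16]
```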
Computers allow us to store an enormous amount of data. Data come in several types, such as numbers, sound, and images. But how are these types of data represented in the computer’s memory?
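As a small illustration of the underlying idea, all of these data types ultimately reduce to bits; Python's built-ins make this visible for integers and characters:

```python
# A small illustration of how data reduce to bits: Python can show the
# binary representation of an integer, and characters are stored as
# numeric codes.

n = 77
print(bin(n))        # 0b1001101 : the integer 77 written in bits
print(ord("M"))      # 77        : the character 'M' is stored as this code
print(chr(77))       # M         : and back again
```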
Modern biology has been undergoing a dramatic revolution in recent decades. Enormous amounts of data are produced at an unprecedented rate. These data come in various forms, such as gene and protein expression levels; DNA, RNA, and protein sequences; and high-quality biological and medical images. One consequence of this “data explosion” is that computational methods are increasingly being used in life science research. Computational methods in this context are not the mere use of tools, but the integration of computational and algorithmic thinking into lab-experiment design; data generation, integration, and analysis; and modeling and simulation. It is becoming widely recognized that such thinking skills should be incorporated into the standard training of life scientists in the 21st century.
In this chapter, we will define Turing’s Halting Problem. We will then describe Turing’s proof that the Halting Problem cannot be solved by any computer, ever. This was the first problem proven to be, fundamentally, beyond the capabilities of any computer. This result subsequently served as a springboard to prove that a whole, infinite class of other problems are unsolvable as well.
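The flavor of the argument can be sketched in Python-like form. The function halts below is purely hypothetical (no such total, always-correct procedure can exist, which is exactly the point); the sketch only illustrates the self-referential contradiction at the heart of the proof.

```python
# A sketch of the classic contradiction behind Turing's proof.
# Suppose, hypothetically, that halts(program, data) could always decide
# whether `program` halts when run on `data`.

def halts(program, data):
    ...   # hypothetical: no such total, always-correct function can exist

def paradox(program):
    # If `program` would halt when run on its own source, loop forever;
    # otherwise, halt immediately.
    if halts(program, program):
        while True:
            pass
    else:
        return

# Asking whether paradox halts on its own source yields a contradiction:
# if halts(paradox, paradox) returns True, then paradox loops forever (so it
# does not halt); if it returns False, then paradox returns at once (so it
# does halt). Either way, halts gave a wrong answer.
```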
This chapter begins with a definition of what a graph is. While the definition is rather abstract, it is general enough to make graph theory relevant and applicable to a wide range of systems, in a variety of contexts. We will also see how to represent graphs in Python (or any other programming language). Such a representation allows a computerized inspection of many graph properties, as well as the implementation of various graph algorithms. By the end of this chapter, you will be familiar with many basic notions in this field. The last two sections of the chapter introduce the notions of clusters and clustering, and of hierarchical clustering in the context of phylogenetic trees. We note that the treatment of these two topics here barely scratches the surface – both are deep subjects in their own right, covered by a vast and diverse literature.
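One common representation (not necessarily the one the chapter itself adopts) is an adjacency dictionary, which makes simple graph properties easy to compute:

```python
# A graph as an adjacency dictionary: each node maps to the set of its
# neighbors. Simple properties, such as node degrees, follow directly.

graph = {
    "human":         {"mountain lion"},
    "mountain lion": {"human"},
    "pigeon":        {"sea gull"},
    "sea gull":      {"pigeon"},
    "crow":          set(),          # an isolated node with no neighbors
}

# Degree of each node = number of neighbors.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)
```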
In the previous chapter, we studied well-defined computational problems that cannot be solved by any computer program. Consider now a setting where we are given a well-defined computational problem that is solvable by some computer program, yet any such program takes so long to complete that by the time its execution terminates, the solution is no longer relevant.
In this chapter, you will encounter one of the most fundamental notions in CS – complexity of computations. We take a rather non-formal approach in presenting this topic, with an emphasis on intuition. Complexity of algorithms has two flavors – time complexity, which is related to running time, and memory, or space complexity, which reflects the memory allocation requirements of an algorithm. In this chapter, and in most other chapters as well, we will focus primarily on time complexity (except for some cases, in which memory poses a real limitation). You are not expected to become an expert in analyzing the complexity of algorithms. However, you should be able, by the end of this chapter, to understand how running time depends on an algorithm’s input size and to estimate empirically an algorithm’s actual running time. Later in the book, in chapter 11 (Mission Infeasible), we will get acquainted with the notion of complexity from another perspective – computational problems that can be solved in principle, but for which no efficient solutions are known.
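For instance, an empirical estimate of running time can be obtained with Python's standard time module; the sketch below times a simple operation for increasing input sizes, which is the basic pattern used to see how running time grows with input size.

```python
# A minimal sketch of estimating running time empirically: measure how long
# summing a list takes as the input size grows.

import time

for n in [10**5, 10**6, 10**7]:
    data = list(range(n))
    start = time.perf_counter()
    total = sum(data)
    elapsed = time.perf_counter() - start
    print(f"n = {n:>10,}   elapsed = {elapsed:.4f} seconds")
```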
Many biological systems can be explored by computer simulation: an imitation of the behavior of the system over time, done by running a computer program. Computer simulations are also termed in-silico experiments, paraphrasing the terms in-vitro and in-vivo. Computer simulation in biology can be used to replace or complement some tedious and costly lab experiments. It enables conducting numerous “experiments” under various conditions, at a scale that is infeasible experimentally.
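As a toy in-silico “experiment” (our own illustrative sketch, not one from the chapter), consider simulating the growth of a bacterial population in which each cell divides with some probability at every time step:

```python
# A toy stochastic simulation: each cell divides with probability p_divide
# at every time step, and we track the population size over time.

import random

random.seed(1)          # fix the random seed for reproducibility
population = 10
p_divide = 0.3

for step in range(1, 11):
    births = sum(1 for _ in range(population) if random.random() < p_divide)
    population += births
    print(f"step {step:2d}: population = {population}")
```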
The term image processing refers to the manipulation or analysis of digital images. While manipulation alters the image in some way to make it more meaningful or informative, analysis merely extracts information from the image without changing its pixels. This chapter describes some very basic notions in image processing, namely segmentation, morphological operators such as erosion and dilation, noise reduction, and labeling. Image processing is a whole field of expertise in its own right, involving rather sophisticated mathematical methods. This chapter aims to familiarize the reader with the field, setting the ground for further exploration of this fascinating topic.
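A small sketch of these operations, assuming NumPy and SciPy are available (the chapter's own examples may use different tools), is shown below on a tiny synthetic image:

```python
# Segmentation, erosion, and labeling on a tiny synthetic grayscale "image"
# with two bright blobs on a dark background.

import numpy as np
from scipy import ndimage

image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 8, 0],
    [0, 9, 9, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Segmentation by thresholding: pixels brighter than 5 are foreground.
segmented = image > 5

# Morphological erosion shrinks the foreground objects.
eroded = ndimage.binary_erosion(segmented)

# Labeling assigns a distinct integer to each connected component.
labels, num_objects = ndimage.label(segmented)
print(num_objects)    # 2 : two separate objects were found
print(labels)
```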
Images are used extensively nowadays. With the digital revolution, we easily generate, send, and view images, all at very low cost. Images are also used extensively in biological research and in the medical clinic. Biological images are studied mainly in basic science research, mostly at the cellular and molecular level. Medical images are used for clinical purposes and focus on the tissue and organ level. Both kinds of images are often complicated and heterogeneous, and analyzing them requires sophisticated computational techniques. The goal of these techniques is to extract meaningful knowledge from the image content: for example, the automated identification of objects in the image (such as cells, intracellular components, or a cancer tumor), the tracking of cells, organelles, or cancer cells across consecutive video frames, or phenotype identification based on object properties (size, light intensity, shape, etc.). Images also require large volumes of storage, raising the need for efficient compression algorithms.
In this part of the book, we will introduce the basics of the programming language Python (chapter 1), and learn how to use it efficiently, taking into consideration a program’s running time and memory allocation requirements (chapter 2). Later on, Python programs will accompany every topic presented throughout this book.
In many contexts, we are required to handle a large collection of objects in a way that supports inserting a new object, checking whether an object is present, and possibly deleting an object. These operations typically appear in a series of arbitrary length, and we want all of them to be performed as efficiently as possible. Consider a search engine (and its underlying infrastructure) like Google or Bing. One submits a query (e.g., “who is the King of Asteria”) and gets back about 945,000 results in roughly 0.44 seconds. One of the basic techniques behind such efficient implementations of search is called hashing.
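For a brief illustration of the insert / lookup / delete pattern, note that Python's built-in set is itself implemented with hashing, so all three operations take roughly constant time on average (the query strings below are just placeholders):

```python
# Insert, lookup, and delete on a hash-based collection: Python's built-in set.

queries_seen = set()

queries_seen.add("who is the King of Asteria")   # insert
queries_seen.add("bfs shortest path")

print("bfs shortest path" in queries_seen)        # lookup -> True
print("protein folding" in queries_seen)          # lookup -> False

queries_seen.discard("bfs shortest path")         # delete (no error if absent)
print("bfs shortest path" in queries_seen)        # False
```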