A useful software-engineering principle is information hiding or encapsulation. A module may provide values of a given type, but the representation of that type is known only to the module. Clients of the module may manipulate the values only through operations provided by the module. In this way, the module can assure that the values always meet consistency requirements of its own choosing.
Object-oriented programming languages are designed to support information hiding. Because the “values” may have internal state that the operations will modify, it makes sense to call them objects. Because a typical “module” manipulates only one type of object, we can eliminate the notion of module and (syntactically) treat the operations as fields of the objects, where they are called methods.
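As a concrete illustration, here is a minimal sketch in C of an encapsulated counter module; the names (counter.h, Counter, and the operations) are invented for this example. Clients see only an opaque pointer and the exported operations; the representation is visible only inside the implementation file.

    /* counter.h -- the interface the module exports */
    typedef struct Counter Counter;          /* representation hidden */
    Counter *counter_new(void);
    void     counter_incr(Counter *c);
    int      counter_value(const Counter *c);

    /* counter.c -- only this file knows the representation */
    #include <stdlib.h>
    #include "counter.h"
    struct Counter { int n; };               /* invariant: n >= 0 */
    Counter *counter_new(void) {
        Counter *c = malloc(sizeof *c);
        c->n = 0;                            /* establish the invariant */
        return c;
    }
    void counter_incr(Counter *c) { c->n++; }
    int  counter_value(const Counter *c) { return c->n; }

Because no client can touch n directly, the module alone is responsible for keeping the invariant true.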
Another important characteristic of object-oriented languages is the notion of extension or inheritance. If some program context (such as the formal parameter of a function or method) expects an object that supports methods m1, m2, m3, then it will also accept an object that supports m1, m2, m3, m4.
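Many implementations realize this with a record of method pointers, laid out so that the extended object's record begins with the same fields as the original. The following C fragment (with invented names) illustrates the layout idea, though it elides details a real compiler must get right:

    /* An object supporting m1, m2, m3: a record of method pointers. */
    struct obj3 { void (*m1)(void); void (*m2)(void); void (*m3)(void); };

    /* An extended object adds m4 after the inherited methods, so its
       first three fields are laid out exactly like an obj3; code that
       expects an obj3 can therefore operate on an obj4 as well. */
    struct obj4 { void (*m1)(void); void (*m2)(void); void (*m3)(void);
                  void (*m4)(void); };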
CLASSES
To illustrate the techniques of compiling object-oriented languages I will use a simple class-based object-oriented language called Object-Tiger.
Over the past decade, there have been several shifts in the way compilers are built. New kinds of programming languages are being used: object-oriented languages with dynamic methods, functional languages with nested scope and first-class function closures; and many of these languages require garbage collection. New machines have large register sets and a high penalty for memory access, and can often run much faster with compiler assistance in scheduling instructions and managing instructions and data for cache locality.
This book is intended as a textbook for a one- or two-semester course in compilers. Students will see the theory behind different components of a compiler, the programming techniques used to put the theory into practice, and the interfaces used to modularize the compiler. To make the interfaces and programming examples clear and concrete, I have written them in the C programming language. Other editions of this book are available that use the Java and ML languages.
Implementation project. The “student project compiler” that I have outlined is reasonably simple, but is organized to demonstrate some important techniques that are now in common use: abstract syntax trees to avoid tangling syntax and semantics, separation of instruction selection from register allocation, copy propagation to give flexibility to earlier phases of the compiler, and containment of target-machine dependencies. Unlike many “student compilers” found in textbooks, this one has a simple but sophisticated back end, allowing good register allocation to be done after instruction selection.
ca-non-i-cal: reduced to the simplest or clearest schema possible
Webster's Dictionary
The trees generated by the semantic analysis phase must be translated into assembly or machine language. The operators of the Tree language are chosen carefully to match the capabilities of most machines. However, there are certain aspects of the tree language that do not correspond exactly with machine languages, and some aspects of the Tree language interfere with compile-time optimization analyses.
For example, it's useful to be able to evaluate the subexpressions of an expression in any order. But the subexpressions of Tree.exp can contain side effects – eseq and call nodes that contain assignment statements and perform input/output. If tree expressions did not contain eseq and call nodes, then the order of evaluation would not matter.
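For instance, in this C fragment (invented for illustration), the value of the sum depends on which operand is evaluated first, precisely because f has a side effect that g observes:

    int t = 0;
    int f(void) { t = t + 1; return t; }      /* side effect on t */
    int g(void) { return 10 * t; }            /* reads t */

    int main(void) {
        int result = f() + g();   /* 11 if f() runs first, 1 if g() does */
        return result;
    }

If neither operand had side effects, the two orders would agree, and the compiler could pick whichever order is convenient.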
Some of the mismatches between Trees and machine-language programs are:
The cjump instruction can jump to either of two labels, but real machines' conditional jump instructions fall through to the next instruction if the condition is false.
eseq nodes within expressions are inconvenient, because they make different orders of evaluating subtrees yield different results.
call nodes within expressions cause the same problem.
call nodes within the argument-expressions of other call nodes will cause problems when trying to put arguments into a fixed set of formal-parameter registers.
Why does the Tree language allow eseq and two-way cjump, if they are so troublesome?
The front end of the compiler translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register, if a and b are never “in use” at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.
Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is live if it holds a value that may be needed in the future, so this analysis is called liveness analysis.
To perform analyses on a program, it is often useful to make a control-flow graph. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y. Graph 10.1 shows the flow graph for a simple loop.
Let us consider the liveness of each variable (Figure 10.2). A variable is live if its current value will be used in the future, so we analyze liveness by working from the future to the past. Variable b is used in statement 4, so b is live on the 3 → 4 edge.
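The computation can be phrased as a fixed-point iteration over the equations in[n] = use[n] ∪ (out[n] − def[n]) and out[n] = the union of in[s] over all successors s of n. Here is a small self-contained C sketch on a hypothetical six-statement loop; the node numbering, def/use sets, and bit-set encoding are inventions of this example, not the book's interface.

    #include <stdio.h>

    /* A hypothetical loop of six statements, one flow-graph node each.
       Variables a, b, c are encoded as bits 0, 1, 2 of an int. */
    enum { N = 6, A = 1<<0, B = 1<<1, C = 1<<2 };
    int succ[N][2] = { {1,-1}, {2,-1}, {3,-1}, {4,-1}, {1,5}, {-1,-1} };
    int def[N]     = { A, B, C, A, 0, 0 };
    int use[N]     = { 0, A, B|C, B, A, C };

    int main(void) {
        int in[N] = {0}, out[N] = {0}, changed = 1;
        while (changed) {                     /* iterate to a fixed point */
            changed = 0;
            for (int n = N-1; n >= 0; n--) {  /* backward order converges fast */
                int o = 0;
                for (int i = 0; i < 2; i++)
                    if (succ[n][i] >= 0) o |= in[succ[n][i]];
                int newin = use[n] | (o & ~def[n]);  /* use ∪ (out − def) */
                if (o != out[n] || newin != in[n]) changed = 1;
                out[n] = o;
                in[n] = newin;
            }
        }
        for (int n = 0; n < N; n++)
            printf("node %d: in=%x out=%x\n", n+1, in[n], out[n]);
        return 0;
    }

Running it shows, for instance, that b is live out of node 2 and into nodes 3 and 4, just as tracing the uses backward by hand would predict.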
sched-ule: a procedural plan that indicates the time and sequence of each operation
Webster's Dictionary
A simple computer can process one instruction at a time. First it fetches the instruction, then decodes it into opcode and operand specifiers, then reads the operands from the register bank (or memory), then performs the arithmetic denoted by the opcode, then writes the result back to the register bank (or memory), and then fetches the next instruction.
Modern computers can execute parts of many different instructions at the same time. At the same time the processor is writing results of two instructions back to registers, it may be doing arithmetic for three other instructions, reading operands for two more instructions, decoding four others, and fetching yet another four. Meanwhile, there may be five instructions delayed, awaiting the results of memory-fetches.
Such a processor usually fetches instructions from a single flow of control; it's not that several programs are running in parallel, but that adjacent instructions of a single program are decoded and executed simultaneously. This is called instruction-level parallelism (ILP), and is the basis for much of the astounding advance in processor speed in the last decade of the twentieth century.
A pipelined machine performs the write-back of one instruction in the same cycle as the arithmetic “execute” of the next instruction and the operand-read of the previous one, and so on.
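Schematically, with stages Fetch, Decode, Read, Execute, and Write, successive instructions of a pipelined machine overlap like this (one column per cycle):

    cycle:           1   2   3   4   5   6   7
    instruction 1:   F   D   R   E   W
    instruction 2:       F   D   R   E   W
    instruction 3:           F   D   R   E   W

Once the pipeline is full, one instruction completes in every cycle, even though each individual instruction still takes five cycles from fetch to write-back.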
func-tion: a mathematical correspondence that assigns exactly one element of one set to each element of the same or another set
Webster's Dictionary
The mathematical notion of function is that if f(x) = a “this time,” then f(x) = a “next time”; there is no other value equal to f(x). This allows the use of equational reasoning familiar from algebra: that if a = f(x) then g(f(x), f(x)) is equivalent to g(a, a). Pure functional programming languages encourage a kind of programming in which equational reasoning works, as it does in mathematics.
Imperative programming languages have similar syntax: a ← f(x). But if we follow this by b ← f(x) there is no guarantee that a = b; the function f can have side effects on global variables that make it return a different value each time. Furthermore, a program might assign into variable x between calls to f(x), so f(x) really means a different thing each time.
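A tiny C illustration of such a side effect (names invented):

    static int calls = 0;
    int f(int x) { calls++; return x + calls; }

    /* a = f(1) yields 2; a later b = f(1) yields 3, so a != b:
       equational reasoning about f is no longer valid. */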
Higher-order functions. Functional programming languages also allow functions to be passed as arguments to other functions, or returned as results. Functions that take functional arguments are called higher-order functions.
Higher-order functions become particularly interesting if the language also supports nested functions with lexical scope (also called block structure). As in Tiger, lexical scope means that each function can refer to variables and parameters of any function in which it is nested. A higher-order functional language is one with nested scope and higher-order functions.
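C itself has neither nested functions nor first-class closures, but the standard implementation technique, pairing a code pointer with a record of captured variables, can be sketched directly (all names here are invented):

    #include <stdlib.h>

    /* A closure: a code pointer plus the captured environment. */
    struct closure {
        int (*code)(struct closure *self, int arg);
        int n;                                /* the captured variable */
    };

    static int add_n(struct closure *self, int arg) { return arg + self->n; }

    struct closure *make_adder(int n) {       /* "returns a function" */
        struct closure *c = malloc(sizeof *c);
        c->code = add_n;
        c->n = n;                             /* capture n */
        return c;
    }

    /* Usage: struct closure *inc = make_adder(1);
              inc->code(inc, 41) evaluates to 42. */

This is essentially the representation a compiler for a higher-order functional language builds when a nested function escapes its enclosing scope.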
dom-i-nate: to exert the supreme determining or guiding influence on
Webster's Dictionary
Many dataflow analyses need to find the use-sites of each defined variable or the definition-sites of each variable used in an expression. The def-use chain is a data structure that makes this efficient: for each statement in the flow graph, the compiler can keep a list of pointers to all the use sites of variables defined there, and a list of pointers to all definition sites of the variables used there. In this way the compiler can hop quickly from use to definition to use to definition.
An improvement on the idea of def-use chains is static single-assignment form, or SSA form, an intermediate representation in which each variable has only one definition in the program text. The one (static) definition-site may be in a loop that is executed many (dynamic) times, thus the name static single-assignment form instead of single-assignment form (in which variables are never redefined at all).
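For example (an invented fragment), each assignment gets a fresh version of its target, and where two control-flow paths merge, SSA form inserts a φ-function to select between the renamed versions:

    original:                      SSA form:
      if c then a := x + 1           if c then a1 := x + 1
      else      a := y - 1           else      a2 := y - 1
      b := a * 2                     a3 := φ(a1, a2)
                                     b1 := a3 * 2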
The SSA form is useful for several reasons:
Dataflow analysis and optimization algorithms can be made simpler when each variable has only one definition.
If a variable has N uses and M definitions (which occupy about N + M instructions in a program), it takes space (and time) proportional to N · M to represent def-use chains – a quadratic blowup (see Exercise 19.8). For almost all realistic programs, the size of the SSA form is linear in the size of the original program (but see Exercise 19.9).
Heap-allocated records that are not reachable by any chain of pointers from program variables are garbage. The memory occupied by garbage should be reclaimed for use in allocating new records. This process is called garbage collection, and is performed not by the compiler but by the runtime system (the support programs linked with the compiled code).
Ideally, we would say that any record that is not dynamically live (will not be used in the future of the computation) is garbage. But, as Section 10.1 explains, it is not always possible to know whether a variable is live. So we will use a conservative approximation: we will require the compiler to guarantee that any live record is reachable; we will ask the compiler to minimize the number of reachable records that are not live; and we will preserve all reachable records, even if some of them might not be live.
Figure 13.1 shows a Tiger program ready to undergo garbage collection (at the point marked garbage-collect here). There are only three program variables in scope: p, q, and r.
MARK-AND-SWEEP COLLECTION
Program variables and heap-allocated records form a directed graph. The variables are roots of this graph. A node n is reachable if there is a path of directed edges r → ··· → n starting at some root r.
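A self-contained C sketch of mark-and-sweep over a toy heap follows; the fixed-size heap, two-pointer record layout, and freelist are assumptions of this sketch, not a real collector's interface.

    #include <stddef.h>

    #define HEAP_SIZE 100

    typedef struct record {
        int marked;
        struct record *f1, *f2;    /* two pointer fields, for simplicity */
    } Record;

    static Record  heap[HEAP_SIZE];
    static Record *freelist = NULL;

    static void mark(Record *p) {            /* depth-first search from p */
        if (p == NULL || p->marked) return;
        p->marked = 1;                       /* visit each record once */
        mark(p->f1);
        mark(p->f2);
    }

    static void sweep(void) {
        for (int i = 0; i < HEAP_SIZE; i++) {
            if (heap[i].marked)
                heap[i].marked = 0;          /* clear for the next collection */
            else {
                heap[i].f1 = freelist;       /* unreachable: reclaim onto freelist */
                freelist = &heap[i];
            }
        }
    }

    /* Collect: mark everything reachable from the roots, then sweep. */
    void collect(Record *roots[], int nroots) {
        for (int i = 0; i < nroots; i++)
            mark(roots[i]);
        sweep();
    }

A real collector must also handle records of varying size and fields that are not pointers, and the recursion in mark is usually replaced by an explicit stack or pointer reversal to bound memory use.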
A compiler was originally a program that “compiled” subroutines [a link-loader]. When in 1954 the combination “algebraic compiler” came into use, or rather into misuse, the meaning of the term had already shifted into the present one.
Bauer and Eickel [1975]
This book describes techniques, data structures, and algorithms for translating programming languages into executable code. A modern compiler is often organized into many phases, each operating on a different abstract “language.” The chapters of this book follow the organization of a compiler, each covering a successive phase.
To illustrate the issues in compiling real programming languages, I show how to compile Tiger, a simple but nontrivial language of the Algol family, with nested scope and heap-allocated records. Programming exercises in each chapter call for the implementation of the corresponding phase; a student who implements all the phases described in Part I of the book will have a working compiler. Tiger is easily modified to be functional or object-oriented (or both), and exercises in Part II show how to do this. Other chapters in Part II cover advanced techniques in program optimization. Appendix A describes the Tiger language.
The interfaces between modules of the compiler are almost as important as the algorithms inside the modules. To describe the interfaces concretely, it is useful to write them down in a real programming language. This book uses ML – a strict, statically typed functional programming language with modular structure.
ab-stract: disassociated from any specific instance
Webster's Dictionary
A compiler must do more than recognize whether a sentence belongs to the language of a grammar – it must do something useful with that sentence. The semantic actions of a parser can do useful things with the phrases that are parsed.
In a recursive-descent parser, semantic action code is interspersed with the control flow of the parsing actions. In a parser specified in Yacc, semantic actions are fragments of C program code attached to grammar productions.
SEMANTIC ACTIONS
Each terminal and nonterminal may be associated with its own type of semantic value. For example, in a simple calculator using Grammar 3.35, the type associated with exp and INT might be int; the other tokens would not need to carry a value. The type associated with a token must, of course, match the type that the lexer returns with that token.
For a rule A → B C D, the semantic action must return a value whose type is the one associated with the nonterminal A. But it can build this value from the values associated with the matched terminals and nonterminals B, C, D.
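In Yacc, this pairing of types and actions looks roughly like the following calculator fragment (a sketch in the spirit of Grammar 3.35, not its exact text); $$ denotes the value of the nonterminal on the left, and $1, $3 the values of the matched symbols:

    %union { int num; }
    %token <num> INT
    %type  <num> exp
    %left '+'
    %left '*'
    %%
    exp : INT            { $$ = $1; }
        | exp '+' exp    { $$ = $1 + $3; }
        | exp '*' exp    { $$ = $1 * $3; }
        ;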
RECURSIVE DESCENT
In a recursive-descent parser, the semantic actions are the values returned by parsing functions, or the side effects of those functions, or both.
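A self-contained C sketch (for an invented grammar of integers separated by +) in which each parsing function returns the semantic value of the phrase it parses:

    #include <ctype.h>
    #include <stdio.h>

    static const char *p;                     /* cursor into the input */

    static int parse_factor(void) {           /* factor -> INT */
        int v = 0;
        while (isdigit((unsigned char)*p))
            v = v * 10 + (*p++ - '0');
        return v;                             /* semantic value: the number */
    }

    static int parse_exp(void) {              /* exp -> factor { + factor } */
        int v = parse_factor();
        while (*p == '+') {
            p++;
            v += parse_factor();              /* action: combine the values */
        }
        return v;
    }

    int main(void) {
        p = "1+20+300";
        printf("%d\n", parse_exp());          /* prints 321 */
        return 0;
    }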
loop: a series of instructions that is repeated until a terminating condition is reached
Webster's Dictionary
Loops are pervasive in computer programs, and a great proportion of the execution time of a typical program is spent in one loop or another. Hence it is worthwhile devising optimizations to make loops go faster. Intuitively, a loop is a sequence of instructions that ends by jumping back to the beginning. But to be able to optimize loops effectively we will use a more precise definition.
A loop in a control-flow graph is a set of nodes S including a header node h with the following properties:
From any node in S there is a path of directed edges leading to h.
There is a path of directed edges from h to any node in S.
There is no edge from any node outside S to any node in S other than h.
Thus, the dictionary definition (from Webster's) is not the same as the technical definition.
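As a small check of the definition, consider an invented graph with edges 1 → 2, 2 → 3, 3 → 2, and 3 → 4. The set S = {2, 3} with header h = 2 is a loop: each node of S has a path to 2, node 2 has a path to each node of S, and the only edge entering S from outside (1 → 2) enters at the header. By contrast, {2, 3, 4} is not a loop with header 2, because node 4 has no path back to 2.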
Figure 18.1 shows some loops. A loop entry node is one with some predecessor outside the loop; a loop exit node is one with a successor outside the loop. Figures 18.1c, 18.1d, and 18.1e illustrate that a loop may have multiple exits, but may have only one entry. Figures 18.1e and 18.1f contain nested loops.