To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
Whereas the String and StringBuilder classes provide a set of methods that can be used to process string-based data, the RegEx and its supporting classes provide much more power for string-processing tasks. String processing mostly involves looking for patterns in strings (pattern matching) and it is performed via a special language called a regular expression. In this chapter, we look at how to form regular expressions and how to use them to solve common text processing tasks.
AN INTRODUCTION TO REGULAR EXPRESSIONS
A regular expression is a language that describes patterns of characters in strings, along with descriptors for repeating characters, alternatives, and groupings of characters. Regular expressions can be used to perform both searches in strings and substitutions in strings.
A regular expression itself is just a string of characters that define a pattern you want to search for in another string. Generally, the characters in a regular expression match themselves, so that the regular expression “the” matches that sequence of characters wherever they are found in a string.
A regular expression can also include special characters that are called metacharacters. Metacharacters are used to signify repetition, alternation, or grouping. We will examine how these metacharacters are used shortly.
Most experienced computer users have used regular expressions in their work, even if they weren't aware they were doing so at the time.
Data organize naturally as lists. We have already used the Array and ArrayList classes for handling data organized as a list. Although those data structures helped us group the data in a convenient form for processing, neither structure provides a real abstraction for actually designing and implementing problem solutions.
Two list-oriented data structures that provide easy-to-understand abstractions are stacks and queues. Data in a stack are added and removed from only one end of the list, whereas data in a queue are added at one end and removed from the other end of a list. Stacks are used extensively in programming language implementations, from everything from expression evaluation to handling function calls. Queues are used to prioritize operating system processes and to simulate events in the real world, such as teller lines at banks and the operation of elevators in buildings.
C# provides two classes for using these data structures: the Stack class and the Queue class. We'll discuss how to use these classes and look at some practical examples in this chapter.
STACKS, A STACK IMPLEMENTATION AND THE STACK CLASS
The stack is one of the most frequently used data structures, as we just mentioned. We define a stack as a list of items that are accessible only from the end of the list, which is called the top of the stack. The standard model for a stack is the stack of trays at a cafeteria.
For many applications, data are best stored as lists, and lists occur naturally in day-to-day life: to-do lists, grocery lists, and top-ten lists. In this chapter, we explore one particular type of list, the linked list. Although the .NET Framework class library contains several list-based collection classes, the linked list is not among them. The chapter starts with an explanation of why we need linked lists, then we explore two different implementations of the data structure—object-based linked lists and array-based linked lists. The chapter finishes up with several examples of how linked lists can be used for solving computer programming problems you may run across.
THE PROBLEM WITH ARRAYS
The array is the natural data structure to use when working with lists. Arrays provide fast access to stored items and are easy to loop through. And, of course, the array is already part of the language and you don't have to use extra memory and processing time using a user-defined data structure.
But as we've seen, the array is not the perfect data structure. Searching for an item in an unordered array is slow because you have to possibly visit every element in the array before finding the element you're searching for. Ordered (sorted) arrays are much more efficient for searching, but insertions and deletions are slow because you have to shift the elements up or down to either make space for an insertion or remove space with a deletion.
The BitArray class is used to represent sets of bits in a compact fashion. Bit sets can be stored in regular arrays, but we can create more efficient programs if we use data structures specifically designed for bit sets. In this chapter, we'll look at how to use this data structure and examine some problems that can be solved using sets of bits. The chapter also includes a review of the binary numbers, the bitwise operators, and the bitshift operators.
A MOTIVATING PROBLEM
Let's look at a problem we will eventually solve using the BitArray class. The problem involves finding prime numbers. An ancient method, discovered by the third-century b.c. Greek philosopher Eratosthenes, is called the sieve of Eratosthenes. This method involves filtering numbers that are multiples of other numbers, until the only numbers left are primes. For example, let's determine the prime numbers in the set of the first 100 integers. We start with 2, which is the first prime. We move through the set removing all numbers that are multiples of 2. Then we move to 3, which is the next prime. We move through the set again, removing all numbers that are multiples of 3. Then we move to 5, and so on. When we are finished, all that will be left are prime numbers.
A set is a collection of unique elements. The elements of a set are called members. The two most important properties of sets are that the members of a set are unordered and no member can occur in a set more than once. Sets play a very important role in computer science but are not included as a data structure in C#.
This chapter discusses the development of a Set class. Rather than providing just one implementation, however, we provide two. For nonnumeric items, we provide a fairly simple implementation using a hash table as the underlying data store. The problem with this implementation is its efficiency. A more efficient Set class for numeric values utilizes a bit array as its data store. This forms the basis of our second implementation.
FUNDAMENTAL SET DEFINITIONS, OPERATIONS AND PROPERTIES
A set is defined as an unordered collection of related members in which no member occurs more than once. A set is written as a list of members surrounded by curly braces, such as {0,1,2,3,4,5,6,7,8,9}. We can write a set in any order, so the previous set can be written as {9,8,7,6,5,4,3,2,1,0} or any other combination of the members so that all members are written just once.
The array is the most common data structure, present in nearly all programming languages. Using an array in C# involves creating an array object of System.Array type, the abstract base type for all arrays. The Array class provides a set of methods for performing tasks such as sorting and searching that programmers had to build by hand in the past.
An interesting alternative to using arrays in C# is the ArrayList class. An arraylist is an array that grows dynamically as more space is needed. For situations where you can't accurately determine the ultimate size of an array, or where the size of the array will change quite a bit over the lifetime of a program, an arraylist may be a better choice than an array.
In this chapter, we'll quickly touch on the basics of using arrays in C#, then move on to more advanced topics, including copying, cloning, testing for equality and using the static methods of the Array and ArrayList classes.
ARRAY BASICS
Arrays are indexed collections of data. The data can be of either a built-in type or a user-defined type. In fact, it is probably the simplest just to say that array data are objects. Arrays in C# are actually objects themselves because they derive from the System.Array class. Since an array is a declared instance of the System.Array class, you have the use of all the methods and properties of this class when using arrays.
Hashing is a very common technique for storing data in such a way the data can be inserted and retrieved very quickly. Hashing uses a data structure called a hash table. Although hash tables provide fast insertion, deletion, and retrieval, operations that involve searching, such as finding the minimum or maximum value, are not performed very quickly. For these types of operations, other data structures are preferred (see, for example, Chapter 12 on binary search trees).
The .NET Framework library provides a very useful class for working with hash tables, the Hashtable class. We will examine this class in the chapter, but we will also discuss how to implement a custom hash table. Building hash tables is not very difficult and the programming techniques used are well worth knowing.
AN OVERVIEW OF HASHING
A hash table data structure is designed around an array. The array consists of elements 0 through some predetermined size, though we can increase the size later if necessary. Each data item is stored in the array based on some piece of the data, called the key. To store an element in the hash table, the key is mapped into a number in the range of 0 to the hash table size using a function called a hash function.
The two most common operations performed on data stored in a computer are sorting and searching. This has been true since the beginning of the computing industry, which means that sorting and searching are also two of the most studied operations in computer science. Many of the data structures discussed in this book are designed primarily to make sorting and/or searching easier and more efficient on the data stored in the structure.
This chapter introduces you to the fundamental algorithms for sorting and searching data. These algorithms depend on only the array as a data structure and the only “advanced” programming technique used is recursion. This chapter also introduces you to the techniques we'll use throughout the book to informally analyze different algorithms for speed and efficiency.
SORTING ALGORITHMS
Most of the data we work with in our day-to-day lives is sorted. We look up definitions in a dictionary by searching alphabetically. We look up a phone number by moving through the last names in the book alphabetically. The post office sorts mail in several ways—by zip code, then by street address, and then by name. Sorting is a fundamental process in working with data and deserves close study.
As was mentioned earlier, there has been quite a bit of research performed on different sorting techniques. Although some very sophisticated sorting algorithms have been developed, there are also several simple sorting algorithms you should study first. These sorting algorithms are the insertion sort, the bubble sort, and the selection sort.
This book discusses the development and implementation of data structures and algorithms using C#. The data structures we use in this book are found in the .NET Framework class library System.Collections. In this chapter, we develop the concept of a collection by first discussing the implementation of our own Collection class (using the array as the basis of our implementation) and then by covering the Collection classes in the .NET Framework.
An important addition to C# 2.0 is generics. Generics allow the C# programmer to write one version of a function, either independently or within a class, without having to overload the function many times to allow for different data types. C# 2.0 provides a special library, System.Collections.Generic, that implements generics for several of the System.Collections data structures. This chapter will introduce the reader to generic programming.
Finally, this chapter introduces a custom-built class, the Timing class, which we will use in several chapters to measure the performance of a data structure and/or algorithm. This class will take the place of Big O analysis, not because Big O analysis isn't important, but because this book takes a more practical approach to the study of data structures and algorithms.
COLLECTIONS DEFINED
A collection is a structured data type that stores data and provides operations for adding data to the collection, removing data from the collection, updating data in the collection, as well as operations for setting and returning the values of different attributes of the collection.
In this chapter, we look at two advanced topics: dynamic programming and greedy algorithms. Dynamic programming is a technique that is often considered to be the reverse of recursion—a recursive solution starts at the top and breaks the problem down solving all small problems until the complete problem is solved; a dynamic programming solution starts at the bottom, solving small problems and combining them to form an overall solution to the big problem.
A greedy algorithm is an algorithm that looks for “good solutions” as it works toward the complete solution. These good solutions, called local optima, will hopefully lead to the correct final solution, called the global optimum. The term “greedy” comes from the fact these algorithms take whatever solution looks best at the time. Often, greedy algorithms are used when it is almost impossible to find a complete solution, due to time and/or space considerations, yet a suboptimal solution is acceptable.
DYNAMIC PROGRAMMING
Recursive solutions to problems are often elegant but inefficient. The C# compiler, along with other language compilers, will not efficiently translate the recursive code to machine code, resulting in an inefficient, though elegant computer program.
Many programming problems that have recursive solutions can be rewritten using the techniques of dynamic programming. A dynamic programming solution builds a table, usually using an array, which holds the results of the different subsolutions. Finally, when the algorithm is complete, the solution is found in a distinct spot in the table.
In this chapter, we present a set of advanced data structures and algorithms for performing searching. The data structures we cover include the red–black tree, the splay tree, and the skip list. AVL trees and red–black trees are two solutions to the problem of handling unbalanced binary search trees. The skip list is an alternative to using a tree-like data structure that foregoes the complexity of the red–black and splay trees.
AVL TREES
Another solution to maintaining balanced binary trees is the AVL tree. The name AVL comes from the two computer scientists who discovered this data structure, G. M. Adelson-Velskii and E. M. Landis, in 1962. The defining characteristic of an AVL tree is that the difference between the height of the right and left subtrees can never be more than one.
AVL Tree Fundamentals
By continually comparing the heights of the left and right subtrees of a tree, the AVL tree is guaranteed to always stay “in balance.” AVL trees utilize a technique, called a rotation, to keep them in balance.
To understand how a rotation works, let's look at a simple example that builds a binary tree of integers. Starting with the tree shown in Figure 15.1, if we insert the value 10 into the tree, the tree becomes unbalanced, as shown in Figure 15.2. The left subtree now has a height of 2, but the right subtree has a height of 0, violating the rule for AVL trees.
Consider two graphs G1 and G2 on the same vertex set V and suppose that Gi has mi edges. Then there is a bipartition of V into two classes A and B so that, for both i = 1, 2, we have . This gives an approximate answer to a question of Bollobás and Scott. We also prove results about partitions into more than two vertex classes. Our proofs yield polynomial algorithms.
We introduce the minor-closed, dual-closed class of multi-path matroids. We give a polynomial-time algorithm for computing the Tutte polynomial of a multi-path matroid, we describe their basis activities, and we prove some basic structural properties. Key elements of this work are two complementary perspectives we develop for these matroids: on the one hand, multi-path matroids are transversal matroids that have special types of presentations; on the other hand, the bases of multi-path matroids can be viewed as sets of lattice paths in certain planar diagrams.
The partition functions of the Ising and Potts models in statistical mechanics are well known to be partial evaluations of the Tutte–Whitney polynomial of the appropriate graph. The Ashkin–Teller model generalizes the Ising model and the four-state Potts model, and has been extensively studied since its introduction in 1943. However, its partition function (even in the symmetric case) is not a partial evaluation of the Tutte–Whitney polynomial. In this paper, we show that the symmetric Ashkin–Teller partition function can be obtained from a generalized Tutte–Whitney function which is intermediate in a precise sense between the usual Tutte–Whitney polynomialof the graph and that of its dual.
Consider the following communication problem, which leads to a new notion of edge colouring. The communication network is represented by a bipartite multigraph, where the nodes on one side are the transmitters and the nodes on the other side are the receivers. The edges correspond to messages, and every edge e is associated with an integer c(e), corresponding to the time it takes the message to reach its destination. A proper k-edge-colouring with delays is a function f from the edges to {0, 1, . . ., k − 1}, such that, for every two edges e1 and e2 with the same transmitter, f(e1) ≠ f(e2), and for every two edges e1 and e2 with the same receiver, f(e1) + c(e1) ≢ f(e2) + c(e2) (mod k). Ross, Bambos, Kumaran, Saniee and Widjaja [18] conjectured that there always exists a proper edge colouring with delays using k = Δ + o(Δ) colours, where Δ is the maximum degree of the graph. Haxell, Wilfong and Winkler [11] conjectured that a stronger result holds: k = Δ + 1 colours always suffice. We prove the weaker conjecture for simple bipartite graphs, using a probabilistic approach, and further show that the stronger conjectureholds for some multigraphs, applying algebraic tools.
We show that a random 4-regular graph asymptotically almost surely (a.a.s.) has chromatic number 3. The proof uses an efficient algorithm which a.a.s. 3-colours a random 4-regular graph. The analysis includes use of the differential equation method, and exponential bounds on the tail of random variables associated with branching processes. A substantial part of the analysis applies to random d-regular graphs in general.
Let A be a set of N matrices. Let g(A) ≔ |A + A| + |A · A|, where A + A = {a1 + a2 ∣ ai ∈ A} and A · A = {a1a2 ∣ ai ∈ A} are the sum set and product set. We prove that if the determinant of the difference of any two distinct matrices in A is nonzero, then g(A) cannot be bounded below by cN for any constant c. We also prove that if A is a set of d × d symmetric matrices, then there exists ϵ = ϵ(d)>0 such that g(A)>N1+ϵ. For the first result, we use the bound on the number of factorizations in a generalized progression. For the symmetric case, we use a technical proposition which provides an affine space V containing a large subset E of A, with the property that if an algebraic property holds for a large subset of E, then it holds for V. Then we show that the system a2 : a ∈ V is commutative, allowing us to decompose as eigenspaces simultaneously, so we can finish the proof with induction and a variant of the Erdős–Szemerédi argument.