This chapter illustrates the use of words to derive enumeration results and algorithms for sampling and coding.
Given a family C of combinatorial structures, endowed with a size such that the subset Cn of objects of size n is finite, we consider three problems:
(i) Counting: determine, for all n ≥ 0, the cardinality Card(Cn) of the set Cn of objects of size n.
(ii) Sampling: design an algorithm RandC that, for any n, produces a random object chosen uniformly from Cn; in other words, the algorithm must satisfy P(RandC(n) = O) = 1/Card(Cn) for any object O ∊ Cn.
(iii) Optimal coding: construct a function φ that injectively maps objects of C to words of {0, 1}* in such a way that an object O of size n is coded by a word φ(O) whose length is roughly bounded above by log2 Card(Cn).
These three problems have in common an enumerative flavour, in the sense that they are immediately solved if a list of all objects of size n is available. However, since the families in which we are interested generally contain an exponential number of objects of size n, this solution is in no way satisfactory. For a wide class of so-called decomposable combinatorial structures, including unambiguous algebraic languages, algorithms with polynomial complexity can be derived quite systematically from the recursive method. Our aim is to explore classes of structures for which an even tighter link exists between counting, sampling, and coding.
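To make the link between counting and sampling concrete, here is a minimal sketch of the recursive method on Dyck words (balanced bracket words); the function names and the decomposition chosen are illustrative, not the chapter's. Counting the objects of each size first turns uniform sampling into a sequence of weighted random choices.

```python
# A minimal sketch of the recursive method: count, then sample uniformly.
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n):
    """Number of Dyck words with n bracket pairs (the Catalan numbers)."""
    if n == 0:
        return 1
    # Decomposition: every nonempty Dyck word is "(" u ")" v.
    return sum(count(k) * count(n - 1 - k) for k in range(n))

def rand_dyck(n):
    """Draw a Dyck word of size n uniformly at random."""
    if n == 0:
        return ""
    # Choose the size k of u with probability proportional to the number
    # of words admitting that decomposition; this yields uniformity.
    r = random.randrange(count(n))
    for k in range(n):
        weight = count(k) * count(n - 1 - k)
        if r < weight:
            return "(" + rand_dyck(k) + ")" + rand_dyck(n - 1 - k)
        r -= weight

print(count(5), rand_dyck(5))   # e.g. 42 (()(()))()
```

An optimal code in the sense of (iii) would record the successive choices of k, using roughly log2 Card(Cn) bits in total.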
This chapter introduces various mathematical models and combinatorial algorithms that are used to infer network expressions that appear repeated in a word or are common to a set of words, where by network expression is meant a regular expression without Kleene closure over the alphabet of the input word(s). A network expression on such an alphabet is therefore any expression built up from concatenation and union operators. For example, the expression A(C + G)T concatenates A with the union (C + G) and with T. Inferring network expressions means discovering such expressions when they are initially unknown: the only input is the word(s) in which the repeated (or common) expressions are to be sought. This contrasts with another problem, with which we shall not be concerned, of searching for a known expression in one or more words; in that case both the expression and the word(s) are part of the input. The inference of network expressions has many applications, notably in molecular biology, system security, and text mining. Because of the richness of the mathematical and algorithmic problems posed by molecular biology, we concentrate on applications in this area. The network expressions considered may therefore contain spacers, where by spacer is meant any number of don't care symbols (a don't care is a symbol that matches anything). Constrained spacers are consecutive don't care symbols whose number ranges over a fixed interval of values. Network expressions with don't care symbols but no spacers are called “simple”, while network expressions with spacers are called “flexible” if the spacers are unconstrained, and “structured” otherwise.
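As an illustration (the representation below is my own, not the chapter's), a simple network expression can be encoded as a list of symbol sets, one per position, a don't care position being the full alphabet; finding its occurrences in a word is then a direct scan.

```python
# A minimal sketch: a *simple* network expression over the DNA alphabet
# as a list of symbol sets, one per position. A(C + G)T = [{A},{C,G},{T}];
# a don't care position would be the full alphabet DNA.
DNA = set("ACGT")

def occurrences(expr, word):
    """Start positions where the simple network expression matches."""
    m = len(expr)
    return [i for i in range(len(word) - m + 1)
            if all(word[i + j] in expr[j] for j in range(m))]

expr = [{"A"}, {"C", "G"}, {"T"}]          # A(C + G)T
print(occurrences(expr, "TACTAGTAAT"))     # [1, 4]
```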
Repetitions (periodicities) in words are important objects that play a fundamental role in combinatorial properties of words and their applications to string processing, such as compression or biological sequence analysis. Using properties of repetitions allows one to speed up pattern matching algorithms.
The problem of efficiently identifying repetitions in a given word is one of the classical pattern matching problems. Recently, searching for repetitions in strings has received new motivation from biosequence analysis. In DNA sequences, successively repeated fragments often bear important biological information, and their presence is characteristic of many genomic structures (such as telomere regions). From a practical viewpoint, satellites and Alu repeats are involved in chromosome analysis and genotyping, and are thus of major interest to genomic researchers. Various biological studies based on the analysis of tandem repeats have therefore been carried out, and databases of tandem repeats in certain species have even been compiled.
In this chapter, we present a general efficient approach to computing different periodic structures in words. It is based on two main algorithmic techniques – a special factorization of the word and so-called longest extension functions – described in Section 8.3. Different applications of this method are described in Sections 8.4, 8.5, 8.6, 8.7, and 8.8. These sections are preceded by Section 8.2 devoted to combinatorial enumerative properties of repetitions. Bounding the maximal number of repetitions is necessary for proving complexity bounds of corresponding search algorithms.
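As a hedged illustration of a longest extension function (the chapter's exact definitions may differ), the classical Z-function computes, for every position i, the length of the longest common prefix of w and its suffix w[i..]; it runs in linear time and is a typical building block for repetition search.

```python
# A sketch of a longest extension function: z[i] is the length of the
# longest common prefix of w and w[i:], computed in linear time by
# reusing previously matched windows (the classical Z-algorithm).
def z_function(w):
    n = len(w)
    z = [0] * n
    l = r = 0                      # rightmost match window seen so far
    for i in range(1, n):
        if i < r:                  # reuse a previously computed match
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and w[z[i]] == w[i + z[i]]:
            z[i] += 1              # extend the match character by character
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

print(z_function("abababb"))       # [0, 0, 4, 0, 2, 0, 0]
```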
Statistical and probabilistic properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biological sequences. The latter will serve as the key example in this chapter. We only consider finite words.
Two main aspects of word occurrences in biological sequences are: where do they occur and how many times do they occur? An important problem, for instance, is to determine the statistical significance of a word's frequency in a DNA sequence. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided, for instance, in E. coli, and the Cross-over Hotspot Instigator sites in several bacteria. Identifying over- and underrepresented words in a particular genome is a very common task in genome analysis.
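To sketch the naive significance computation (an illustration only: the binomial approximation below ignores the word's overlap structure, which a rigorous analysis must take into account), one can compare the observed count of a word with its expectation under an i.i.d. letter model.

```python
# A minimal sketch of scoring over-/under-representation of a word
# under an i.i.d. letter model, via a rough normal approximation.
import math
from collections import Counter

def z_score(word, sequence):
    n, m = len(sequence), len(word)
    p = {a: c / n for a, c in Counter(sequence).items()}  # letter frequencies
    p_word = math.prod(p.get(a, 0.0) for a in word)       # P(word at a position)
    positions = n - m + 1
    expected = positions * p_word
    observed = sum(sequence[i:i + m] == word for i in range(positions))
    variance = positions * p_word * (1 - p_word)          # binomial approximation
    return (observed - expected) / math.sqrt(variance)

print(z_score("GATC", "GATCGATCAATTGGCCGATC"))
```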
Statistical methods for studying the distribution of word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research.
Probably the most important data type after vectors and free text is that of symbol strings of varying lengths. This type of data is commonplace in bioinformatics applications, where it can be used to represent proteins as sequences of amino acids, genomic DNA as sequences of nucleotides, promoters and other structures. Partly for this reason a great deal of research has been devoted to it in the last few years. Many other application domains consider data in the form of sequences so that many of the techniques have a history of development within computer science, as for example in stringology, the study of string algorithms.
Kernels have been developed to compute the inner product between images of strings in high-dimensional feature spaces using dynamic programming techniques. Although sequences can be regarded as a special case of a more general class of structures for which kernels have been designed, we will discuss them separately for most of the chapter in order to emphasise their importance in applications and to aid understanding of the computational methods. In the last part of the chapter, we will show how these concepts and techniques can be extended to cover more general data structures, including trees, arrays, graphs and so on.
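As one concrete example of such a dynamic programming computation (a sketch in the spirit of these kernels, not the chapter's exact algorithm), the fixed-length subsequences kernel counts the subsequences of length p, not necessarily contiguous, that two strings have in common; the gap-weighted variants refine the same recursion.

```python
# A sketch of the fixed-length subsequences kernel by dynamic programming.
# C[q][i][j] = number of common subsequences of length q between the
# prefixes s[:i] and t[:j], built by inclusion-exclusion on whether the
# last characters of the prefixes participate in the subsequence.
def subsequence_kernel(s, t, p):
    C = [[[1] * (len(t) + 1) for _ in range(len(s) + 1)]]  # q = 0: empty word
    for q in range(1, p + 1):
        Cq = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                Cq[i][j] = Cq[i - 1][j] + Cq[i][j - 1] - Cq[i - 1][j - 1]
                if s[i - 1] == t[j - 1]:
                    Cq[i][j] += C[q - 1][i - 1][j - 1]
        C.append(Cq)
    return C[p][len(s)][len(t)]

print(subsequence_kernel("gatta", "cata", 2))   # 5 ("aa", 2x"at", 2x"ta")
```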
Certain kernels for strings based on probabilistic modelling of the data-generating source will not be discussed here, since Chapter 12 is entirely devoted to these kinds of methods. There is, however, some overlap between the structure kernels presented here and those arising from probabilistic modelling covered in Chapter 12.
The last decade has seen an explosion of readily available digital text that has rendered attempts to analyse and classify it by hand infeasible. As a result, automatic processing of natural language text documents has become a major research interest of Artificial Intelligence (AI) and computer science in general. It is probably fair to say that, after multivariate data, natural language text is the most important data format for applications. Its particular characteristics therefore deserve specific attention.
We will see how well-known techniques from Information Retrieval (IR), such as the rich class of vector space models, can be naturally reinterpreted as kernel methods. This new perspective enriches our understanding of the approach, as well as leading naturally to further extensions and improvements. The approach that this perspective suggests is based on detecting and exploiting statistical patterns of words in the documents. An important property of the vector space representation is that the primal–dual dialectic we have developed throughout this book has an interesting counterpart in the interplay between term-based and document-based representations.
The goal of this chapter is to introduce the Vector Space family of kernel methods, highlighting their construction and the primal–dual dichotomy that they illustrate. Other kernel constructions can be applied to text, for example those using probabilistic generative models and string matching, but since these kernels are not specific to natural language text, they will be discussed separately in Chapters 11 and 12.
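As a minimal sketch of the basic vector space kernel (the function names below are my own), a document is mapped to its bag-of-words vector and the kernel is the inner product of two such vectors, so the kernel matrix is K = D Dᵀ for the document-term matrix D; working with term vectors corresponds to the primal view, working with K to the dual.

```python
# A minimal sketch of the vector space (bag-of-words) kernel.
from collections import Counter

def vsm_kernel(doc1, doc2):
    """Inner product of the term-frequency representations."""
    tf1, tf2 = Counter(doc1.split()), Counter(doc2.split())
    return sum(c * tf2[w] for w, c in tf1.items())

print(vsm_kernel("the cat sat on the mat", "the dog sat"))  # 3
```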
The previous chapter saw the development of some basic tools for working in a kernel-defined feature space resulting in some useful algorithms and techniques. The current chapter will extend the methods in order to understand the spread of the data in the feature space. This will be followed by examining the problem of identifying correlations between input vectors and target values. Finally, we discuss the task of identifying covariances between two different representations of the same object.
All of these important problems in kernel-based pattern analysis can be reduced to performing an eigen- or generalised eigen-analysis, that is, the problem of finding solutions of the equation Aw = λBw given symmetric matrices A and B. These problems range from finding a set of k directions in the embedding space containing the maximum amount of variance in the data (principal components analysis (PCA)), through finding correlations between input and output representations (partial least squares (PLS)), to finding correlations between two different representations of the same data (canonical correlation analysis (CCA)). The Fisher discriminant analysis from Chapter 5 can also be cast as a generalised eigenvalue problem.
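For instance, the generalised eigenproblem Aw = λBw can be solved directly with standard numerical routines; in the toy sketch below (the data and matrices are illustrative only), taking B = I recovers the ordinary eigen-analysis used in PCA.

```python
# A sketch: solving the generalised eigenproblem A w = lambda B w.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
A = np.cov(X, rowvar=False)        # covariance matrix of the toy data
B = np.eye(3)                      # identity: reduces to standard PCA

eigvals, eigvecs = eigh(A, B)      # solves A w = lambda B w
# The directions of maximal variance are the eigenvectors with the
# largest eigenvalues (eigh returns eigenvalues in ascending order).
print(eigvals[::-1])
print(eigvecs[:, ::-1])
```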
The importance of this class of algorithms is that the generalised eigenvector problem provides an efficient way of optimising an important family of cost functions; it can be studied with simple linear algebra and can be solved or approximated efficiently using a number of well-known techniques from computational algebra.
In this chapter we conclude our presentation of kernel-based pattern analysis algorithms by discussing three further common tasks in data analysis: ranking, clustering and data visualisation.
Ranking is the problem of learning a ranking function from a training set of ranked data. The number of ranks need not be specified, though typically the training data comes with a relative ordering specified by assignment to one of an ordered sequence of labels.
Clustering is perhaps the most important and widely used method of unsupervised learning: it is the problem of identifying groupings of similar points that are relatively ‘isolated’ from each other or, in other words, of partitioning the data into dissimilar groups of similar items. The number of such clusters may not be specified a priori. As exact solutions are often computationally hard to find, effective approximations via relaxation procedures need to be sought.
Data visualisation is often overlooked in pattern analysis and machine learning textbooks, despite being very popular in the data mining literature. It is a crucial step in the process of data analysis, enabling an understanding of the relations that exist within the data by displaying them in such a way that the discovered patterns are emphasised. These methods will allow us to visualise the data in the kernel-defined feature space, something very valuable for the kernel selection process. Technically it reduces to finding low-dimensional embeddings of the data that approximately retain the relevant information.
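As a minimal sketch of such an embedding (an illustration, not the chapter's prescribed procedure), one can project the data onto the top two eigenvectors of the centred kernel matrix, i.e. a two-dimensional kernel PCA embedding.

```python
# A sketch of kernel-based visualisation: a 2-D embedding from the
# eigendecomposition of the centred kernel matrix K.
import numpy as np

def embed_2d(K):
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    Kc = J @ K @ J                          # centre the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)
    # Scale eigenvectors so pairwise inner products approximate Kc.
    top = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))
    return top[:, ::-1]                     # leading direction first

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
coords = embed_2d(X @ X.T)                  # linear kernel as an example
print(coords.shape)                         # (20, 2)
```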