Too much faith should not be put in the powers of induction, even when aided by intelligent heuristics, to discover the right grammar. After all, stupid people learn to talk, but even the brightest apes do not.
Noam Chomsky, 1963
It seems a miracle that young children easily learn the language of any environment into which they were born. The generative approach to grammar, pioneered by Chomsky, argues that this is only explicable if certain deep, universal features of this competence are innate characteristics of the human brain. Biologically speaking, this hypothesis of an inheritable capability to learn any language means that it must somehow be encoded in the DNA of our chromosomes. Should this hypothesis one day be verified, then linguistics would become a branch of biology.
Niels Jerne, Nobel Lecture, 1984
Context-free languages correspond to the second ‘easiest’ level of the Chomsky hierarchy. They comprise the languages generated by context-free grammars (see Chapter 4).
All regular languages are context-free but the converse is not true. Among the languages that are context-free but not regular, some ‘typical’ ones are:
• {aⁿbⁿ : n ≥ 0}. This is the classical text book language used to show that automata cannot count in an unrestricted way.
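As a minimal sketch of why this language is context-free, note that it is generated by the grammar S → aSb | ε. The recogniser below simply mirrors that grammar; the function name `in_anbn` is an illustrative assumption and does not come from the text.

```python
def in_anbn(w):
    """Return True if w has the form a^n b^n for some n >= 0."""
    # Rule S -> ε: the empty string belongs to the language.
    if w == "":
        return True
    # Rule S -> a S b: strip one 'a' on the left and one 'b' on the right.
    return w[0] == "a" and w[-1] == "b" and in_anbn(w[1:-1])


print(in_anbn("aaabbb"))  # a string of the language
print(in_anbn("aabbb"))   # unbalanced, so not in the language
```

The recursion has to remember how many a's were stripped before it reaches the b's, which is the unbounded counting that a finite automaton cannot do.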
We cannot seriously propose that a child learns the values of 10⁹ parameters in a childhood lasting only 10⁸ seconds.
George A. Miller and Noam Chomsky (Miller & Chomsky, 1963).
For example, it happens that after the twelve middle numbers come the twelve last numbers; twice, say, the ball lands on these last twelve and then passes to the first twelve. Once it has fallen on the first twelve, it returns to the middle twelve; three or four times in a row the middle numbers come up, then it is the last twelve again; after two turns, it falls back on the first, which come up only once, and the middle numbers come up three times in a row; this goes on for an hour and a half or two hours. One, three and two; one, three and two. It is very curious.
Fedor Dostoïevski, Le joueur.
Let us suppose we are given a sample and an automaton. By automaton we mean the structure or at least some constraints on the number of states and some restrictive syntactical conditions on the transitions we are allowed to use. We are interested in finding a systematic way of converting the automaton into a probabilistic generator such as those we studied in Chapter 5. It would also be interesting to be able to do something similar for grammars instead of automata.
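One natural way to do this, assumed here purely for illustration and not necessarily the method developed in the chapter, is relative-frequency estimation: parse each string of the sample through a deterministic structure, count how often each transition is used and how often a string ends in each state, then normalise per state. The names `estimate_probabilities`, `delta` and the two-state structure are hypothetical.

```python
from collections import Counter


def estimate_probabilities(delta, start, sample):
    """delta maps (state, symbol) -> state; sample is a list of strings.

    Assumes every string of the sample can be parsed by delta.
    """
    visits = Counter()  # number of decisions taken in each state
    moves = Counter()   # number of times each transition is used
    halts = Counter()   # number of strings ending in each state
    for w in sample:
        q = start
        for c in w:
            visits[q] += 1
            moves[(q, c)] += 1
            q = delta[(q, c)]
        visits[q] += 1
        halts[q] += 1
    p_move = {(q, c): n / visits[q] for (q, c), n in moves.items()}
    p_halt = {q: halts[q] / visits[q] for q in visits}
    return p_move, p_halt


# Hypothetical two-state structure over {a, b}.
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
sample = ["", "a", "ab", "aa", "b"]
p_move, p_halt = estimate_probabilities(delta, "q0", sample)
print(p_move)
print(p_halt)
```

In each state the estimated transition probabilities and the halting probability sum to one, so the result can be used as a generator. Transitions never seen in the sample receive no probability in this sketch.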
He did not want to compose another Quixote, which is easy, but the Quixote. Needless to add that he never contemplated a mechanical transcription of the original; he did not propose to copy it. His admirable ambition was to produce pages that would coincide, word for word and line for line, with those of Miguel de Cervantes. ‘My undertaking is not difficult, essentially,’ I read elsewhere in the letter. ‘I would only need to be immortal to carry it out.’
Apart from the fascinating (and phoney) linguistic challenge (could a computer, like the young Tarzan of the Apes, learn a language by simply reading books written in it?), it has an interesting position in syntactic pattern recognition.
Laurent Miclet, on grammatical inference (Miclet, 1990)
Learning from text consists of inferring from a presentation of examples that all come from the target language. The learner is asked to somehow generalise from the data it sees while not having counter-examples that would help it refrain from over-generalising.
Identification in the limit from text
Learning from text is considered by many to be the essence of language learning. It is in a sense the initial problem, the one with least constraints, and the one that, once we show it cannot be solved, allows us to consider making the problem easier by adding some helpful information like negative examples, knowledge about the structure or the possibility to interrogate an Oracle.
No one can tell you how to do it. The technique must be learned the way I did it, by failures.
John Steinbeck, Travels with Charley
Similarly, a responsive informant could answer questions involving non-terminals, or instead of responding ‘No’ could give the closest valid string.
Jim Horning (Horning, 1969)
There are several situations where the learning algorithm can actively interact with its environment. Instead of using given data, the algorithm may be able to perform tests, create new strings, and find out how far it may be from the solution. The mathematical setting to do this is called active learning, where queries are made to an Oracle.
In this chapter we cover positive and negative aspects of this important paradigm in grammatical inference, but also in machine learning, with again a special focus on the case of learning deterministic finite automata.
About learning with queries
In Section 7.5 we introduced the model of learning from queries (or active learning) in order to produce negative results (which could then also apply to situations where we have less control over the examples) and also to find new inference algorithms in a more helpful but credible learning setting.
Why learn with queries?
Active learning is a paradigm first introduced with theoretical motivations, but for a number of reasons it can today also be considered a pragmatic approach.
Errors using inadequate data are much less than those using no data at all.
Charles Babbage
It is a capital mistake to theorise before one has data.
Sir Arthur Conan Doyle, A Scandal in Bohemia
Strings are a very natural way to encode information: they appear directly with linguistic data (they will then be words or sentences), or with biological data. Computer scientists have for a long time organised information into tree-like data structures. It is reasonable, therefore, that trees arise in a context where the data has been preprocessed. Typical examples are the parse trees of a program or the parse trees of natural language sentences.
Graphs will appear in settings where the information is more complex: images will be encoded into graphs, and first-order logical formulae also require graphs when one wants to associate a semantics.
Grammatical inference is a task where the goal is to learn or infer a grammar (or some device that can generate, recognise or describe strings) for a language and from all sorts of information about this language.
Grammatical inference consists of finding the grammar or automaton for a language of which we are given an indirect presentation through strings, sequences, trees, terms or graphs.
Since grammatical inference is characterised at least as much by the data from which we are asked to learn as by the sort of result, we turn to presenting some possible examples of data.
On two occasions I have been asked [by members of Parliament], ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Charles Babbage
“Was it all inevitable, John?” Reeve was pushing his fingers across the floor of the cell, seated on his haunches. I was lying on the mattress. “Yes,” I said. “I think it was. Certainly, it's written that way. The end of the book is there before the beginning's hardly started.”
Ian Rankin, Knots and Crosses
When ending this manuscript, the author decided that a certain number of things had been left implicit in the text, and could perhaps be written out clearly in some place where this would not affect the mathematical reading of the rest.
Let us discuss these points briefly here.
About convergence
Let us suppose, for the sake of argument, that the task we were developing algorithms for was the construction of random number generators. Suppose now that we had constructed such a generator, that given a seed s would return an endless series of numbers. Some of the questions we may be faced with might be:
It is no coincidence that in no known language does the phrase ‘As pretty as an airport’ appear.
Douglas Adams
Learning languages requires, for the process to be of any practical value, agreement on a representation of these languages. We turn to formal language theory to provide us with such meaningful representations, and adapt these classical definitions to the particular task of grammatical inference only when needed.
Automata and finite state machines
Automata are finite state machines used to recognise strings. They correspond to a simplified and limited version of Turing machines: a string is written on the input tape, the string is then read from left to right and, at each step, the next state of the system is chosen depending only on the previous state and the letter or symbol being read. The fact that this is the only information that can be used to parse the string makes the system powerful enough to accept just a limited class of languages called regular languages. The recognition procedure can be made deterministic by allowing only one action to be possible at each step (therefore for each state and each symbol). It is usually nicer and easier to manipulate these deterministic machines (called deterministic finite automata) because parsing is then performed in a much more convenient and economic way, and also because a number of theoretical results only apply to these.
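The recognition procedure described above can be sketched in a few lines. This is a minimal illustration under assumed names (`dfa_accepts`, `delta`, `finals`) and with a hypothetical example automaton accepting the strings over {a, b} that contain an even number of a's.

```python
def dfa_accepts(delta, start, finals, w):
    """Read w from left to right; the next state depends only on the
    current state and the symbol being read."""
    state = start
    for symbol in w:
        if (state, symbol) not in delta:
            return False  # no transition defined: the string is rejected
        state = delta[(state, symbol)]
    return state in finals


# Hypothetical DFA: 'even' and 'odd' track the parity of the number of a's.
delta = {("even", "a"): "odd", ("even", "b"): "even",
         ("odd", "a"): "even", ("odd", "b"): "odd"}

print(dfa_accepts(delta, "even", {"even"}, "abba"))
print(dfa_accepts(delta, "even", {"even"}, "ab"))
```

Because exactly one transition is possible for each state and symbol, parsing takes a single pass over the string with no backtracking.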
The biscuit tree. This remarkable vegetable production has never yet been described or delineated.
Edward Lear, Flora Nonsensica
One must always generalise.
Carl Jacobi
Formal language theory has been developed and studied consistently over the past 50 years. Because of their importance in so many fields, strings and tools to manipulate them have been studied with special care, leading to the specific topic of stringology. Usual definitions and results can therefore be found in several text books.
We re-visit these objects here from a pragmatic point of view: grammatical inference is about learning grammars and automata which are then supposed to be used by programs to deal with strings. Their advantage is that we can parse with them, compare them, compute distances… Therefore, we are primarily interested in studying how the strings are organised: knowing that a string is in a language (or perhaps more importantly out of the language) is not enough. We will also want to know how it belongs or why it doesn't belong. Other questions might be about finding close strings or building a kernel taking into account their properties. The goal is therefore going to be to organise the strings, to put some topology over them.
Notations
We start by introducing here some general notations used throughout the book.
In a definition, ‘if_def’ (an ‘if’ carrying the subscript ‘def’) is a definition ‘if’.
The main mathematical objects used in this book are letters and strings.
Among the more interesting remaining theoretical questions are: inference in the presence of noise, general strategies for interactive presentation and the inference of systems with semantics.
Jerome Feldman (Feldman, 1972)
Simplicity need not be simple, but rather complexity tightened and synthesised.
Alfred Jarry
We describe algorithm LSTAR, introduced by Dana Angluin, which has inspired several variants and adaptations to other classes of languages.
The minimally adequate teacher
A minimally adequate teacher (MAT) is an Oracle that can give answers to membership queries and strong equivalence queries. We analysed in Section 9.2 the case where you want to learn with less.
The main algorithm that works in this setting is called LSTAR. The general idea of LSTAR is:
• find a consistent observation table (representing a DFA),
• submit it as an equivalence query,
• use the counter-example to update the table,
• submit membership queries to make the table closed and complete,
• iterate until the Oracle, upon an equivalence query, tells us that the correct language has been reached.
The observation table we use is analogous to that described in Section 12.3, so we will use the same formalism here.
An observation table
An observation table is a specific tabular representation of an automaton. An example is given in Table 13.1(a).
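The loop outlined above can be sketched as follows. This is a simplified illustration and not the book's presentation of LSTAR: the Oracle is simulated by a hypothetical target language (strings ending in "ab"), the equivalence query is replaced by an exhaustive test on all strings up to a fixed length, and counter-examples are handled by adding all their prefixes to the table. All function names are assumptions.

```python
from itertools import product

ALPHABET = "ab"

def target(w):
    # Hypothetical target language: strings ending in "ab".
    return w.endswith("ab")

def member(w):
    # Membership query, answered by the simulated Oracle.
    return target(w)

def equivalent(accepts, max_len=6):
    # Stand-in for an equivalence query: return the first string up to
    # max_len on which hypothesis and target disagree, or None.
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            w = "".join(letters)
            if accepts(w) != target(w):
                return w
    return None

def row(u, E):
    # Row of the observation table for prefix u over the suffixes E.
    return tuple(member(u + e) for e in E)

def find_unclosed(S, E):
    # A one-letter extension whose row matches no row of S, if any.
    rows = {row(s, E) for s in S}
    for s, a in product(S, ALPHABET):
        if row(s + a, E) not in rows:
            return s + a
    return None

def find_inconsistency(S, E):
    # A suffix separating two prefixes of S that currently share a row.
    for s1, s2 in product(S, S):
        if s1 != s2 and row(s1, E) == row(s2, E):
            for a, e in product(ALPHABET, E):
                if member(s1 + a + e) != member(s2 + a + e):
                    return a + e
    return None

def lstar():
    S, E = [""], [""]  # prefixes (rows) and suffixes (columns)
    while True:
        while True:  # make the table closed and consistent
            u = find_unclosed(S, E)
            if u is not None:
                S.append(u)
                continue
            e = find_inconsistency(S, E)
            if e is not None:
                E.append(e)
                continue
            break
        rep = {row(s, E): s for s in S}  # one representative per state

        def accepts(w, rep=rep, E=tuple(E)):
            state = row("", E)
            for c in w:
                state = row(rep[state] + c, E)
            return state[0]  # column E[0] == "" gives acceptance

        cex = equivalent(accepts)
        if cex is None:
            return S, E, accepts
        for i in range(1, len(cex) + 1):  # add every prefix of cex
            if cex[:i] not in S:
                S.append(cex[:i])

S, E, hypothesis = lstar()
print(S, E, hypothesis("bbab"))
```

Each distinct row of the closed and consistent table plays the role of one state of the hypothesis DFA. A real equivalence query would not be bounded by `max_len`; the bound is only there to keep the sketch self-contained.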
Chance has a very nasty temper and a great appetite for jokes.
Arturo Pérez-Reverte
All knowledge degenerates into probability.
David Hume, A Treatise on Human Nature, 1740
If we suppose that the data have been obtained through sampling, that means we have (or at least believe in) an underlying probability over the strings. In most cases we do not have a description of this distribution, and we describe three plausible learning settings.
The first possibility is that the data are sampled according to an unknown distribution, and that whatever we learn from the data will be measured with respect to this distribution. This corresponds to the well-known PAC-learning setting (probably approximately correct).
The second possibility is that the data are sampled according to a distribution itself defined by a grammar or an automaton. The goal will now no longer be to classify strings but to learn this distribution. The quality of the learning process can then be measured either while accepting a small error (most of the time, since a particular sample may have been completely corrupted!), or in the limit, with probability one. One can even hope for a combination of both these criteria.
There are other possible related settings that we only mention briefly here: an important one concerns the case where the distribution in the PAC model is computable, without being generated by a grammar or an automaton. The problem remains a classification question for which we have only restricted the class of admissible distributions.
If your experiment needs statistics, you ought to have done a better experiment.
Ernest Rutherford
‘I think you're begging the question,’ said Haydock, ‘and I can see looming ahead one of those terrible exercises in probability where six men have white hats and six men have black hats and you have to work it out by mathematics how likely it is that the hats will get mixed up and in what proportion. If you start thinking about things like that, you would go round the bend. Let me assure you of that!’
Instead of defining a language as a set of strings, there are good reasons to consider the seemingly more complex idea of defining a distribution over strings. The distribution can be regular, in which case the strings are then generated by a probabilistic regular grammar or a probabilistic finite automaton. We are also interested in the special case where the automaton is deterministic.
Once distributions are defined, distances between the distributions and the syntactic objects they represent can be defined and in some cases they can be conveniently computed.
Distributions over strings
Given a finite alphabet Σ, the set Σ* of all strings over Σ is enumerable, and therefore a distribution can be defined.
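As a small illustration, consider a hypothetical distribution over Σ* with Σ = {a, b} in which each letter is emitted with probability 0.25 and the string ends with probability 0.5 at every step. Because Σ* can be enumerated by increasing length, the probability mass can be accumulated string by string. The constants and function names below are assumptions made for this sketch.

```python
from itertools import product

SIGMA = "ab"
P_SYMBOL = 0.25  # hypothetical probability of emitting each letter
P_STOP = 0.5     # hypothetical probability of halting

def prob(w):
    # Probability of generating exactly the string w.
    return (P_SYMBOL ** len(w)) * P_STOP

def mass_up_to(n):
    # Total probability of all strings of length at most n,
    # enumerated by increasing length.
    return sum(prob("".join(letters))
               for k in range(n + 1)
               for letters in product(SIGMA, repeat=k))

print(prob("ab"))
print(mass_up_to(12))
```

The accumulated mass approaches 1 as the length bound grows, which is what is required for `prob` to define a distribution over Σ*.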
Understanding is compression, comprehension is compression!
Greg Chaitin (Chaitin, 2007)
I understand. You are talking about a game in which the rules are not the starting line but the finishing point. Aren't you?
Arturo Pérez-Reverte, El pintor de batallas
‘Learning from an informant’ is the setting in which the data consists of labelled strings, each label indicating whether or not the string belongs to the target language.
Of all the issues which grammatical inference scientists have worked on, this is probably the one on which most energy has been spent over the years. Algorithms have been proposed, competitions have been launched, theoretical results have been given. On one hand, the problem has been proved to be on a par with mighty theoretical computer science questions arising from combinatorics, number theory and cryptography, and on the other hand cunning heuristics and techniques employing ideas from artificial intelligence and language theory have been devised.
There would be a point in presenting this theme with a special focus on the class of context-free grammars with a hope that the theory for the particular class of the finite automata would follow, but the history and the techniques tell us otherwise. The main focus is therefore going to be on the simpler yet sufficiently rich question of learning deterministic finite automata from positive and negative examples.
Young men should prove theorems, old men should write books.
Godfrey H. Hardy
There is nothing to writing. All you do is sit down at a typewriter and bleed.
Ernest Hemingway
A zillion grammatical inference years ago, some researchers in grammatical inference thought of writing a book about their favourite topic. If there was no agreement about the notations, the important algorithms, the central theorems or the fact that the chapter about learning from text had to come before or after the one dealing with learning from an informant, there were no protests when the intended title was proposed: the art of inferring grammars. The choice of the word art is meaningful: like in other areas of machine learning, what counted were the ideas, the fact that one was able to do something complicated like actually building an automaton from strings, and that it somehow fitted the intuition that biology and images (some typical examples) could be explained through language. This ‘artistic’ book was never written, and since then the field has matured.
When writing this book, I hoped to contribute to the idea that the field of grammatical inference has now established itself as a scientific area of research. But I also felt I would be happy if the reader could grasp those appealing artistic aspects of the field.
The artistic essence of grammatical inference was not the only problem needing to be tackled; other questions also required answers…
Mathematicians are a kind of Frenchmen: when you talk to them, they translate it into their own language, and then it is at once something quite different.
Johann Wolfgang von Goethe, Maximen und Reflexionen
Bilanguages
There are many cases where the function one wants to learn doesn't just associate a label or a probability with a given string, but should be able to return another string, perhaps even written using another alphabet. This is the case in translation, of course, between two ‘natural’ languages, but also of situations where the syntax of a text is used to extract some semantics. And it can be the situation in many other tasks where machine or human languages intervene.
There are a number of books and articles dealing with machine translation, but we will only deal here with a very simplified setting consistent with the types of finite state machines used in the previous chapters; more complex translation models based on context-free or lexicalised grammars are beyond the scope of this book.
The goal is therefore to infer special finite transducers, those representing subsequential functions.
Rational transducers
Even if in natural language translation tasks the alphabet is often the same for the two languages, this need not be so. For the sake of generality, we will therefore manipulate two alphabets, typically denoted by Σ for the input alphabet and Γ for the output one.
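To make the two-alphabet setting concrete, here is a minimal sketch of a deterministic transducer of the subsequential kind: each transition reads one symbol of Σ and emits a (possibly empty) string over Γ, and a final output string is appended when the input ends. The names `transduce`, `delta` and `final_out`, and the example machine with Σ = {a, b} and Γ = {x, y, z}, are hypothetical.

```python
def transduce(delta, final_out, start, w):
    """delta maps (state, input symbol) -> (next state, output string);
    final_out maps a state to the string appended when the input ends there.
    Returns None when the translation of w is undefined."""
    state, pieces = start, []
    for c in w:
        if (state, c) not in delta:
            return None
        state, piece = delta[(state, c)]
        pieces.append(piece)
    if state not in final_out:
        return None
    return "".join(pieces) + final_out[state]


# Hypothetical machine: every pair of a's is written as 'x', every b as 'y',
# and a trailing unpaired a is written as 'z'.
delta = {("q0", "a"): ("q1", ""), ("q1", "a"): ("q0", "x"),
         ("q0", "b"): ("q0", "y"), ("q1", "b"): ("q1", "y")}
final_out = {"q0": "", "q1": "z"}

print(transduce(delta, final_out, "q0", "aab"))
print(transduce(delta, final_out, "q0", "ba"))
```

The first transition on `a` emits nothing because the machine cannot yet know whether a second `a` will follow. Delaying output in this way, and flushing it through the final output, is what distinguishes this kind of machine from a simple letter-to-letter mapping.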