Faced with the necessity of saying, in a finite space and in an extremely finite time, what I believe the thesaurus theory of language to be, I have decided on the following procedure.
First, I give, in logical and mathematical terms, what I believe to be the abstract outlines of the theory. This account may sound abstract, but it is currently being put to practical use. That is to say, with its help an actual thesaurus to be used for medium-scale mechanical translation (MT) tests, and consisting of specifications in terms of archeheads, heads and syntax markers, made upon words, is being constructed straight on to punched cards. The cards are multiply punched; a nuisance, but they have to be, since the thesaurus in question has 800 heads. There is also an engineering bottleneck about interpreting them; at present, if we wish to reproduce the pack, every reproduced card has to be written on by hand, which makes the reproduction an arduous business; a business also that will become more and more arduous as the pack grows larger. If this interpreting difficulty can be overcome, however, we hope to be able to offer to reproduce this punched-card thesaurus mechanically, as we finish it, for any other MT group that is interested, so that, at last, repeatable, thesauric translations (or mistranslations) can be obtained.
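To fix ideas, the specification scheme just described, entries made upon words in terms of archeheads, heads and syntax markers, with one punch column per head, can be given a minimal sketch in modern terms. Everything below (the field names, the example word and its head numbers) is the editor's invention for illustration; only the 800-head figure comes from the text.

```python
# A minimal sketch (not CLRU's actual card format) of a thesaurus entry as
# described above: specifications in terms of archeheads, heads and syntax
# markers made upon a word. The 800 heads are modelled as a bitset,
# mirroring the multiple punching of a single card.

from dataclasses import dataclass, field

N_HEADS = 800  # the thesaurus in question has 800 heads

@dataclass
class ThesaurusEntry:
    word: str
    archeheads: set[str] = field(default_factory=set)    # coarse classes
    heads: set[int] = field(default_factory=set)         # indices 0..799
    syntax_markers: set[str] = field(default_factory=set)

    def as_card(self) -> int:
        """Pack the head set into one integer bitset, one bit per head
        column, as a stand-in for a multiply punched card."""
        card = 0
        for h in self.heads:
            if not 0 <= h < N_HEADS:
                raise ValueError(f"head {h} outside 0..{N_HEADS - 1}")
            card |= 1 << h
        return card

entry = ThesaurusEntry(
    word="plough",
    archeheads={"THING", "ACT"},
    heads={272, 371},            # hypothetical Roget-style head numbers
    syntax_markers={"noun", "verb"},
)
print(bin(entry.as_card()).count("1"))  # -> 2 punched head columns
```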
This chapter examines a first-stage translation, made with the aid of Roget's Thesaurus, of a passage from Virgil's Georgics from Latin into English.
The essential feature of this program is the use of a thesaurus as an interlingua: the translation operations are carried out on a head language into which the input text is transformed and from which an output is obtained. The notion of ‘heads’ is taken from the concepts or topics under which Roget classified words in his thesaurus. These operations are of three kinds: semantic, syntactic and grammatical.
The general arrangement of the program is as follows:
1. Dictionary matching: the chunks of the input language are matched with the entries in a Latin interlingual dictionary, giving the raw material of the head language; this consists of heads representing the semantic, syntactic and grammatical elements of the input.
2. Operations on the semantic heads: these give a first-stage translation.
3. Operations on the syntactic heads: giving a syntactically complete, though unparsed, translation.
4. Operations on the grammatical heads: giving a parsed and correctly ordered output.
5. Cleaning-up operations: the output is ‘trimmed’ by, e.g., insertion of capital letters and removal of repetitions like ‘farmer-er’.
Only Stage 2 of the procedure is given in detail here.
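Read together, the five stages compose into a single pipeline. The sketch below is the editor's schematic illustration of that arrangement, not the CLRU program itself: the stage bodies are toy stand-ins and the tiny dictionary is invented, but the data flow between stages, ending with the ‘farmer-er’ style of trimming, is as described above.

```python
# A schematic sketch of the five-stage arrangement (editor's illustration,
# not the CLRU program): each stage is a function applied to the output of
# the one before.

def dictionary_matching(chunks):
    """Stage 1: look chunks up in a (toy, invented) interlingual dictionary."""
    toy_dict = {"COL-": "FARMER", "-A": "AGENT"}
    return [toy_dict[c] for c in chunks]

def semantic_ops(heads):
    """Stage 2: operations on the semantic heads give a first-stage translation."""
    # keep content heads, pass grammatical markers through
    return [h.lower() if h != "AGENT" else h for h in heads]

def syntactic_ops(items):
    """Stage 3: a syntactically complete, though unparsed, translation."""
    return ["the"] + items

def grammatical_ops(items):
    """Stage 4: a parsed, correctly ordered output; AGENT realised as '-er'."""
    return [("-er" if x == "AGENT" else x) for x in items]

def clean_up(items):
    """Stage 5: trimming, e.g. capitals and removal of repetitions like 'farmer-er'."""
    text = " ".join(items).replace(" -er", "-er").replace("farmer-er", "farmer")
    return text.capitalize()

out = ["COL-", "-A"]
for stage in (dictionary_matching, semantic_ops, syntactic_ops,
              grammatical_ops, clean_up):
    out = stage(out)
print(out)  # -> 'The farmer'
```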
Information obtained from Stage 1
The Latin sentence to be translated was chunked as follows:
AGRI-COL-A IN-CURV-O TERR-AM DI-MOV-IT AR-ATRO
A number of these chunks generated syntactic heads only. Those with semantic head entries are AGRI-, COL-, IN-, CURV-, TERR-, DI-, MOV- and AR-.
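How the semantic head operations of Stage 2 might then select among candidate outputs can be illustrated with a toy fragment of this example. The candidate words and head labels below are invented for illustration; the principle, choosing for each chunk the output word whose thesaurus heads overlap most with the heads contributed by the rest of the sentence, is one reconstruction of how a thesaurus can act as an interlingua.

```python
# An illustrative sketch of Stage 2 (semantic head operations). The head
# labels and candidate lists are invented for the example; only the
# chunk names come from the chapter.

# Each semantic chunk maps to candidate English words, each tagged with
# hypothetical Roget-style heads.
CANDIDATES = {
    "AGRI-": [("field", {"LAND", "AGRICULTURE"})],
    "COL-":  [("tiller", {"AGRICULTURE"}), ("colleague", {"FRIENDSHIP"})],
    "TERR-": [("earth", {"LAND", "WORLD"}), ("terror", {"FEAR"})],
    "AR-":   [("plough", {"AGRICULTURE", "TOOL"}), ("altar", {"RITE"})],
}

def context_heads(chunks, skip):
    """Union of the heads of every candidate of every chunk except `skip`."""
    heads = set()
    for c in chunks:
        if c == skip:
            continue
        for _, hs in CANDIDATES[c]:
            heads |= hs
    return heads

def pick(chunks):
    out = []
    for c in chunks:
        ctx = context_heads(chunks, c)
        # choose the candidate sharing the most heads with the context
        word, _ = max(CANDIDATES[c], key=lambda wh: len(wh[1] & ctx))
        out.append(word)
    return out

print(pick(["AGRI-", "COL-", "TERR-", "AR-"]))
# -> ['field', 'tiller', 'earth', 'plough']
```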
Margaret Masterman was ahead of her time by some twenty years: many of her beliefs and proposals for language processing by computer have now become part of the common stock of ideas in the artificial intelligence (AI) and machine translation (MT) fields. She was never able to lay adequate claim to them because they were unacceptable when she published them, and so when they were written up later by her students or independently ‘discovered’ by others, there was no trace back to her, especially in these fields where little or nothing over ten years old is ever reread. Part of the problem, though, lay in herself: she wrote too well, which is always suspicious in technological areas. Again, she was a pupil of Wittgenstein, and a proper, if eccentric, part of the whole Cambridge analytical movement in philosophy, which meant that it was always easier and more elegant to dissect someone else's ideas than to set out one's own in a clear way. She therefore found her own critical articles being reprinted (e.g. chapter 11, below) but not the work she really cared about: her theories of language structure and processing.
The core of her beliefs about language processing was that it must reflect the coherence of language, its redundancy as a signal.
The purpose of the paper that I want here to present is to make a suggestion for computing semantic paragraph patterns.
I had thought that just putting forward this suggestion would involve putting forward a way of looking at language so different from that of everyone else present, either from the logical side or the linguistic side, that I would get bogged down in peripheral controversy to the extent of never getting to the point. I was going to start by saying, ‘Put on my tomb: “This is what she was trying for”.’ But it is not so.
I don't know what has happened, but I don't disagree with Yehoshua Bar-Hillel as much as I did.
And on the linguistic side I owe this whole colloquium an apology and put forward the excuse that I was ill. I ought to have mastered the work of Weinreich (1971). I am trying to. But it is not so simple a matter to master a complex work in a discipline quite different from that which one ordinarily follows.
I may misinterpret, but it seems to me that the kind of suggestion I put forward in this paper could be construed as a crude way of doing the kind of thing Weinreich has asked for. But Yehoshua Bar-Hillel is actually very right when he wants to question all the time what real use the computer can be in this field. So don't be misled by the size of the output in this paper.
The logical effect that adopting the logical unit of the MT chunk, instead of the free word, has on the problem of compiling a dictionary.
Dictionary trees: an example of the tree of uses of the Italian chunk PIANT-.
Outline of a mechanical translation programme using a thesaurus.
Examples of trials made with a model procedure for testing this: translations of ESSENZ-E, GERMOGL-I and SI PRESENT-A from the Cambridge Language Research Unit's current pilot project.
The simplifications that the use of a thesaurus makes in the research needed to achieve idiomatic machine translation.
Some preliminary remarks on the problem of coding a thesaurus.
1. In MT literature it is usually assumed that compiling an MT dictionary is, for the linguist, a matter of routine, and that the main problem lies in providing sufficient computer storage to accommodate it. Such judgements fail to take account either of the unpredictability of language (Reifler, 1954) or of the profound change in the conception of a dictionary produced by the substitution of the MT chunk for the free word.
By chunk is meant here the smallest significant language unit that (1) can exist in more than one context and (2) that it pays, for practical purposes, to enter by itself in an MT dictionary. Extensive linguistic data are often required to decide when it is, and when it is not, worthwhile to enter a language unit by itself as a separate chunk.
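One simple way such a chunk dictionary might be applied, greedy longest-match from the left, is sketched below. The dictionary entries are invented for the example (they follow the AGRI-COL-A chunking used elsewhere in this volume); deciding which units pay to be entered separately is exactly the linguistic question raised above.

```python
# A minimal sketch of one way a chunk dictionary might be applied:
# greedy longest-match from the left. Entries are invented for the example.

CHUNK_DICT = {"AGRI", "COL", "A", "IN", "CURV", "O", "TERR", "AM"}

def chunk(word: str) -> list[str]:
    """Split `word` into the longest dictionary chunks, left to right."""
    chunks, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest match first
            if word[i:j] in CHUNK_DICT:
                chunks.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no chunk matches at {word[i:]!r}")
    return chunks

print(chunk("AGRICOLA"))   # -> ['AGRI', 'COL', 'A']
print(chunk("INCURVO"))    # -> ['IN', 'CURV', 'O']
```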
Let us take a word, its sign, and its set of uses.
Let us, in the simplest case, relate these to reality; about which we shall say no more than that, since it develops in time, the uses of the word must also develop in time.
Let us denote the word by a point; its sign by an ideograph (Masterman 1954); and its set of uses – all linking on to reality at unknown but different points, yet all radiating out from the original point denoting the word, because they are all uses of that same word – by a set of spokes radiating from that point.
Let us call the logical unit so constructed a FAN; Figure 2 gives the essential idea of such a fan.
From Figure 2 several facts can be made clear. The first is that, however many uses the word may have (however many spokes the fan may have), they will always be marked with the same sign. But it does not follow from this that all the uses of the word mean the same thing; that they all have the same meaning in use.
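For readers who find a data-structure rendering helpful, the fan can be sketched as follows. The class and field names are the editor's, not Masterman's notation; the point is only that every spoke (use) carries the one sign, while nothing about the structure forces the uses to share a meaning.

```python
# A sketch of the fan as a data structure (editor's rendering): one point
# for the word, one sign, and a spoke for each use. All spokes carry the
# same sign, but sameness of sign does not imply sameness of meaning in use.

from dataclasses import dataclass, field

@dataclass
class Fan:
    word: str                                        # the point
    sign: str                                        # the ideograph on every spoke
    uses: list[str] = field(default_factory=list)    # the spokes

    def add_use(self, context: str) -> None:
        """A new spoke: the word's set of uses develops in time."""
        self.uses.append(context)

plant = Fan(word="plant", sign="PLANT*")
plant.add_use("plant a tree")   # one use
plant.add_use("power plant")    # a quite different use, same sign
print(len(plant.uses), "spokes, one sign:", plant.sign)
```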
This book is a posthumous tribute to Margaret Masterman and the influence of her ideas and life on the development of the processing of language by computers, a part of what would now be called artificial intelligence. During her lifetime she did not publish a book, and this volume is intended to remedy that by reprinting some of her most influential papers, many of which never went beyond research memoranda from the Cambridge Language Research Unit (CLRU), which she founded and which became a major centre in that field. However, the style in which she wrote, and the originality of the structures she presented as the basis of language processing by machine, now require some commentary and explanation in places if they are to be accessible today, most particularly by relating them to more recent and more widely publicised work where closely related concepts occur.
In this volume, eleven of Margaret Masterman's papers are grouped by topic, and in a general order reflecting their intellectual development. Three are accompanied by a commentary by the editor, where this was thought helpful, and a fourth by a commentary by Karen Spärck Jones, which she wrote when reissuing that particular paper and which is used by permission. The themes of the papers recur, and some of the commentaries touch on the content of a number of the papers.
There are two reasons why I am writing this preface to a presentation of Peter Guberina's hypothesis that there exists a single formula for semantic progression at the basis of all human communication. I think, firstly, that this hypothesis, whether universally true in its present form or not, represents a new generative idea of the first magnitude in the basic research of the mechanical translation field.
It is insufficiently appreciated by workers in other fields how many fundamental new basic hypotheses about the nature and characteristics of human communication MT research has already thrown up. There is the CLRU (and others') idea that a semantic system of thesaurus type can be mathematically represented by a lattice, with algorithms performed on it and a mathematics of semantic classification built up from it; there is Yngve's hypothesis of the ‘limit in depth’, which must occur in the grouping of linguistic units within sentences; there is A. F. Parker-Rhodes' (and my) idea of the applicability of the mathematical notion of lattice centrality to the notion of exocentric syntactic form; there is Ida Rhodes' idea that quite simple conditional probability chains can be used in doing syntactic analysis (because that is what her idea really is); there is Chomsky's idea that a full language can be mechanically constructed by deriving it mathematically from a small set of kernels; and now there is this Guberina hypothesis that there must exist one and only one basic form of semantic progression (which is both quite simple and also formalisable) at the basis of all human communication. There is current widespread detraction of the MT field because of the false claims that have been made on behalf of the present state of the art in political quarters and in the popular press.
This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. The Linguistic Annotation Framework is intended to serve as a basis for harmonizing existing language resources as well as developing new ones.
In this paper we present recent work on GATE, a widely-used framework and graphical development environment for creating and deploying Language Engineering components and resources in a robust fashion. The GATE architecture has facilitated the development of a number of successful applications for various language processing tasks (such as Information Extraction, dialogue and summarisation), the building and annotation of corpora and the quantitative evaluations of LE applications. The focus of this paper is on recent developments in response to new challenges in Language Engineering: Semantic Web, integration with Information Retrieval and data mining, and the need for machine learning support.
Every building, and every computer program, has an architecture: structural and organisational principles that underpin its design and construction. The garden shed once built by one of the authors had an ad hoc architecture, extracted (somewhat painfully) from the imagination during a slow and non-deterministic process that, luckily, resulted in a structure which keeps the rain on the outside and the mower on the inside (at least for the time being). As well as being ad hoc (i.e. not informed by analysis of similar practice or relevant science or engineering) this architecture is implicit: no explicit design was made, and no records or documentation kept of the construction process.
We present the RAGS (Reference Architecture for Generation Systems) framework, a specification of an abstract Natural Language Generation (NLG) system architecture to support sharing, re-use, comparison and evaluation of NLG technologies. We argue that the evidence from a survey of actual NLG systems calls for a different emphasis in a reference proposal from that seen in similar initiatives in information extraction and multimedia interfaces. We introduce the framework itself, in particular the two-level data model that allows us to support the complex data requirements of NLG systems in a flexible and coherent fashion, and describe our efforts to validate the framework through a range of implementations.
IBM Research has over 200 people working on Unstructured Information Management (UIM) technologies with a strong focus on Natural Language Processing (NLP). These researchers are engaged in activities ranging from natural language dialog, information retrieval, topic-tracking, named-entity detection, document classification and machine translation to bioinformatics and open-domain question answering. An analysis of these activities strongly suggested that improving the organization's ability to quickly discover each other's results and rapidly combine different technologies and approaches would accelerate scientific advance. Furthermore, the ability to reuse and combine results through a common architecture and a robust software framework would accelerate the transfer of research results in NLP into IBM's product platforms. Market analyses indicating a growing need to process unstructured information, specifically multilingual, natural language text, coupled with IBM Research's investment in NLP, led to the development of a middleware architecture for processing unstructured information, dubbed UIMA. At the heart of UIMA are powerful search capabilities and a data-driven framework for the development, composition and distributed deployment of analysis engines. In this paper we give a general introduction to UIMA, focusing on the design points of its analysis engine architecture, and we discuss how UIMA is helping to accelerate research and technology transfer.
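The architectural idea these last abstracts share (GATE, RAGS, UIMA) is that independent analysis engines are composed into pipelines over a common annotation store. The toy sketch below illustrates that idea only; it is not GATE's or UIMA's actual API, and the engine names and annotation types are invented.

```python
# A toy illustration (not UIMA's or GATE's real API) of the shared
# architectural idea: independent analysis engines composed into a
# pipeline, all reading and writing one common annotation store.

from typing import Callable

# The shared store: document text plus (start, end, type) annotations.
Document = dict  # {"text": str, "annotations": list[tuple[int, int, str]]}
Engine = Callable[[Document], None]

def tokenizer(doc: Document) -> None:
    """Annotate whitespace-delimited tokens."""
    i = 0
    for tok in doc["text"].split():
        start = doc["text"].index(tok, i)
        doc["annotations"].append((start, start + len(tok), "Token"))
        i = start + len(tok)

def capitalized_tagger(doc: Document) -> None:
    """Mark capitalised tokens as candidate names (a deliberately naive engine)."""
    for start, end, t in list(doc["annotations"]):
        if t == "Token" and doc["text"][start].isupper():
            doc["annotations"].append((start, end, "NameCandidate"))

def run_pipeline(text: str, engines: list[Engine]) -> Document:
    doc: Document = {"text": text, "annotations": []}
    for engine in engines:   # each engine sees the work of its predecessors
        engine(doc)
    return doc

result = run_pipeline("IBM Research works on UIMA", [tokenizer, capitalized_tagger])
print([a for a in result["annotations"] if a[2] == "NameCandidate"])
# -> [(0, 3, 'NameCandidate'), (4, 12, 'NameCandidate'), (22, 26, 'NameCandidate')]
```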