Computational linguistics is the study of computer systems for understanding and generating natural language. In this volume we shall be particularly interested in the structure of such systems, and the design of algorithms for the various components of such systems.
Why should we be interested in such systems? Although the objectives of research in computational linguistics are widely varied, a primary motivation has always been the development of specific practical systems which involve natural language. Three classes of applications which have been central in the development of computational linguistics are
Machine translation. Work on machine translation began in the late 1950s with high hopes and little realization of the difficulties involved. Problems in machine translation stimulated work in both linguistics and computational linguistics, including some of the earliest parsers. Extensive work was done in the early 1960s, but a lack of success, and in particular a realization that fully-automatic high-quality translation would not be possible without fundamental work on text ‘understanding’, led to a cutback in funding. Only a few of the current projects in computational linguistics in the United States are addressed toward machine translation, although there are substantial projects in Europe and Japan (Slocum 1984, 1985; Tucker 1984).
Information retrieval. Because so much of the information we use appears in natural language form – books, journals, reports – another application in which interest developed was automatic information retrieval from natural language texts. […]
Up to now, we have restricted ourselves to determining the structure and meaning of individual sentences. Although we have used limited extrasentential information (for anaphora resolution), we have not examined the structure of entire texts. Yet the information conveyed by a text is clearly more than the sum of its parts – more than the meanings of its individual sentences. If a text tells a story, describes a procedure, or offers an argument, we must understand the connections between the component sentences in order to have fully understood the story. These connections are needed both per se (to answer questions about why an event occurred, for example) and to resolve ambiguities in the meanings of individual sentences. Discourse analysis is the study of these connections. Because these connections are usually implicit in the text, identifying them may be a difficult task.
As a simple example of the problems we face, consider the following brief description of a naval encounter:
Just before dawn, the Valiant sighted the Zwiebel and fired two torpedoes. It sank swiftly, leaving few survivors.
The most evident linguistic problem we face is finding an antecedent for ‘it’. There are four candidates in the first sentence: ‘dawn’, ‘Valiant’, ‘Zwiebel’, and ‘torpedoes’. Semantic classification should enable us to exclude ‘dawn’ (*‘dawn sinks’), and number agreement will exclude ‘torpedoes’, but that still leaves us with two candidates: ‘the Valiant’ and ‘the Zwiebel’ (which are presumably both ships of some sort).
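The filtering just described amounts to a small candidate-elimination procedure. The sketch below is only an illustration of the idea, not a fragment of any actual system; the toy lexicon, the semantic class names, and the resolve_pronoun helper are assumptions made for the example.

```python
# A minimal sketch of pronoun-candidate filtering, assuming a toy lexicon.
# Candidate antecedents from the first sentence, with number and semantic class.
CANDIDATES = [
    {"word": "dawn",      "number": "singular", "sem_class": "time"},
    {"word": "Valiant",   "number": "singular", "sem_class": "ship"},
    {"word": "Zwiebel",   "number": "singular", "sem_class": "ship"},
    {"word": "torpedoes", "number": "plural",   "sem_class": "weapon"},
]

# Selectional restriction for intransitive 'sink': only physical objects
# such as ships (or weapons) can sink (*'dawn sinks').
SINKABLE_CLASSES = {"ship", "weapon"}

def resolve_pronoun(pronoun_number="singular"):
    """Return the candidates that survive number agreement and the
    semantic restriction imposed by the verb."""
    survivors = []
    for c in CANDIDATES:
        if c["number"] != pronoun_number:
            continue                    # number agreement excludes 'torpedoes'
        if c["sem_class"] not in SINKABLE_CLASSES:
            continue                    # semantic classification excludes 'dawn'
        survivors.append(c["word"])
    return survivors

print(resolve_pronoun())                # ['Valiant', 'Zwiebel'] -- still ambiguous
```

As the final line shows, number agreement and semantic classification still leave both ships as candidates; choosing between ‘the Valiant’ and ‘the Zwiebel’ is precisely the kind of problem that requires the discourse-level connections discussed above.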
As we noted in the first chapter, language generation has generally taken second place to language analysis in computational linguistics research. This imbalance reflects a basic property of language, namely, that there are many ways of saying the same thing. In order for a natural language interface to be fluent, it should be able to accept most possible paraphrases of the information or commands the user wishes to transmit. On the other hand, it will suffice to generate one form of each message the system wishes to convey to the user.
As a result, many systems have combined sophisticated language analysis procedures with rudimentary generation components. Often generation involves nothing more than ‘filling in the blanks’ in a set of predefined message formats. This has been adequate for the simple messages many systems need to express: values retrieved from a data base, error messages, instructions to the user.
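To make this ‘filling in the blanks’ style of generation concrete, the sketch below pairs a few predefined message formats with slot values; the formats and slot names are invented for illustration rather than taken from any particular system.

```python
# A sketch of 'filling in the blanks' generation.  The message formats and
# slot names are invented for illustration, not taken from any real system.
MESSAGE_FORMATS = {
    "db_value":    "The {attribute} of {entity} is {value}.",
    "error":       "Sorry, I could not parse the word '{word}'.",
    "instruction": "Please rephrase your question about {topic}.",
}

def generate(message_type, **slots):
    """Produce surface text by inserting slot values into a fixed format."""
    return MESSAGE_FORMATS[message_type].format(**slots)

print(generate("db_value", attribute="salary", entity="J. Smith", value="$32,000"))
print(generate("error", word="torpedos"))
```

The generation component needs essentially no linguistic knowledge here, which is why such a scheme suffices only for the simple, fixed messages just mentioned.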
More sophisticated systems, however, have more complex messages to convey. People querying a data base in natural language often begin by asking about the structure or general content of the data base rather than asking for specific data values (Malhotra 1975); we would like to extend natural language data base interfaces so that they can answer such questions. For systems employing lengthy sequences of inferences, such as those for medical diagnosis (e.g., Shortliffe 1976), user acceptance and system improvement depend critically on the ability of the system to explain its reasoning.
Syntax analysis performs two main functions in analyzing natural language input:
Determining the structure of the input. In particular, syntax analysis should identify the subject and objects of each verb and determine what each modifying word or phrase modifies. This is most often done by assigning a tree structure to the input, in a process referred to as parsing.
Regularizing the syntactic structure. Subsequent processing (i.e., semantic analysis) can be simplified if we map the large number of possible input structures into a smaller number of structures. For example, some material in sentences (enclosed in brackets in the examples below) can be omitted or ‘zeroed’:
John ate cake and Mary [ate] cookies.
… five or more [than five] radishes …
He talks faster than John [talks].
Sentence structure can be regularized by restoring such zeroed information. Other transformations can relate sentences with normal word order (‘I crushed those grapes. That I like wine is evident.’) to passive (‘Those grapes were crushed by me.’) and cleft (‘It is evident that I like wine.’) constructions, and can relate nominal (‘the barbarians' destruction of Rome’) and verbal (‘the barbarians destroyed Rome’) constructions. Such transformations will permit subsequent processing to concern itself with a much smaller number of structures. […]
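As a rough picture of what such regularization buys us, the fragment below maps the active and passive grape-crushing sentences onto the same operator-argument form. It is a sketch over hand-built clause structures, not the output of any parser discussed here.

```python
# A sketch of syntactic regularization, using hand-built clause structures
# (not the output of a real parser) for the grape-crushing examples above.

def regularize(clause):
    """Map a simple clause onto a canonical (verb, subject, object) form."""
    if clause.get("voice") == "passive":
        # 'Those grapes were crushed by me.'  ->  crush(I, those grapes)
        return {"verb": clause["verb"],
                "subject": clause["agent"],          # the 'by'-phrase
                "object": clause["surface_subject"]}
    return {"verb": clause["verb"],
            "subject": clause["surface_subject"],
            "object": clause["surface_object"]}

active  = {"voice": "active",  "verb": "crush",
           "surface_subject": "I", "surface_object": "those grapes"}
passive = {"voice": "passive", "verb": "crush",
           "surface_subject": "those grapes", "agent": "I"}

assert regularize(active) == regularize(passive)
print(regularize(passive))   # {'verb': 'crush', 'subject': 'I', 'object': 'those grapes'}
```

Subsequent semantic analysis then deals only with the canonical triple, whichever surface form appeared in the input.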
What is the objective of semantic analysis? We could say that it is to determine what a sentence means, but by itself this is not a very helpful answer. It may be more enlightening to say that, for declarative sentences, semantics seeks to determine the conditions under which a sentence is true or, almost equivalently, what the inference rules are among sentences of the language. Characterizing the semantics of questions and imperatives is a bit more problematic, but we can see the connection with declaratives by noting that, roughly speaking, questions are requests to be told whether a sentence is true (or to be told the values for which a certain sentence is true) and imperatives are requests to make a sentence true.
People who study natural language semantics find it desirable (or even necessary) to define a formal language with a simple semantics, thus changing the problem to one of determining the mapping from natural language into this formal language. What properties should this formal language have (which natural language does not)? It should
*be unambiguous
*have simple rules of interpretation and inference, and in particular
*have a logical structure determined by the form of the sentence
We shall examine some such languages, the languages of the various logics, shortly.
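To make the idea of such a mapping concrete before turning to particular logics, the toy fragment below pairs two sentences with formulas in a first-order style; the sentences, the notation, and the readings shown are chosen purely for illustration.

```python
# A toy illustration of mapping sentences into an unambiguous formal
# language (here, strings in a first-order style).  The sentences and the
# logical forms are chosen purely for the example.
LOGICAL_FORMS = {
    # An unambiguous sentence receives a single formula.
    "the barbarians destroyed Rome":
        ["destroy(barbarians, Rome)"],
    # A scope-ambiguous sentence receives one formula per reading, so the
    # ambiguity of the English sentence is made explicit.
    "every student read a book":
        ["forall x (student(x) -> exists y (book(y) & read(x, y)))",
         "exists y (book(y) & forall x (student(x) -> read(x, y)))"],
}

for sentence, readings in LOGICAL_FORMS.items():
    print(sentence)
    for formula in readings:
        print("   ", formula)
```

Note that the scope-ambiguous sentence receives two distinct formulas: the requirement that the formal language be unambiguous forces every reading of the English sentence to be spelled out separately.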
Of course, when we build a practical natural language system, our interest is generally not just in finding out whether sentences are true or false.
This paper starts by tracing the architecture of document preparation systems. Two basic types of document representation appear: at the page level or at the logical level. The paper then focuses on logical-level representations and tries to survey three existing formalisms: SGML, Interscript and ODA.
Introduction
Document preparation systems might now be the most commonly used computer systems, ranging from stand-alone text-processing machines to highly sophisticated systems running on mainframe computers. All of these systems internally use a more or less formal system for representing documents. Document representation formalisms differ widely according to their goals. Some of them define the interface with the printing device; they are oriented towards a precise geometric description of the contents of each page in a document. Others are used internally in systems as a memory representation. Yet others have to be learned by users; they are symbolic languages used to control document processing.
The trouble is that there are today nearly as many representation formalisms as document preparation systems. This makes it nearly impossible, first, to interchange documents among heterogeneous systems and, second, to have standard programming interfaces for developing systems. Standardization organizations and large companies are now trying to establish standards in the field in order to stop the proliferation of formalisms and facilitate document interchange.
The last sections of this paper focus on three document representation formalisms often called ‘revisable formats’, namely SGML [SGML], ODA [ODA], and Interscript [Ayers & al.], [Joloboff & al.]. In order to better understand what a revisable format is, the paper starts with a look at the evolution of the architecture of document preparation systems.
The paper presents the design of a document preparation system that allows users to make use of existing batch formatters and yet provides an interactive user interface with what-you-see-is-almost-what-you-get feedback.
Introduction
Increasing numbers of people are using computers for the preparation of documents. Many of these new computer users are not “computer types”; they have a problem (to produce a neatly formatted document), they know the computer can help them, and they want the result with a minimum of (perceived) fuss and bother. The terms in which they present the problem to the computer should be “theirs” – easy for them to use and understand and based on previous document experience.
Many powerful document preparation tools exist that are capable of producing high quality output. However, they are often awkward (some would say difficult) to use, especially for the novice or casual user, and a substantial amount of training is usually necessary before they can be used intelligently.
This paper presents the design of a document preparation system that allows users to make use of existing formatters and yet makes document entry relatively easy. The following topics are discussed:
the requirements and overall design for such a system, and
some of the issues to be resolved in constructing the system.
First, some terminology is clarified.
Terms and Concepts
We use Shaw's model for documents [Shaw80, Furuta82, Kimura84]. A document is viewed as a hierarchy of objects, where each object is an instance of a class that defines the possible components and other attributes of its instances. Typical (low-level) classes are document components such as sections, paragraphs, headings, footnotes, figures, and tables.
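A minimal sketch of this model, with invented class definitions rather than those of any system cited above, is given below.

```python
# A minimal sketch of the object model described above: each class names
# the components its instances may contain, and a document is a hierarchy
# of instances.  The class table is invented for illustration.
CLASSES = {
    "report":    {"components": ["title", "section"]},
    "section":   {"components": ["heading", "paragraph", "figure"]},
    "title":     {"components": []},
    "heading":   {"components": []},
    "paragraph": {"components": []},
    "figure":    {"components": []},
}

class DocObject:
    """One node in the document hierarchy: an instance of a class."""
    def __init__(self, cls, content=None):
        self.cls, self.content, self.children = cls, content, []

    def add(self, child):
        # The class definition, not the user, decides what may go where.
        if child.cls not in CLASSES[self.cls]["components"]:
            raise ValueError(f"a {self.cls} may not contain a {child.cls}")
        self.children.append(child)
        return child

doc = DocObject("report")
doc.add(DocObject("title", "On Document Models"))
sec = doc.add(DocObject("section"))
sec.add(DocObject("heading", "Introduction"))
sec.add(DocObject("paragraph", "A document is a hierarchy of objects ..."))
```

The check in add reflects the central point of the model: the class determines which components an instance may contain.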
For many years text preparation and document manipulation have been poor relations in the computing world, and it is only recently that they have taken their rightful place in the mainstream of computer research and development. Everyone has their own favourite reason for this change: word processors, workstations with graphics screens, nonimpact printers, or authors preparing their own manuscripts.
Whatever the reason, people in computing have suddenly found themselves using the same equipment and fighting the same problems as those in printing and publishing. It would be nice to say that we are all working happily together, but there are still plenty of disputes (which is healthy) and plenty of indifference (which is not). There is no doubt, however, that this coming together of different disciplines has brought new life and enthusiasm with it.
The international conference on Text Processing and Document Manipulation at Nottingham is not the first conference to focus on this field of computing. It follows in the footsteps of Research and Trends in Document Preparation Systems at Lausanne in 1981, the Symposium on Text Manipulation at Portland in 1981, La Manipulation de Documents at Rennes in 1983, and the recent PROTEXT conferences in Dublin. We hope, however, that it marks the beginning of a regular series of international conferences that will bring top researchers and practitioners together to exchange ideas and share their enthusiasm with a wide audience.
As the papers for this conference started to come in, a number of themes began to emerge. The dominant theme (in number of papers) was document structures for interactive editing.
Computer text processing is still in the assembly-language era, to use an analogy to program development. The low-level tools available have sufficient power, but control is lacking. The result is that documents produced with computer assistance are often of lower quality than those produced by hand: they look beautiful, but the content and organization suffer. Two promising ideas for correcting this situation are explored: (1) adapting methods of modern, high-level program development (stepwise refinement and iterative enhancement) to document preparation; (2) using a writing environment controlled by a rule-based editor, in which structure is enforced and mistakes more difficult to make.
Wonderful Appearance–Wretched Content
With the advent of relatively inexpensive laser printers, computer output is being routinely typeset. It can be expected that there will be a revolution in the way business and technical documents are created, based on the use of low-cost typesetters. Easy typesetting and graphics are an extension of word-processing capability, which is already widespread. The essential feature of word processing is its ability to quickly reproduce a changed document with mechanical perfection. However, as the appearance improves, the quality of writing seems to fall in proportion. Two forces are probably at work: (1) More people can (attempt to) write using better technology, and because writing is hard, novices often produce poor work. (2) With improved technology, projects are attempted that were previously avoided; now they are done, badly. These factors are familiar from programming, and suggest an analogy between creating a document and developing a program. The current word-processing situation corresponds to the undisciplined use of programming languages that preceded so-called “modern programming practices.”
This paper describes both the use and the implementation of W, an interactive text formatter. In W, a document is interactively defined as a hierarchy of nested components. Such a hierarchy may be system- or user-defined. The hierarchy is used both by the W full-screen editor and by the W formatting process, absolving the user from providing any layout commands as such. W manipulates text and such non-text items as mathematical formulae, and has provision for the inclusion of general graphical items.
Introduction
W is an interactive text-editor and document preparation facility being developed within the department of Computer Science at Manitoba. A working prototype of W, known as W-p, has been described elsewhere [King84]. W is a considerable development of that earlier system, but retains the same basic philosophy:
W is an interactive, extensible, integrated editor and formatter;
W adheres as closely as possible to the “what you see is what you get” (wysiwyg) philosophy;
W encompasses a wide range of document items, including text, tables, mathematical formulae, and provision for general graphical items;
W is portable and adaptable; that is, several versions of W are being produced to run on different architectures; although the user interface will differ in its detail, the underlying system will be common;
W is user extensible in a variety of ways.
The remainder of this paper is organised as follows. Section 2 describes W from the user's viewpoint and gives some details of its implementation. For the most part, it is a review of material which is covered in greater depth in [King84].
The paper presents a new system for automatic matching of bibliographic data corresponding to items of full textual electronic documents. The problem can otherwise be expressed as the identification of similar or duplicate items existing in different bibliographic databases. A primary objective is the design of an interactive system where matches and near misses are displayed on the user's terminal so that he can confirm or reject the identification before the associated full electronic versions are located and processed further.
Introduction
There is no doubt that ‘electronic publishing’ and other computer-based tools for the production and dissemination of printed material open up new horizons for efficient communication. The problems currently faced by the designers of such systems are enormous. One problem area is the identification of duplicate material, especially when there is more than one source generating similar documents. Abstracting is a good example here. Another problem area is the linkage between full-text and bibliographic databases.
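One way to picture the duplicate-identification task (the paper's own method is not reproduced here) is as a similarity score computed over a few bibliographic fields, with a band of ‘near misses’ reserved for the interactive confirmation step mentioned in the abstract. The fields, weights, and thresholds in the sketch below are assumptions made purely for illustration.

```python
# A sketch of duplicate identification as a similarity score over a few
# bibliographic fields, with a band of 'near misses' set aside for the
# interactive confirm-or-reject step described in the abstract.  The fields,
# weights and thresholds are assumptions made for illustration only.
from difflib import SequenceMatcher

def field_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2):
    """Weighted similarity over title, author and year."""
    score  = 0.6 * field_similarity(r1["title"],  r2["title"])
    score += 0.3 * field_similarity(r1["author"], r2["author"])
    score += 0.1 * (1.0 if r1["year"] == r2["year"] else 0.0)
    return score

def classify(r1, r2, match=0.9, near_miss=0.7):
    s = record_similarity(r1, r2)
    if s >= match:
        return "match"
    if s >= near_miss:
        return "near miss -- show to the user for confirmation"
    return "no match"

# The same item entered slightly differently in two databases.
a = {"title": "Automatic matching of bibliographic data",
     "author": "Smith, J.", "year": 1985}
b = {"title": "Automatic matching of bibliographic data.",
     "author": "J. Smith", "year": 1985}
print(classify(a, b))
```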
As part of its attempt to establish collaboration between different countries, the European Economic Community initiated the DOCDEL programme of research and a series of studies such as DOCOLSYS which investigate the present situation in Europe regarding document identification, location and ordering with particular reference to electronic ordering and delivery of documents.
The majority of DOCDEL systems likely to be developed fall under one of the following areas:
As the cost of paper and library space increases, so does the necessity for alternative forms of book storage. Computers seem the obvious answer, and much work has already been done on various on-line text reading and writing systems. These systems are very effective within their own domains, yet remain essentially for computer users rather than the ordinary man-in-the-street.
Real paper books may not actually be the best way of presenting information, but they are certainly the most familiar. It seems logical therefore to design a reading system that can be made more widely accessible because it resembles a real book as much as possible both in appearance and use – a sort of generic advance organiser [Ausubel60].
The system described here – VORTEXT – is an attempt to do precisely that.
How people read books
Books are rarely read completely linearly; mystery novels almost are, but how many people let their curiosity get the better of them and sneak a look at the last page to see whodunnit? A textbook is more likely to be dipped into in search of a particular section, and a journal article tends to be read in full only if the reader considers it useful and relevant after having read the title, then the abstract, conclusion and finally the references [Maude85, Line82].
“The printed article is well-adapted to speedy rejection – an inestimable virtue” [Line82]
An important goal of document preparation systems is that they be device-independent, which is to say that their output can be produced on a variety of printing devices. One way of achieving that goal is to devise a device-independent page description language, which can describe precisely the appearance of a formatted page, and to produce software that prints the required image on each variety of printer. Most attempts at device-independent page description languages have failed, resulting either in schemes that are only partially device-independent or in proclamations from researchers that device independence is a bad idea [2, 4].
A new generation of procedural page description languages promises a solution. The PostScript language, and to a lesser extent the Interpress language, offers a means of describing a printed page with an executable program; the page is printed by loading the program into the printer and running it.
Page Description Languages
An imaging device, such as a typesetter, laser printer, or display, must have some way of knowing what image it is being asked to show. The two traditional means of providing it with that information have been either to describe the image in terms of a bitmap or character map, or to describe it by means of a sequence of control commands to the imager's electronics.
The bitmap or character-map schemes are the simplest and oldest. For example, a line printer is provided with a character map (in this spot put this character, in that spot put that character, and so forth).
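The character-map scheme can be made concrete in a few lines; the page dimensions and the small interface below are invented for illustration and do not correspond to any particular device.

```python
# A toy character-map 'page': a grid of cells, each holding one character,
# which the device prints exactly as stored.  The dimensions and the small
# interface are invented for illustration.
COLS, ROWS = 80, 24

def blank_page():
    return [[" "] * COLS for _ in range(ROWS)]

def put(page, row, col, text):
    """In this spot put this character, in that spot put that character ..."""
    for i, ch in enumerate(text):
        page[row][col + i] = ch

def render(page):
    return "\n".join("".join(row) for row in page)

page = blank_page()
put(page, 0, 27, "PAGE DESCRIPTION LANGUAGES")
put(page, 2, 0, "The character-map scheme is the simplest and oldest.")
print(render(page))
```

A procedural language such as PostScript replaces this fixed grid of spots with a program that the printer loads and runs to produce the page, which is how the new generation of languages described above aims to achieve device independence.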
An advanced catalogue production system is described which has three elements: creating and structuring a database; assembling or transforming data; and publication. The major points examined are the design of the system so that compilers of information can access and update it from various starting points, the use of dictionaries for multiple-language publications, and the use of publication parameters to allow output on different devices. Also emphasized is the use of publishing tools which enable subject and marketing experts to maintain direct control over the publication process.
Introduction
This paper deals with a recently completed project to develop a publishing system which provides users with responsive methods for the collection of information, and flexible ways for producing different publications.
It concerns a major supplier of replacement car parts which uses a range of publications to enable dealers and individuals to identify parts to fit cars. While it is obviously a specialised application, it represents an important market sector and also demonstrates a number of issues concerning the structuring of data and passing control directly to the users.
Context
Unipart is the largest supplier in Europe of replacement car parts and accessories for all makes of cars. It has been a leader in the development of systems for the distribution of product information, and has been using computerised publication systems for a number of years. As an indication of scale, it provides 12,000 parts for 3,500 vehicles and produces 135 major publications.