Up to now, we have restricted ourselves to determining the structure and meaning of individual sentences. Although we have used limited extrasentential information (for anaphora resolution), we have not examined the structure of entire texts. Yet the information conveyed by a text is clearly more than the sum of its parts – more than the meanings of its individual sentences. If a text tells a story, describes a procedure, or offers an argument, we must understand the connections between the component sentences in order to have fully understood the story. These connections are needed both per se (to answer questions about why an event occurred, for example) and to resolve ambiguities in the meanings of individual sentences. Discourse analysis is the study of these connections. Because these connections are usually implicit in the text, identifying them may be a difficult task.
As a simple example of the problems we face, consider the following brief description of a naval encounter:
Just before dawn, the Valiant sighted the Zwiebel and fired two torpedoes. It sank swiftly, leaving few survivors.
The most evident linguistic problem we face is finding an antecedent for ‘it’. There are four candidates in the first sentence: ‘dawn’, ‘Valiant’, ‘Zwiebel’, and ‘torpedoes’. Semantic classification should enable us to exclude ‘dawn’ (*‘dawn sinks’), and number agreement will exclude ‘torpedoes’, but that still leaves us with two candidates: ‘the Valiant’ and ‘the Zwiebel’ (which are presumably both ships of some sort).
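The two filtering steps just described, semantic classification and number agreement, can be sketched in a few lines of code. This is an illustrative toy, not any particular system's implementation; the candidate nouns, their semantic classes, and their number features are hand-coded here, where a real system would derive them from a parse and a lexicon.

```python
# Candidates for the antecedent of 'it' in the example text, with
# hand-coded (illustrative) semantic class and number features.
CANDIDATES = [
    {"word": "dawn",      "sem_class": "time",   "number": "singular"},
    {"word": "Valiant",   "sem_class": "vessel", "number": "singular"},
    {"word": "Zwiebel",   "sem_class": "vessel", "number": "singular"},
    {"word": "torpedoes", "sem_class": "weapon", "number": "plural"},
]

def filter_antecedents(candidates, pronoun_number, verb_selects):
    """Keep candidates that agree in number with the pronoun and whose
    semantic class satisfies the verb's selectional restriction."""
    return [c["word"] for c in candidates
            if c["number"] == pronoun_number
            and c["sem_class"] in verb_selects]

# 'It sank': the subject of 'sink' must be something that can sink
# (here, a vessel), and 'it' is singular.
remaining = filter_antecedents(CANDIDATES, "singular", {"vessel"})
print(remaining)  # ['Valiant', 'Zwiebel'] -- discourse knowledge must decide
```

As the output shows, sentence-level constraints narrow the field but cannot choose between the two ships; that is exactly the residue that discourse analysis must resolve.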
As we noted in the first chapter, language generation has generally taken second place to language analysis in computational linguistics research. This imbalance reflects a basic property of language, namely, that there are many ways of saying the same thing. In order for a natural language interface to be fluent, it should be able to accept most possible paraphrases of the information or commands the user wishes to transmit. On the other hand, it will suffice to generate one form of each message the system wishes to convey to the user.
As a result, many systems have combined sophisticated language analysis procedures with rudimentary generation components. Often generation involves nothing more than ‘filling in the blanks’ in a set of predefined message formats. This has been adequate for the simple messages many systems need to express: values retrieved from a data base, error messages, instructions to the user.
More sophisticated systems, however, have more complex messages to convey. People querying a data base in natural language often begin by asking about the structure or general content of the data base rather than asking for specific data values (Malhotra 1975); we would like to extend natural language data base interfaces so that they can answer such questions. For systems employing lengthy sequences of inferences, such as those for medical diagnosis (e.g., Shortliffe 1976), user acceptance and system improvement depend critically on the ability of the system to explain its reasoning.
Syntax analysis performs two main functions in analyzing natural language input:
Determining the structure of the input. In particular, syntax analysis should identify the subject and objects of each verb and determine what each modifying word or phrase modifies. This is most often done by assigning a tree structure to the input, in a process referred to as parsing.
Regularizing the syntactic structure. Subsequent processing (i.e., semantic analysis) can be simplified if we map the large number of possible input structures into a smaller number of structures. For example, some material in sentences (enclosed in brackets in the examples below) can be omitted or ‘zeroed’:
John ate cake and Mary [ate] cookies.
… five or more [than five] radishes …
He talks faster than John [talks].
Sentence structure can be regularized by restoring such zeroed information. Other transformations can relate sentences with normal word order (‘I crushed those grapes. That I like wine is evident.’) to passive (‘Those grapes were crushed by me.’) and cleft (‘It is evident that I like wine.’) constructions, and can relate nominal (‘the barbarians' destruction of Rome’) and verbal (‘the barbarians destroyed Rome’) constructions. Such transformations will permit subsequent processing to concern itself with a much smaller number of structures. […]
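One of the transformations above, relating the passive to normal word order, can be illustrated on a toy clause representation. The dictionary structure below is invented for the example and is not any system's actual parse representation.

```python
# A toy illustration of regularization: map a passive clause to its
# active-order equivalent, so later stages see only one structure.

def regularize(parse):
    """Map a passive clause to the normal subject-verb-object order."""
    if parse.get("voice") == "passive":
        return {
            "voice": "active",
            "subject": parse["agent"],   # the 'by'-phrase becomes subject
            "verb": parse["verb"],
            "object": parse["subject"],  # surface subject becomes object
        }
    return parse  # already in normal order

# 'Those grapes were crushed by me.'
passive = {"voice": "passive", "subject": "those grapes",
           "verb": "crush", "agent": "I"}
print(regularize(passive))
# {'voice': 'active', 'subject': 'I', 'verb': 'crush',
#  'object': 'those grapes'}
```

Each transformation of this kind removes one surface variation, so the semantic component need only handle the regularized form.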
What is the objective of semantic analysis? We could say that it is to determine what a sentence means, but by itself this is not a very helpful answer. It may be more enlightening to say that, for declarative sentences, semantics seeks to determine the conditions under which a sentence is true or, almost equivalently, what the inference rules are among sentences of the language. Characterizing the semantics of questions and imperatives is a bit more problematic, but we can see the connection with declaratives by noting that, roughly speaking, questions are requests to be told whether a sentence is true (or to be told the values for which a certain sentence is true) and imperatives are requests to make a sentence true.
People who study natural language semantics find it desirable (or even necessary) to define a formal language with a simple semantics, thus changing the problem to one of determining the mapping from natural language into this formal language. What properties should this formal language have (which natural language does not)? It should
*be unambiguous
*have simple rules of interpretation and inference, and in particular
*have a logical structure determined by the form of the sentence
We shall examine some such languages, the languages of the various logics, shortly.
Of course, when we build a practical natural language system, our interest is generally not just in finding out whether sentences are true or false.
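The truth-conditional view sketched above can be made concrete with a miniature model-theoretic evaluator. The predicate names and the tiny model below are invented for illustration; the point is only that, once a sentence is mapped to a formal expression, its truth can be computed relative to a model.

```python
# A miniature model: each predicate name maps to its extension
# (a set of individuals, or of tuples for relations).
MODEL = {
    "wine": {"w1"},
    "likes": {("speaker", "w1")},
}

def holds(pred, *args):
    """True iff the tuple of arguments is in the predicate's extension."""
    entry = MODEL.get(pred, set())
    return (args if len(args) > 1 else args[0]) in entry

# 'I like wine' -> roughly: exists x. wine(x) and likes(speaker, x)
def i_like_wine():
    return any(holds("likes", "speaker", x) for x in MODEL["wine"])

print(i_like_wine())  # True in this model
```

A question such as 'Do I like wine?' then becomes a request to evaluate the same expression, illustrating the connection between declaratives and questions noted above.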
This paper starts by tracing the architecture of document preparation systems. Two basic types of document representation appear: at the page level and at the logical level. The paper then focuses on logical-level representations and surveys three existing formalisms: SGML, Interscript and ODA.
Introduction
Document preparation systems may now be the most commonly used computer systems, ranging from stand-alone text processing machines to highly sophisticated systems running on mainframe computers. All of these systems internally use a more or less formal system for representing documents. Document representation formalisms differ widely according to their goals. Some define the interface with the printing device; they are oriented towards a precise geometric description of the contents of each page in a document. Others are used internally in systems as a memory representation. Yet others have to be learned by users; they are symbolic languages used to control document processing.
The trouble is that there are today nearly as many representation formalisms as there are document preparation systems. This makes it nearly impossible, first, to interchange documents among heterogeneous systems and, second, to have standard programming interfaces for developing systems. Standardization organizations and large companies are now trying to establish standards in the field in order to stop the proliferation of formalisms and facilitate document interchange.
This paper focuses in its last sections on three document representation formalisms often called ‘revisable formats’, namely SGML [SGML], ODA [ODA], and Interscript [Ayers & al.], [Joloboff & al.]. In order to better understand what a revisable format is, the paper starts with a look at the evolution of the architecture of document preparation systems.
The paper presents the design of a document preparation system that allows users to make use of existing batch formatters and yet provides an interactive user interface with what-you-see-is-almost-what-you-get feedback.
Introduction
Increasing numbers of people are using computers for the preparation of documents. Many of these new computer users are not “computer types”; they have a problem (to produce a neatly formatted document), they know the computer can help them, and they want the result with a minimum of (perceived) fuss and bother. The terms in which they present the problem to the computer should be “theirs” – easy for them to use and understand and based on previous document experience.
Many powerful document preparation tools exist that are capable of producing high quality output. However, they are often awkward (some would say difficult) to use, especially for the novice or casual user, and a substantial amount of training is usually necessary before they can be used intelligently.
This paper presents the design of a document preparation system that allows users to make use of existing formatters and yet makes document entry relatively easy. The following topics are discussed:
the requirements and overall design for such a system, and
some of the issues to be resolved in constructing the system.
First, some terminology is clarified.
Terms and Concepts
We use Shaw's model for documents [Shaw80, Furuta82, Kimura84]. A document is viewed as a hierarchy of objects, where each object is an instance of a class that defines the possible components and other attributes of its instances. Typical (low-level) classes are document components such as sections, paragraphs, headings, footnotes, figures, and tables.
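The hierarchy-of-objects model just described can be sketched in code. The class table below is an invented illustration of the idea that a class constrains the possible components of its instances; it is not taken from Shaw's papers.

```python
# An illustrative sketch of a document as a hierarchy of typed objects.
# Which classes may appear inside which is a hypothetical example table.
ALLOWED_CHILDREN = {
    "document": {"section"},
    "section": {"heading", "paragraph", "figure", "table", "footnote"},
    "heading": set(),
    "paragraph": set(),
}

class DocObject:
    def __init__(self, cls, content=""):
        self.cls = cls
        self.content = content
        self.children = []

    def add(self, child):
        """Attach a component, enforcing the class's constraints."""
        if child.cls not in ALLOWED_CHILDREN.get(self.cls, set()):
            raise ValueError(f"a {self.cls} may not contain a {child.cls}")
        self.children.append(child)
        return child

doc = DocObject("document")
sec = doc.add(DocObject("section"))
sec.add(DocObject("heading", "Terms and Concepts"))
sec.add(DocObject("paragraph", "We use Shaw's model for documents..."))
```

Because the constraints live in the class definitions, an editor built on such a model can refuse ill-formed structures at entry time rather than discovering them at formatting time.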
For many years text preparation and document manipulation have been poor relations in the computing world, and it is only recently that they have taken their rightful place in the mainstream of computer research and development. Everyone has their own favourite reason for this change: word processors, workstations with graphics screens, nonimpact printers, or authors preparing their own manuscripts.
Whatever the reason, people in computing have suddenly found themselves using the same equipment and fighting the same problems as those in printing and publishing. It would be nice to say that we are all working happily together, but there are still plenty of disputes (which is healthy) and plenty of indifference (which is not). There is no doubt, however, that this coming together of different disciplines has brought new life and enthusiasm with it.
The international conference on Text Processing and Document Manipulation at Nottingham is not the first conference to focus on this field of computing. It follows in the footsteps of Research and Trends in Document Preparation Systems at Lausanne in 1981, the Symposium on Text Manipulation at Portland in 1981, La Manipulation de Documents at Rennes in 1983, and the recent PROTEXT conferences in Dublin. We hope, however, that it marks the beginning of a regular series of international conferences that will bring top researchers and practitioners together to exchange ideas and share their enthusiasm with a wide audience.
As the papers for this conference started to come in, a number of themes began to emerge. The dominant theme (in number of papers) was document structures for interactive editing.
Computer text processing is still in the assembly-language era, to use an analogy to program development. The low-level tools available have sufficient power, but control is lacking. The result is that documents produced with computer assistance are often of lower quality than those produced by hand: they look beautiful, but the content and organization suffer. Two promising ideas for correcting this situation are explored: (1) adapting methods of modern, high-level program development (stepwise refinement and iterative enhancement) to document preparation; (2) using a writing environment controlled by a rule-based editor, in which structure is enforced and mistakes more difficult to make.
Wonderful Appearance–Wretched Content
With the advent of relatively inexpensive laser printers, computer output is being routinely typeset. It can be expected that there will be a revolution in the way business and technical documents are created, based on the use of low-cost typesetters. Easy typesetting and graphics are an extension of word-processing capability, which is already widespread. The essential feature of word processing is its ability to reproduce a changed document quickly and with mechanical perfection. However, as the appearance improves, the quality of writing seems to fall in proportion. Two forces are probably at work: (1) more people can (attempt to) write using better technology, and because writing is hard, novices often produce poor work; (2) with improved technology, projects are attempted that were previously avoided; now they are done, badly. These factors are familiar from programming, and suggest an analogy between creating a document and developing a program. The current word-processing situation corresponds to the undisciplined use of programming languages that preceded so-called “modern programming practices.”
This paper describes both the use and the implementation of W, an interactive text formatter. In W, a document is interactively defined as a hierarchy of nested components. Such a hierarchy may be system- or user-defined. The hierarchy is used both by the W full-screen editor and by the W formatting process, absolving the user from providing any layout commands as such. W manipulates text and such non-text items as mathematical formulae, and has provision for the inclusion of general graphical items.
Introduction
W is an interactive text-editor and document preparation facility being developed within the department of Computer Science at Manitoba. A working prototype of W, known as W-p, has been described elsewhere [King84]. W is a considerable development of that earlier system, but retains the same basic philosophy:
W is an interactive, extensible, integrated editor and formatter;
W adheres as closely as possible to the “what you see is what you get” (wysiwyg) philosophy;
W encompasses a wide range of document items, including text, tables, mathematical formulae, and provision for general graphical items;
W is portable and adaptable; that is, several versions of W are being produced to run on different architectures; although the user interface will differ in its detail, the underlying system will be common;
W is user extensible in a variety of ways.
The remainder of this paper is organised as follows. Section 2 describes W from the user's viewpoint and gives some details of its implementation. For the most part, it is a review of material which is covered in greater depth in [King84].
The paper presents a new system for automatic matching of bibliographic data corresponding to items of full textual electronic documents. The problem can otherwise be expressed as the identification of similar or duplicate items existing in different bibliographic databases. A primary objective is the design of an interactive system where matches and near misses are displayed on the user's terminal so that he can confirm or reject the identification before the associated full electronic versions are located and processed further.
Introduction
There is no doubt that ‘electronic publishing’ and other computer based tools for the production and dissemination of printed material open up new horizons for efficient communication. The problems currently faced by the designers of such systems are enormous. One problem area is the identification of duplicate material especially when there is more than one source generating similar documents. Abstracting is a good example here. Another problem area is the linkage between full text and bibliographic databases.
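One simple way to surface candidate duplicates of the kind described above is to normalize each record's title and measure word overlap. The sketch below is a hedged illustration of that general idea, using an arbitrary Jaccard-style score; it is not the matching method of the system described in this paper.

```python
import re

def normalize(title):
    """Reduce a title to a set of lower-case alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def similarity(a, b):
    """Word-overlap score in [0, 1] between two titles (Jaccard index)."""
    wa, wb = normalize(a), normalize(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Two records for the same item, drawn from different databases.
rec1 = "Text Processing and Document Manipulation"
rec2 = "Text processing & document manipulation."
print(similarity(rec1, rec2))  # 0.8 -- a near miss worth showing the user
```

Scores below 1.0 but above some threshold correspond to the ‘near misses’ that an interactive system would display for the user to confirm or reject.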
As part of its attempt to establish collaboration between different countries, the European Economic Community initiated the DOCDEL programme of research and a series of studies such as DOCOLSYS which investigate the present situation in Europe regarding document identification, location and ordering with particular reference to electronic ordering and delivery of documents.
The majority of DOCDEL systems likely to be developed fall under one of the following areas:
As the cost of paper and library space increases, so does the necessity for alternative forms of book storage. Computers seem the obvious answer, and much work has already been done on various on-line text reading and writing systems. These systems are very effective within their own domains, yet remain essentially for computer users rather than the ordinary man-in-the-street.
Real paper books may not actually be the best way of presenting information, but they are certainly the most familiar. It seems logical therefore to design a reading system that can be made more widely accessible because it resembles a real book as much as possible both in appearance and use – a sort of generic advance organiser [Ausubel60].
The system described here – VORTEXT – is an attempt to do precisely that.
How people read books
Books are rarely read completely linearly; mystery novels almost are, but how many people let their curiosity get the better of them and sneak a look at the last page to see whodunnit? A textbook is more likely to be dipped into in search of a particular section, and a journal article tends to be read in full only if the reader considers it useful and relevant after having read the title, then the abstract, conclusion and finally the references [Maude85, Line82].
“The printed article is well-adapted to speedy rejection – an inestimable virtue” [Line82]
An important goal of document preparation systems is that they be device-independent, which is to say that their output can be produced on a variety of printing devices. One way of achieving that goal is to devise a device-independent page description language, which can describe precisely the appearance of a formatted page, and to produce software that prints the required image on each variety of printer. Most attempts at device-independent page description languages have failed, resulting either in schemes that are only partially device-independent or in proclamations from researchers that device independence is a bad idea [2, 4].
A new generation of procedural page description languages promises a solution. The PostScript language, and to a lesser extent the Interpress language, offers a means of describing a printed page with an executable program; the page is printed by loading the program into the printer and running it.
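The ‘page as program’ idea can be made concrete with a minimal example: the page below is described by a tiny PostScript program, built here as a Python string. The coordinates, font, and text are arbitrary choices for the illustration.

```python
def hello_page(text):
    """Build a minimal PostScript program that paints one line of text.
    The caller must not include unescaped parentheses in `text`."""
    return "\n".join([
        "%!PS",                                      # PostScript header
        "/Helvetica findfont 12 scalefont setfont",  # choose a font
        "72 720 moveto",                             # one inch in, near the top
        f"({text}) show",                            # paint the text
        "showpage",                                  # emit the finished page
    ])

program = hello_page("Device independence via an executable page")
print(program)
```

Sending this program to any PostScript printer produces the same page, which is precisely the device independence the procedural approach promises: the printer runs the program, and only the interpreter is device-specific.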
Page Description Languages
An imaging device, such as a typesetter, laser printer, or display, must have some way of knowing what image it is being asked to show. The two traditional means of providing it with that information have been to describe the image to the imager in terms of a bit map or character map, or to describe it by means of a sequence of control commands to the imager's electronics.
The bit-map and character-map schemes are the simplest and oldest. For example, a line printer is provided with a character map (in this spot put this character, in that spot put that character, and so forth).
An advanced catalogue production system is described which has three elements: creating and structuring a database; assembling or transforming data; and publication. The major points examined are the design of the system so that compilers of information can access and update it from various starting points, the use of dictionaries for multiple-language publications, and the use of publication parameters to allow output on different devices. Also emphasized is the use of publishing tools which enable subject and marketing experts to maintain direct control over the publication process.
Introduction
This paper deals with a recently completed project to develop a publishing system which provides users with responsive methods for the collection of information, and flexible ways for producing different publications.
It concerns a major supplier of replacement car parts which uses a range of publications to enable dealers and individuals to identify parts to fit cars. While it is obviously a specialised application, it represents an important market sector and also demonstrates a number of issues concerning the structuring of data and passing control directly to the users.
Context
Unipart is the largest supplier in Europe of replacement car parts and accessories for all makes of cars. It has been a leader in the development of systems for the distribution of product information, and has been using computerised publication systems for a number of years. As an indication of scale, it provides 12,000 parts for 3,500 vehicles and produces 135 major publications.
This paper describes the BIOSTATION, a generalized document preparation system developed to guide the interactive editing of biological sequences by taking their semantics into account. The paper also focuses on the use of a document preparation system as the mediator for a larger application.
Introduction
The BIOSTATION is a generalized document preparation system, developed for the CRBM** and in use since May 85, able to guide the interactive editing of biological sequences by taking their semantics into account. These semantics are extracted at editing time from the document itself by an integrated expert system, and are used to express the document's structure. This paper also focuses on the use of a document preparation system as the mediator for a larger application.
In this approach, genetic sequences are viewed as generalized documents. This choice makes it possible to associate convenient, and therefore more legible, visual representations with the abstract aspects of the semantics of biological sequences.
We first explain how semantic information on the sequences is obtained and used to guide editing. The BIOSTATION architecture is presented in the second section.
Problem statement
The genetic information which allows organic cells to synthesize proteins is kept in genes. Genes are linear strings built from four types of molecules (Adenine, Thymine, Guanine, Cytosine) called nucleotides. Non-biologist readers can refer to [Hélène 84]. The strings studied can be up to 30,000 nucleotides long.
A gene can be analysed by biologists to make its formula explicit as a word over the alphabet {A, T, G, C}, and operations can be performed on the gene (in vitro or in vivo) to modify it by inserting or deleting parts at precise positions.
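Treating a gene's formula as a word over {A, T, G, C} makes the insertion and deletion operations above simple string edits. The sketch below is illustrative only and is not the BIOSTATION's internal representation; the example sequence is invented.

```python
ALPHABET = set("ATGC")

def check(seq):
    """Verify that seq is a word over {A, T, G, C}."""
    assert set(seq) <= ALPHABET, "not a word over {A, T, G, C}"
    return seq

def insert(seq, pos, fragment):
    """Insert a fragment at a precise position."""
    return check(seq[:pos] + check(fragment) + seq[pos:])

def delete(seq, pos, length):
    """Delete `length` nucleotides starting at `pos`."""
    return check(seq[:pos] + seq[pos + length:])

gene = check("ATGGCCATTG")          # a hypothetical sequence
gene = insert(gene, 3, "TAG")       # -> 'ATGTAGGCCATTG'
gene = delete(gene, 0, 3)           # -> 'TAGGCCATTG'
print(gene)
```

An editor that guides such operations can, as the paper describes, check the edits against semantic constraints extracted from the sequence itself rather than treating it as raw text.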
This paper presents a comprehensive survey of the typographic issues for laying out information within two-dimensional tables. Early typesetting systems formatted tables by coding the table style and layout into the program, and later systems provided a limited range of typographic features. The typographic issues include table structure, alignment of rows and columns simultaneously, formatting styles, treatment of whitespace within a table, graphical embellishments, placement of footnotes, various readability issues, and the problems of breaking large tables. Extending the table formatting problem to both page layout and arrangement of mathematical notation is highlighted, as is the need for interactive design tools for table layout.
Introduction
This paper presents a comprehensive survey of the typographic issues for laying out information within a two-dimensional table. Tables are a concentrated form of the more general layout problem; one can find table formatting analogies in both the larger-scale problem of page makeup, and the smaller-scale problem of aligning notation within a mathematical equation.
Few table formatting tools have addressed all the issues raised by this paper. In fact, it was a challenge to identify the various issues that typographers, compositors, and graphic designers have managed with great skill through the traditional graphic arts processes. Thus this paper provides a checklist for the designs and implementations of new table formatting tools, algorithms, and structures.
Document storage and retrieval systems should possess fast string search capabilities. The access paths needed to reduce the search times require substantial amounts of storage in addition to the very large storage requirements for the documents themselves. In this paper we investigate a technique that supports access paths on compressed documents, so that the total storage requirements for the access paths and the compressed documents are less than that for the original documents.
Introduction
Advances in hardware technology are unlikely to keep pace with the increasing growth of on-line document storage. In an environment where the trend is towards local and wide area networks (there is the promise of an interconnected society around the corner), a large number of documents will be transmitted between nodes. The storage of documents, and their communication along network paths and between peripherals and processors, require, if a satisfactory service is to be provided at reasonable cost, that documents be held more compactly than at present. Natural language being highly redundant, a suitable encoding scheme could be utilized, with any resultant compression reducing both storage and communication costs. In an on-line environment the compression and decompression schemes must not involve excessive overheads in either time or space; since documents need to be compressed only once for storage but are decompressed (or retrieved) more often, higher levels of overhead can be tolerated during the compression stage.
Document retrieval requires fast string search capabilities, and it is usual to provide additional access paths to reduce the search times e.g. by providing inverted lists on words. In [Goyal83] a scheme was proposed that made use of inverted indexes associated with compressed documents.
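The combination of word-level compression with an inverted index can be illustrated with a toy: assign each distinct word a short integer code, store documents as code sequences, and keep an index from words to the documents containing them. The coding below is a deliberately naive stand-in; [Goyal83]'s actual scheme is not reproduced here.

```python
def build(docs):
    """Compress documents by word coding and build an inverted index.
    Returns (codebook, compressed documents, word -> doc-id index)."""
    codebook, compressed, index = {}, [], {}
    for doc_id, text in enumerate(docs):
        codes = []
        for w in text.lower().split():
            codes.append(codebook.setdefault(w, len(codebook)))
            index.setdefault(w, set()).add(doc_id)
        compressed.append(codes)
    return codebook, compressed, index

docs = ["the cat sat", "the dog sat down"]
codebook, compressed, index = build(docs)
print(index["sat"])   # both documents contain 'sat'
print(compressed[0])  # the first document as integer codes
```

Because the index keys are the same words used in the codebook, searches can be answered over the compressed store directly, which is the property that lets the access paths and compressed documents together stay smaller than the original text.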