Two important recent trends in natural language generation are (i) probabilistic techniques and (ii) comprehensive approaches that move away from traditional strictly modular and sequential models. This paper reports experiments in which pCRU – a generation framework that combines probabilistic generation methodology with a comprehensive model of the generation space – was used to semi-automatically create five different versions of a weather forecast generator. The generators were evaluated in terms of output quality, development time and computational efficiency against (i) human forecasters, (ii) a traditional handcrafted pipelined NLG system and (iii) a HALogen-style statistical generator. The most striking result is that despite acquiring all decision-making abilities automatically, the best pCRU generators produce outputs of high enough quality to be scored more highly by human judges than forecasts written by experts.
Feature unification in parsing has previously used either inefficient Prolog programs, or LISP programs implementing early pre-WAM Prolog models of unification involving searches of binding lists and the copying of rules to generate edges: features within rules and edges have traditionally been expressed as lists or functions, with clarity being preferred to speed of processing. As a result, parsing takes about 0.5 seconds for a 7-word sentence. Our earlier work produced an optimised chart parser for a non-unification context-free grammar that achieved 5 ms parses, with high-ambiguity sentences involving hundreds of edges, using the grammar and sentences from Tomita's work on shift-reduce parsing with multiple stack branches. A parallel logic card was designed that would speed this up by a further factor of at least 17. The current paper extends this parser to treat a much more complex unification grammar with structures, using extensive indexing of rules and edges and the optimisations of top-down filtering and look-ahead, to demonstrate where unification occurs during parsing. Unification in parsing is distinguished from that in Prolog, and four alternative schemes for storing features and performing unification are considered, including the traditional binding-list method and three other methods optimised for speed, for which overall unification times are calculated. Parallelisation of unification using cheap logic hardware is considered, and estimates show that unification will only negligibly increase the parse time of our parallel parser card. Preliminary results are reported from a prototype serial parser that uses the fourth, most efficient, unification method and achieves 7 ms for 7-word sentences, and under 1 s for a 36-word, 360-way ambiguous sentence with 10,000 edges, on a conventional workstation.
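For readers unfamiliar with the binding-list scheme that the faster methods are measured against, the following Python fragment is a minimal sketch, assuming flat feature sets and '?'-prefixed variables (both assumptions of this illustration, not of the paper): every variable lookup searches the binding list, which is exactly the overhead the optimised schemes aim to avoid.

def deref(term, bindings):
    """Follow variable bindings until a value or an unbound variable is reached."""
    while isinstance(term, str) and term.startswith('?') and term in bindings:
        term = bindings[term]
    return term

def unify_features(f1, f2, bindings):
    """Unify two flat feature dicts, e.g. {'cat': 'np', 'num': '?n'}.
    Returns an extended binding list, or None on failure."""
    bindings = dict(bindings)          # copy so failure leaves the caller's bindings intact
    for feat in set(f1) | set(f2):
        a = deref(f1.get(feat, '?_' + feat), bindings)   # a missing feature acts as a fresh variable
        b = deref(f2.get(feat, '?_' + feat), bindings)
        if a == b:
            continue
        if isinstance(a, str) and a.startswith('?'):
            bindings[a] = b            # bind variable to value (or to the other variable)
        elif isinstance(b, str) and b.startswith('?'):
            bindings[b] = a
        else:
            return None                # two distinct constants: unification fails
    return bindings

# Example: an NP rule requiring number agreement, unified against a singular NP edge.
rule_np = {'cat': 'np', 'num': '?n'}
edge_np = {'cat': 'np', 'num': 'sing'}
print(unify_features(rule_np, edge_np, {}))   # {'?n': 'sing'}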
Artificial languages for person-machine communication seldom display the most characteristic properties of natural languages, such as the use of anaphoric or other referring expressions, or ellipsis. This paper argues that good use could be made of such devices in artificial languages, and proposes a mechanism for resolving ellipsis and anaphora in them using finite state transduction techniques. This yields an interpretation system with many desirable properties: it is easy to implement, efficient, incremental and reversible.
Linguists in general, and computational linguists in particular, do well to employ finite state devices wherever possible. They are theoretically appealing because they are computationally weak and best understood from a mathematical point of view. They are computationally appealing because they make for simple, elegant, and highly efficient implementations. In this paper, I hope I have shown how they can be applied to a problem… which seems initially to require heavier machinery.
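As a rough illustration of the finite-state flavour of such a resolution mechanism (a sketch only; the toy command language, register names and rules below are invented and are not taken from the paper), a single left-to-right pass with a fixed set of registers can expand pronouns and elliptical commands:

def resolve(utterances):
    registers = {}                                 # last-seen filler for each syntactic slot
    resolved = []
    for tokens in utterances:
        out = []
        for tok in tokens:
            if tok == 'it' and 'object' in registers:
                tok = registers['object']          # anaphora: replace the pronoun
            out.append(tok)
        if len(out) == 1 and out[0] == 'again':    # ellipsis: repeat the last verb and object
            out = [registers.get('verb', ''), registers.get('object', '')]
        if len(out) >= 2:                          # update registers (toy grammar: VERB OBJECT)
            registers['verb'], registers['object'] = out[0], out[1]
        resolved.append(out)
    return resolved

print(resolve([['delete', 'file1'], ['print', 'it'], ['again']]))
# [['delete', 'file1'], ['print', 'file1'], ['print', 'file1']]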
Text-to-speech systems are currently designed to work on complete sentences and paragraphs, thereby allowing front-end processors access to large amounts of linguistic context. Problems with this design arise when applications require text to be synthesized in near real time, as it is being typed. How does the system decide which incoming words should be collected and synthesized as a group when prior and subsequent word groups are unknown? We describe a rule-based parser that uses a three-cell buffer and phrasing rules to identify break points for incoming text. Words up to the break point are synthesized as new text is moved into the buffer; no hierarchical structure is built beyond the lexical level. The parser was developed for use in a system that synthesizes written telecommunications produced by Deaf and hard-of-hearing people. These are texts written entirely in upper case, with little or no punctuation, and using a nonstandard variety of English (e.g. WHEN DO I WILL CALL BACK YOU). The parser performed well in a three-month field trial involving tens of thousands of texts. Laboratory tests indicate that the parser exhibited a low error rate when compared with a human reader.
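The following Python sketch illustrates the buffering idea under invented phrasing rules and word lists (the deployed parser's actual rules and categories are not reproduced here): a three-cell lookahead buffer over incoming words, with a break placed only when the rules license one.

FUNCTION_WORDS = {'THE', 'A', 'AN', 'TO', 'OF'}      # illustrative: never break right after these
BREAK_WORDS = {'WHEN', 'AND', 'BUT', 'PLS', 'Q'}     # illustrative clause markers

def break_before(buffer):
    """May a prosodic break be placed before the third cell of the buffer?"""
    return buffer[2] in BREAK_WORDS and buffer[1] not in FUNCTION_WORDS

def incremental_chunks(words):
    buffer, chunk, chunks = [], [], []
    for w in words:
        buffer.append(w.upper())
        if len(buffer) == 3:
            if break_before(buffer):
                chunk.extend(buffer[:2])
                chunks.append(' '.join(chunk))       # hand this chunk to the synthesizer
                chunk, buffer = [], buffer[2:]
            else:
                chunk.append(buffer.pop(0))          # shift one word out of the buffer
    chunks.append(' '.join(chunk + buffer))          # flush whatever remains at end of input
    return chunks

print(incremental_chunks('I WILL CALL BACK YOU WHEN I GET HOME PLS'.split()))
# ['I WILL CALL BACK YOU', 'WHEN I GET HOME', 'PLS']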
This paper identifies some linguistic properties of technical terminology, and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially those multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase.
The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
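A minimal sketch of this kind of filter, assuming POS-tagged input and an illustrative noun-phrase pattern and frequency threshold rather than the paper's exact algorithm, might look as follows in Python:

import re
from collections import Counter

def candidate_terms(tagged_words, min_freq=2):
    """tagged_words: (word, POS) pairs from any part-of-speech tagger."""
    code = {'ADJ': 'A', 'NOUN': 'N', 'ADP': 'P'}
    tags = ''.join(code.get(t, 'x') for _, t in tagged_words)
    counts = Counter()
    # overlapping matches of an adjective/noun (optionally noun-preposition-noun) shape
    for m in re.finditer(r'(?=((?:[AN]+|[AN]*NP[AN]*)N))', tags):
        span = m.group(1)
        if len(span) >= 2:                                   # keep multi-word candidates only
            words = [w for w, _ in tagged_words[m.start():m.start() + len(span)]]
            counts[' '.join(words).lower()] += 1
    return [term for term, n in counts.items() if n >= min_freq]   # repetition filter

tagged = [('cellular', 'ADJ'), ('automaton', 'NOUN'), ('model', 'NOUN'),
          ('of', 'ADP'), ('growth', 'NOUN'), ('.', 'PUNCT'),
          ('the', 'DET'), ('cellular', 'ADJ'), ('automaton', 'NOUN'), ('model', 'NOUN')]
print(candidate_terms(tagged))   # ['cellular automaton model', 'automaton model']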
This contribution presents a dialogue model built around an intelligent working memory, aimed at supporting robust human-machine dialogue in written natural language. The model has been designed as the core of an information-seeking dialogue application. A distinctive feature of the project is its reliance on pragmatic knowledge for interpretation and dialogue behaviour. Within this framework, the dialogue model acts as a kind of ‘forum’ for various facets of the dialogue, embodied by different models drawn from both intentional and structural approaches to conversation. The approach assumes that multiple sources of expertise are the key to flexibility and robustness, and that managing them requires an intelligent memory that keeps track of all events and links them together from as many angles as necessary. This idea is developed through an intelligent dialogue history that complements the wide coverage of the co-operating models: it is no longer a simple chronological record, but a communication area common to all processes. We illustrate the approach with examples drawn from collected corpora.
Inside parsing is a best-parse method based on the Inside algorithm, which is often used to estimate the probabilistic parameters of stochastic context-free grammars. It finds a best parse in O(N³G³) time, where N is the input size and G is the grammar size. The Earley algorithm can be made to return best parses with the same complexity in N.
Through experiments, we show that Inside parsing can be more efficient than Earley parsing for sufficiently large grammars and sufficiently short input sentences. For instance, Inside parsing is faster for sentences of 16 or fewer words with a grammar containing 429 states. In practice, parsing can be made efficient by selecting between the two methods.
The redundancy of the Inside algorithm can be reduced by top-down filtering using the chart produced by the Earley algorithm, which is useful when training the probabilistic parameters of a grammar. Extensive experiments on the Penn Treebank corpus show that the efficiency of the Inside computation can be improved by up to 55%.
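A minimal sketch of best-parse computation in this Inside/CKY style for a toy PCFG in Chomsky normal form (the grammar, probabilities and experimental setup below are illustrative, not the paper's): each chart cell stores, per nonterminal, the probability of the best subtree, and the triple loop over spans and split points gives the cubic behaviour in N noted above.

from collections import defaultdict

def viterbi_inside(words, lexical, binary, start='S'):
    """lexical: {(A, word): p}; binary: {(A, B, C): p} for rules A -> B C."""
    n = len(words)
    best = defaultdict(dict)                      # best[(i, j)][A] = (probability, backpointer)
    for i, w in enumerate(words):                 # width-1 spans from lexical rules
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1)][A] = (p, w)
    for width in range(2, n + 1):                 # longer spans, shortest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # split point
                for (A, B, C), p in binary.items():
                    if B in best[(i, k)] and C in best[(k, j)]:
                        prob = p * best[(i, k)][B][0] * best[(k, j)][C][0]
                        if prob > best[(i, j)].get(A, (0, None))[0]:
                            best[(i, j)][A] = (prob, (B, i, k, C, k, j))
    return best[(0, n)].get(start)                # (probability, backpointer), or None if no parse

# Toy grammar, for illustration only: best parse of 'time flies' is S -> NP V.
lexical = {('NP', 'time'): 0.4, ('V', 'flies'): 0.6, ('NP', 'flies'): 0.2}
binary = {('S', 'NP', 'VP'): 1.0, ('VP', 'V', 'NP'): 1.0, ('S', 'NP', 'V'): 0.3}
print(viterbi_inside(['time', 'flies'], lexical, binary))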
Morphological analysis, which is at the heart of natural language processing, requires computationally effective morphological processors. In this paper an approach to the organization of an inflectional morphological model and its application to the Russian language are described. The main objective of our morphological processor is not the classification of word constituents, but rather the efficient computational recognition of the morpho-syntactic features of words and the generation of words according to requested morpho-syntactic features. Another major concern that the processor aims to address is ease of extending the lexicon. The templated word-paradigm model used in the system has an engineering flavour: paradigm formation rules are of a bottom-up (word-specific) nature rather than general observations about the language, and word formation units are segments of words rather than proper morphemes. This approach allows us to handle general cases and exceptions uniformly, and requires extremely simple data structures and control mechanisms which can easily be implemented as finite-state automata. The morphological processor described in this paper is fully implemented for a substantial subset of Russian (more than 1,500,000 word-tokens – 95,000 word paradigms) and provides an extensive list of morpho-syntactic features together with stress positions for the words in its lexicon. Special dictionary management tools were built for browsing, debugging and extending the lexicon. The actual implementation was done in C and C++, and the system is available for the MS-DOS, MS-Windows and UNIX platforms.
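The word-paradigm idea can be sketched as follows in Python; the paradigm name, feature labels and endings are illustrative and do not reproduce the processor's actual tables: a paradigm is a table of ending segments indexed by morpho-syntactic features, and a lexicon entry pairs a stem with a paradigm.

PARADIGMS = {
    'noun_fem_a': {                       # e.g. книга 'book'
        ('sg', 'nom'): 'а', ('sg', 'gen'): 'и', ('sg', 'acc'): 'у',
        ('pl', 'nom'): 'и', ('pl', 'gen'): '',
    },
}
LEXICON = {'книг': 'noun_fem_a'}

def generate(stem, number, case):
    """Return the inflected form of a stem for the requested features."""
    return stem + PARADIGMS[LEXICON[stem]][(number, case)]

def analyse(form):
    """Return (stem, number, case) analyses of a surface form by trying every
    stem/ending split against the paradigm tables."""
    analyses = []
    for stem, pname in LEXICON.items():
        for (number, case), ending in PARADIGMS[pname].items():
            if form == stem + ending:
                analyses.append((stem, number, case))
    return analyses

print(generate('книг', 'sg', 'acc'))      # книгу
print(analyse('книги'))                   # [('книг', 'sg', 'gen'), ('книг', 'pl', 'nom')]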
Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ; φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Gamma distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
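A small numerical illustration of the mixture (not the paper's estimation procedure; the Gamma shape and scale values below are arbitrary): when the Poisson rate θ varies over documents according to a Gamma density, the marginal count distribution is Negative Binomial, and the adaptation probability Pr(x ≥ 2 | x ≥ 1) is much larger than a single Poisson with the same mean would predict.

import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)
shape, scale = 0.3, 2.0                        # Gamma parameters of the mixing density phi

# simulate per-document rates and counts for one word
theta = rng.gamma(shape, scale, size=200_000)  # theta varies from document to document
counts = rng.poisson(theta)                    # word count in each simulated document

# closed form: a Gamma-Poisson mixture is Negative Binomial with n = shape, p = 1/(1 + scale)
p = 1.0 / (1.0 + scale)
for k in range(4):
    print(k, (counts == k).mean(), nbinom.pmf(k, shape, p))   # empirical vs. analytic P(k)

# adaptation Pr(x >= 2 | x >= 1)
print((counts >= 2).sum() / (counts >= 1).sum())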
Every word in the lexicon of a natural language is used distinctly from all other words. A word expert is a small expert-system-like module for processing a particular word based on the other words in its vicinity. A word expert exploits the idiosyncratic nature of a word by using a set of context-testing decision rules that test the identity and placement of context words to infer the word's role in the passage.
The main application of word experts is disambiguating words. Work on word experts has never fully acknowledged previous related work, and a comprehensive review of that work would therefore contribute to the field. This paper both provides such a review and describes guidelines and considerations useful in the design and construction of word-expert-based systems.
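A minimal sketch of a context-testing word expert in Python; the rule format and the example rules are invented for illustration and are not taken from the reviewed systems:

def make_expert(rules, default):
    """rules: list of (offset, word_set, sense); the first rule whose test
    succeeds decides the sense of the target word at position i."""
    def expert(tokens, i):
        for offset, word_set, sense in rules:
            j = i + offset
            if 0 <= j < len(tokens) and tokens[j].lower() in word_set:
                return sense
        return default
    return expert

# A toy expert for the word 'bank'.
bank_expert = make_expert(
    rules=[(-1, {'river', 'west'}, 'bank/shore'),
           (+1, {'account', 'loan', 'teller'}, 'bank/institution'),
           (-2, {'savings'}, 'bank/institution')],
    default='bank/institution')

tokens = 'she sat on the river bank all afternoon'.split()
print(bank_expert(tokens, tokens.index('bank')))   # bank/shore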