There are currently two philosophies for building grammars and parsers: hand-crafted, wide-coverage grammars, and statistically induced grammars and parsers. Aside from the methodological differences in grammar construction, the linguistic knowledge that is overt in the rules of hand-crafted grammars is hidden in the statistics derived by probabilistic methods, which means that generalizations are also hidden and the full training process must be repeated for each domain. Although hand-crafted wide-coverage grammars are portable, they can be made more efficient when applied to limited domains, if it is recognized that language in limited domains is usually well constrained and certain linguistic constructions are more frequent than others. We view a domain-independent grammar as a repository of portable grammatical structures whose combinations are to be specialized for a given domain. We use Explanation-Based Learning (EBL) to identify the relevant subset of a hand-crafted general-purpose grammar (XTAG) needed to parse in a given domain (ATIS). We exploit the key properties of Lexicalized Tree-Adjoining Grammars to view parsing in a limited domain as finite-state transduction from strings to their dependency structures.
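The specialization step lends itself to a simple illustration. The sketch below is a frequency-based stand-in for the EBL step described above, assuming a hypothetical record of which grammar units a general grammar used when parsing a small domain sample; the unit names are placeholders, not XTAG elementary trees.

    # A schematic sketch of the specialization idea: record which grammar units a
    # general grammar actually uses when parsing a training sample from the domain,
    # then keep only the frequently used units for the runtime grammar.
    # The parse representation below is a placeholder, not XTAG's elementary trees.
    from collections import Counter

    # Hypothetical output of parsing a small ATIS-like training sample: for each
    # sentence, the list of grammar units its best parse used.
    training_parses = [
        ["NP_det_noun", "VP_verb_NP", "S_NP_VP"],
        ["NP_propernoun", "VP_verb_NP_PP", "S_NP_VP"],
        ["NP_det_noun", "VP_verb_NP", "S_NP_VP"],
    ]

    usage = Counter(unit for parse in training_parses for unit in parse)
    SPECIALIZED_GRAMMAR = {unit for unit, n in usage.items() if n >= 2}

    print(sorted(SPECIALIZED_GRAMMAR))
    # ['NP_det_noun', 'S_NP_VP', 'VP_verb_NP']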
Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.
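As a concrete, if minimal, illustration of regular expressions applied to tokenization, the sketch below uses Python's re module in place of a dedicated finite-state calculus; the token classes and their ordering are assumptions made for the example.

    # A minimal sketch of regular-expression-based tokenization, using Python's
    # re module as a stand-in for a dedicated finite-state calculus.
    import re

    TOKEN = re.compile(r"""
        \d+(\.\d+)?        # numbers, with an optional decimal part
      | \w+(-\w+)*         # words, allowing internal hyphens
      | [^\w\s]            # any single punctuation character
    """, re.VERBOSE)

    def tokenize(text):
        """Return the non-overlapping token matches, left to right."""
        return [m.group(0) for m in TOKEN.finditer(text)]

    print(tokenize("Finite-state methods cost 3.50 dollars, allegedly."))
    # ['Finite-state', 'methods', 'cost', '3.50', 'dollars', ',', 'allegedly', '.']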
A Lexical Transducer (LT), as defined by Karttunen, Kaplan and Zaenen (1992), is a specialized finite state transducer (FST) that relates citation forms of words and their morphological categories to inflected surface forms. Using LTs is advantageous because the same structure and algorithms can be used for both morphological analysis (stemming) and generation. Morphological processing (analysis and generation) is computationally faster, and the data for the process can be compacted more tightly than with other methods. The standard way to construct an LT consists of three steps: (1) constructing a simple finite state source lexicon LA which defines all valid canonical citation forms of the language; (2) describing morphological alternations by means of two-level rules, compiling the rules to FSTs, and intersecting them to form a single rule transducer RT; and (3) composing LA and RT.
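The three-step construction can be caricatured at the level of string relations. The sketch below is a toy illustration only: both the source lexicon and the rule component are ordinary Python mappings rather than genuine finite-state transducers, and the entries and rules are invented.

    # A toy, relation-level sketch of the construction: a tiny source lexicon,
    # a single "rule" rewriting intermediate forms to surface forms, and their
    # composition.  Real lexical transducers are finite-state; here both
    # components are plain Python mappings, purely for illustration.
    SOURCE_LEXICON = {
        "move+Verb+PastTense": "move^ed",   # lexical side -> intermediate form
        "big+Adj+Comparative": "big^er",
    }

    def apply_rules(intermediate):
        """Stand-in for the composed rule transducer RT (e-deletion, gemination)."""
        form = intermediate.replace("e^e", "e")     # move^ed -> moved
        form = form.replace("g^", "gg")             # big^er  -> bigger
        return form.replace("^", "")

    # Composition of the lexicon with RT, stored in both directions so the same
    # data serves generation (lexical -> surface) and analysis (surface -> lexical).
    GENERATE = {lex: apply_rules(inter) for lex, inter in SOURCE_LEXICON.items()}
    ANALYSE = {surface: lex for lex, surface in GENERATE.items()}

    print(GENERATE["move+Verb+PastTense"])   # moved
    print(ANALYSE["bigger"])                 # big+Adj+Comparative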
Finite state cascades represent an attractive architecture for parsing unrestricted text. Deterministic parsers specified by finite state cascades are fast and reliable. They can be extended at modest cost to construct parse trees with finite feature structures. Finally, such deterministic parsers do not necessarily involve trading off accuracy against speed: they may in fact be more accurate than exhaustive-search stochastic context-free parsers.
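A minimal sketch of the cascade idea, assuming a toy tagset and three illustrative levels: each level is a regular expression applied to the output of the previous one, bracketing progressively larger chunks.

    # A minimal sketch of a finite-state cascade over part-of-speech tags: each
    # level rewrites its matches in the output of the previous level.
    # The tagset and the patterns are illustrative assumptions.
    import re

    LEVELS = [
        ("NP", re.compile(r"(DT )?(JJ )*(NN )+")),        # level 1: noun chunks
        ("PP", re.compile(r"IN (NP )")),                  # level 2: PPs over NPs
        ("VP", re.compile(r"(VBD |VBZ )(NP |PP )*")),     # level 3: verb groups
    ]

    def cascade(tags):
        """Run the levels in order; each level rewrites its matches in place."""
        stream = " ".join(tags) + " "
        for label, pattern in LEVELS:
            stream = pattern.sub(label + " ", stream)
        return stream.strip()

    print(cascade(["DT", "JJ", "NN", "VBZ", "IN", "DT", "NN"]))
    # 'NP VP' -- noun chunks, then the PP, then the verb group, level by level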
A source of potential systematic errors in information retrieval is identified and discussed. These errors occur when base form reduction is applied with a (necessarily) finite dictionary. Formal methods for avoiding this source of error are presented, along with some practical complexities encountered in their implementation.
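The error source is easy to reproduce in miniature. In the sketch below, a base-form reducer backed by a small, finite dictionary silently leaves an unknown inflected form unreduced, so a query term and a document term that share a base form fail to match; the dictionary and the fallback policy are assumptions made for illustration.

    # A small sketch of the error source: base-form reduction with a finite
    # dictionary leaves unknown inflected forms unreduced, so a query term and
    # a document term sharing a base form may no longer match.
    LEMMA_DICT = {"cars": "car", "ran": "run", "indices": "index"}

    def reduce_term(term):
        return LEMMA_DICT.get(term.lower(), term.lower())   # fallback: keep as-is

    doc_terms   = {reduce_term(t) for t in ["Indices", "ran", "quickly"]}
    query_terms = {reduce_term(t) for t in ["index", "running"]}

    print(doc_terms & query_terms)
    # {'index'} -- 'running' is outside the dictionary, so it never meets 'ran'/'run'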
Researchers and industry are actively developing Software Agents (SAs), autonomous software that will assist users in achieving various tasks, collaborate with them, or even act on their behalf. To explore new interaction modes for SAs which need to be more sophisticated than simple exchanges of messages, we have analysed human conversations and elaborated an interaction approach for SAs based on a conversation model. Using this approach, we have developed a multi-agent system that simulates conversations involving SAs. We assume that SAs perform communicative acts to negotiate about mental states, such as beliefs and goals, turn-taking and special conversational sequences. We also assume that SAs respect communication protocols when they negotiate. In this paper, we describe the conceptual structure of communicative acts, the knowledge structures used to model a conversation, and the communication protocols. We show how an inference engine using ‘conversation-managing rules’ can be integrated into a conversational agent responsible for interpreting communicative acts, and we discuss the different kinds of rules that we propose. The prototype PSICO was implemented to simulate conversations on a computer platform.
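One way to picture a communication protocol of this kind is as a finite state machine over communicative acts. The sketch below is an illustration under that assumption; the acts and transitions are invented and are not the protocols defined in the paper.

    # A toy sketch of a communication protocol as a finite state machine over
    # communicative acts: each state lists the acts an agent may perform next.
    PROTOCOL = {
        "start":    {"propose": "proposed"},
        "proposed": {"accept": "agreed", "reject": "start", "counter": "proposed"},
        "agreed":   {},
    }

    def run(acts):
        """Check a sequence of acts against the protocol and return the final state."""
        state = "start"
        for act in acts:
            if act not in PROTOCOL[state]:
                raise ValueError(f"act '{act}' violates the protocol in state '{state}'")
            state = PROTOCOL[state][act]
        return state

    print(run(["propose", "counter", "accept"]))   # agreed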
Language tools that help people with their writing are now usually included in word processors. Although these tools provide increasing support to native speakers of a language, they are much less useful to non-native speakers who are writing in their second language (e.g. French speakers writing in English). Real errors may go undetected, and potential errors or non-errors that are flagged by the system may be taken to be genuine errors by the non-native speaker. In this paper, we present the prototype of an English writing tool aimed at helping speakers of French write in English. We first discuss the kinds of problems non-native speakers have when writing in a second language. We then explain how we collected a corpus of errors which we used to build the typology of errors needed in the various stages of the project. This is followed by an overview of the prototype, which contains a number of writing aids (dictionaries, on-line grammar help, a verb conjugator, etc.) and two checking tools: a problem word highlighter, which lists all the potentially difficult words that cannot be dealt with correctly by the system (false friends, confusions, etc.), and a grammar checker, which detects and corrects morphological and syntactic errors. We describe in detail the automata formalism we use to extract linguistic information, test syntactic environments, and detect and correct errors. Finally, we present a first evaluation of the correction capacity of our grammar checker as compared to that of commercially available systems.
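The two checking tools can be illustrated schematically. The sketch below pairs a word-list-driven problem word highlighter with a single surface pattern standing in for the automaton formalism; the word list and the example error pattern are assumptions, not the project's data.

    # An illustrative sketch of the two checking ideas: a problem-word highlighter
    # driven by a word list (false friends, confusions) and one surface pattern,
    # in place of the paper's automaton formalism, that tests a syntactic context.
    import re

    PROBLEM_WORDS = {
        "actually": "false friend of French 'actuellement' (= currently)",
        "library":  "often confused with 'librairie' (= bookshop)",
    }

    COMPLEMENT_ERROR = re.compile(r"\b(discuss)\s+about\b", re.IGNORECASE)

    def check(sentence):
        """Return (span, diagnostic) pairs for problem words and pattern hits."""
        flags = [(w, PROBLEM_WORDS[w])
                 for w in re.findall(r"\w+", sentence.lower()) if w in PROBLEM_WORDS]
        flags += [(m.group(0), "suggest: " + m.group(1))
                  for m in COMPLEMENT_ERROR.finditer(sentence)]
        return flags

    print(check("Actually we will discuss about the library schedule."))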
This paper discusses different issues in the construction and knowledge representation of an intelligent dictionary help system. The Intelligent Dictionary Help System (IDHS) is conceived as a monolingual (explanatory) dictionary system for human use (Artola and Evrard, 1992). The fact that it is intended for people rather than for automatic processing distinguishes it from other systems dealing with the acquisition of semantic knowledge from conventional dictionaries. The system provides various ways of accessing the data, making it possible to deduce implicit knowledge from the explicit dictionary information. IDHS deals with reasoning mechanisms analogous to those used by humans when they consult a dictionary. The user-level functionality of the system has been specified and a prototype has been implemented (Agirre et al., 1994a). A methodology for the extraction of semantic knowledge from a conventional dictionary is described. The method followed in the construction of the phrasal pattern hierarchies required by the parser (Alshawi, 1989) is based on an empirical study carried out on the structure of definition sentences. The results of its application to a real dictionary have shown that the parsing method is particularly suited to the analysis of short definition sentences, as was the case with the source dictionary. As a result of this process, the characterization of the different lexical-semantic relations between senses is established by means of semantic rules (attached to the patterns); these rules are used for the initial construction of the Dictionary Knowledge Base (DKB). The representation schema proposed for the DKB (Agirre et al., 1994b) is basically a semantic network of frames representing word senses. After construction of the initial DKB, several enrichment processes are performed to add new facts to it; these processes are based on the exploitation of the properties of lexical-semantic relations, and also on specially conceived deduction mechanisms. The results of the enrichment processes show the suitability of the chosen representation schema for deducing implicit knowledge. Erroneous deductions are mainly due to incorrect word sense disambiguation.
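The enrichment idea can be illustrated with a toy fragment of such a network. The sketch below assumes invented word senses and hypernym links and adds the transitive closure of the hypernymy relation as explicit facts, in the spirit of exploiting the properties of lexical-semantic relations.

    # A small sketch of the kind of enrichment described above: word senses in a
    # semantic network, with new facts deduced from properties of lexical-semantic
    # relations (here, transitivity of hypernymy).  The senses and links are invented.
    HYPERNYM = {"oak_1": "tree_1", "tree_1": "plant_1"}

    def hypernym_chain(sense):
        """Follow hypernym links upward and collect every ancestor sense."""
        chain = []
        while sense in HYPERNYM:
            sense = HYPERNYM[sense]
            chain.append(sense)
        return chain

    # Enrichment: store the transitive closure as explicit facts in the DKB.
    ENRICHED = {sense: hypernym_chain(sense) for sense in HYPERNYM}
    print(ENRICHED["oak_1"])   # ['tree_1', 'plant_1']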
Operating system command languages assist the user in executing commands for a significant number of common everyday tasks. Likewise, the introduction of textual command languages for robots has provided the opportunity to perform some important functions that lead-through programming cannot readily accomplish. However, such command languages assume the user to be expert enough to carry out a specific task in these application domains. In contrast, a natural language interface to such command languages, apart from lending itself to integration into a future speech interface, can facilitate and broaden the use of these command languages for a larger audience. In this paper, advanced techniques are presented for an adaptive natural language interface that can (a) be portable to a large range of command languages, (b) handle even complex commands thanks to an embedded linguistic parser, and (c) be expandable and customizable by providing the casual user with the opportunity to specify some types of new words and the system developer with the ability to introduce new tasks in these application domains. Finally, to demonstrate the above techniques in practice, an example of their application to a Greek natural language interface to the MS-DOS operating system is given.
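A toy illustration of the mapping from natural language to a command language is given below, using MS-DOS as in the paper's example application. The surface patterns stand in for the embedded linguistic parser and are assumptions made for the sketch.

    # A toy sketch of translating natural-language commands into an operating-system
    # command language (MS-DOS, following the paper's example application).
    # The two patterns and their translations are illustrative assumptions.
    import re

    RULES = [
        (re.compile(r"copy (?:the )?file (\S+) to (?:the )?(?:directory |folder )?(\S+)", re.I),
         lambda m: f"COPY {m.group(1)} {m.group(2)}"),
        (re.compile(r"(?:show|list) (?:me )?(?:the )?(?:contents of )?(?:directory |folder )?(\S+)", re.I),
         lambda m: f"DIR {m.group(1)}"),
    ]

    def translate(utterance):
        for pattern, build in RULES:
            m = pattern.search(utterance)
            if m:
                return build(m)
        return None   # hand the utterance to a fuller parser or a clarification dialogue

    print(translate("Please copy the file report.txt to the directory C:\\DOCS"))
    # COPY report.txt C:\DOCS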
Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised acquisition of rules which guess possible parts of speech for unknown words. This technique does not require specially prepared training data, but instead uses the lexicon supplied with a tagger and word frequencies collected from a raw corpus. Three complementary sets of word-guessing rules are statistically induced: prefix morphological rules, suffix morphological rules and ending-guessing rules. The acquisition process is closely tied to a guessing-rule evaluation methodology dedicated solely to the performance of part-of-speech guessers. Using the proposed technique, a guessing-rule induction experiment was performed on the Brown Corpus data, and rule-sets with highly competitive performance were produced and compared with the state of the art. To evaluate the impact of the word-guessing component on overall tagging performance, it was integrated into a stochastic and a rule-based tagger and applied to texts with unknown words.
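The flavour of ending-guessing-rule induction can be conveyed with a much simplified sketch: collect, from the tagger's own lexicon, how often each word ending co-occurs with each part of speech, and keep the endings that predict a tag with high relative frequency. The toy lexicon, ending length and threshold below are assumptions, and the corpus-frequency side of the technique is omitted.

    # A simplified sketch of ending-guessing-rule induction from a tagger lexicon.
    # The lexicon, ending length and confidence threshold are illustrative.
    from collections import defaultdict

    LEXICON = {"quickly": "ADV", "slowly": "ADV", "happily": "ADV",
               "walked": "VBD", "jumped": "VBD", "painted": "VBD",
               "payment": "NN", "movement": "NN", "agreement": "NN"}

    def induce_ending_rules(lexicon, length=2, threshold=0.9):
        counts = defaultdict(lambda: defaultdict(int))
        for word, tag in lexicon.items():
            if len(word) > length:
                counts[word[-length:]][tag] += 1
        rules = {}
        for ending, tag_counts in counts.items():
            best_tag, freq = max(tag_counts.items(), key=lambda kv: kv[1])
            if freq / sum(tag_counts.values()) >= threshold:
                rules[ending] = best_tag
        return rules

    RULES = induce_ending_rules(LEXICON)
    print(RULES.get("boldly"[-2:]))   # ADV, guessed for a word outside the lexicon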
This paper describes NL-OOPS, a CASE tool that supports requirements analysis by generating object oriented models from natural language requirements documents. The full natural language analysis is obtained using as a core system the Natural Language Processing System LOLITA. The object oriented analysis module implements an algorithm for the extraction of the objects and their associations for use in creating object models.
In this article, we describe AIMS (Assisted Indexing at Mississippi State), a system intended to aid human document analysts in the assignment of indexes to physical chemistry journal articles. The two major components of AIMS are a natural language processing (NLP) component and an index generation (IG) component. We provide an overview of what each of these components does and how it works. We also present the results of a recent evaluation of our system in terms of recall and precision. The recall rate is the proportion of the ‘correct’ indexes (i.e. those produced by human document analysts) that are generated by AIMS. The precision rate is the proportion of the generated indexes that are correct. Finally, we describe some of the future work planned for this project.
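For concreteness, here is a small worked example of the two measures, with invented index sets:

    # A worked example of the recall and precision figures defined above,
    # with made-up index sets purely for illustration.
    correct_indexes   = {"enthalpy", "phase transition", "heat capacity", "entropy"}
    generated_indexes = {"enthalpy", "phase transition", "solvent", "entropy", "pressure"}

    hits = correct_indexes & generated_indexes
    recall    = len(hits) / len(correct_indexes)     # 3/4 = 0.75
    precision = len(hits) / len(generated_indexes)   # 3/5 = 0.60

    print(f"recall={recall:.2f} precision={precision:.2f}")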
Recently, most part-of-speech tagging approaches, such as rule-based, probabilistic and neural network approaches, have shown very promising results. In this paper, we are particularly interested in probabilistic approaches, which usually require large amounts of training data to obtain reliable probabilities. We alleviate this restriction of probabilistic approaches by introducing a fuzzy network model that provides a method for estimating more reliable model parameters from a small amount of training data. Experiments with the Brown corpus show that the performance of the fuzzy network model is much better than that of the hidden Markov model when only a limited amount of training data is available.
We describe new applications of the theory of automata to natural language processing: the representation of very large scale dictionaries and the indexation of natural language texts. They are based on new algorithms that we introduce and describe in detail. In particular, we give pseudocodes for the determinisation of string to string transducers, the deterministic union of p-subsequential string to string transducers, and the indexation by automata. We report on several experiments illustrating the applications.
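The first application, compact representation of very large dictionaries, can be sketched with a plain (non-minimized) trie, that is, a deterministic automaton whose states share common prefixes; real implementations also minimize the automaton to share suffixes, a step omitted here.

    # A minimal sketch of representing a word list as a trie: a deterministic
    # automaton whose states share common prefixes.  Minimization (suffix sharing)
    # and the transducer and indexation algorithms of the paper are not shown.
    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["#"] = True          # end-of-word marker (an accepting state)
        return root

    def lookup(trie, word):
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "#" in node

    TRIE = build_trie(["car", "card", "care", "cart"])
    print(lookup(TRIE, "care"), lookup(TRIE, "ca"))   # True False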
This paper addresses the problem of the distribution of words and phrases in text, a problem of great general interest and of importance for many practical applications. Existing models for word distribution treat the observed sequences of words in text documents as the outcome of stochastic processes; the corresponding distributions of the numbers of word occurrences in documents are modelled as mixtures of Poisson distributions whose parameter values are fitted to the data. We pursue a linguistically motivated approach to statistical language modelling and use observable text characteristics as model parameters. Multi-word technical terms, which are intrinsically content entities, are chosen for experimentation. Their occurrence and occurrence dynamics are investigated using a 100-million-word data collection comprising about 13,000 technical documents of various kinds. The derivation of models describing word distribution in text is based on a linguistic interpretation of the process of text formation, with the probabilities of word occurrence being functions of observable and linguistically meaningful text characteristics. The adequacy of the proposed models for describing actually observed distributions of words and phrases in text is confirmed experimentally. The paper has two focuses: one is the modelling of the distributions of content words and phrases among different documents; the other is word occurrence dynamics within documents and the estimation of the corresponding probabilities. Accordingly, the application areas for the new modelling paradigm include information retrieval and speech recognition.
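For orientation, the sketch below evaluates the baseline the paper starts from: the probability of k occurrences of a term in a document under a mixture of two Poisson distributions. The parameter values are invented; the paper's own models replace such fitted parameters with observable, linguistically meaningful text characteristics.

    # A small sketch of the mixture-of-Poissons baseline: the probability of k
    # occurrences of a term in a document under a two-component Poisson mixture.
    # The parameter values below are invented for illustration.
    from math import exp, factorial

    def poisson(k, lam):
        return exp(-lam) * lam**k / factorial(k)

    def two_poisson(k, pi, lam_elite, lam_other):
        """Mixture: a share pi of 'elite' documents about the topic, the rest not."""
        return pi * poisson(k, lam_elite) + (1 - pi) * poisson(k, lam_other)

    for k in range(5):
        print(k, round(two_poisson(k, pi=0.2, lam_elite=4.0, lam_other=0.3), 4))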