This paper discusses the nature, history and current characteristics of Language Engineering, which is contrasted with Natural Language Processing and Computational Linguistics, and which is shown to have attained its own distinct identity in recent years. Major trends in the field are examined, including its focus on large-scale practical tasks and on quantitative evaluation of progress, and its willingness to embrace a diverse range of techniques. The importance of software engineering in this context is noted, as are some sociological aspects of the practitioner group.
Word tokenization and segmentation remain an active topic in natural language processing, especially for languages such as Chinese, in which no blank space delimits words. Three major problems arise: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionaries and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods do not address these problems together. To tackle all three at once, this paper presents a novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and the non-boundaries of Chinese words, and arrives at a word segmentation by resolving ambiguities and detecting new words.
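As a rough illustration of how mutual information between adjacent characters can suggest word boundaries, the sketch below estimates pointwise mutual information (PMI) from a toy corpus and splits wherever the score drops below a threshold. This is not the Splitting-Merging Model itself: the function names, the threshold, and the Latin-letter toy corpus are assumptions made for the example, and the dictionary lookup, ambiguity resolution and new-word detection described in the abstract are omitted.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Estimate PMI for every adjacent character pair seen in the corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def split_on_low_pmi(sentence, pmi, threshold):
    """Start a new word wherever the adjacent-pair PMI is below the threshold.

    Pairs never seen in the corpus are treated as boundaries (an assumption).
    """
    if not sentence:
        return []
    words = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        if pmi.get((a, b), float("-inf")) < threshold:
            words.append(b)   # weak association: boundary
        else:
            words[-1] += b    # strong association: non-boundary
    return words


# Toy corpus: "ab" and "cd" recur as units, "bc" appears only once.
corpus = ["ab", "ab", "cd", "cd", "abcd"]
pmi = adjacent_pmi(corpus)
print(split_on_low_pmi("abcd", pmi, threshold=2.0))
```

In this toy setting the pairs inside the recurring units score higher than the pair that straddles them, so the threshold separates the two. A real system would need a much larger corpus and a principled way of choosing the threshold.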
Semantically tagging a corpus is useful for many intermediate NLP tasks, such as acquisition of word argument structures in sublanguages, acquisition of syntactic disambiguation cues, and terminology learning. The general idea is that semantic tags allow the generalization of observed word patterns, and facilitate the discovery of recurrent sublanguage phenomena and selectional rules of various types. Yet, as opposed to POS tags in morphology, there is no consensus in the literature about the type and granularity of the semantic tags to be used. In this paper, we argue that an appropriate selection of semantic tags should be domain-dependent. We propose a method by which we select from WordNet an inventory of semantic tags that are ‘optimal’ for a given corpus, according to a scoring function defined as a linear combination of general and corpus-dependent performance factors. We believe that an optimal selection of a category inventory is a necessary premise for obtaining better results in all lexical learning algorithms that are based on, or concerned with, semantic categorization of words. Furthermore, an adequate inventory (one which intuitively ‘fits’ with the semantics of a domain, e.g. phenomenon for Natural Science, or part, piece for a technical handbook) may facilitate the manual annotation of large corpora.
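The idea of scoring candidate tag inventories with a linear combination of factors can be sketched as follows. The abstract does not name the factors or their weights, so `general_factor`, `corpus_factor`, the weights, and the two candidate inventories are all hypothetical placeholders chosen only to show the shape of the computation.

```python
def score_inventory(factors, weights):
    """Linear combination of performance factors for one candidate inventory."""
    return sum(weights[name] * value for name, value in factors.items())


def best_inventory(candidates, weights):
    """Return the name of the highest-scoring candidate inventory."""
    return max(candidates, key=lambda name: score_inventory(candidates[name], weights))


# Hypothetical weights and factor values (not taken from the paper).
weights = {"general_factor": 0.5, "corpus_factor": 0.5}
candidates = {
    "coarse_inventory": {"general_factor": 0.9, "corpus_factor": 0.2},
    "fine_inventory": {"general_factor": 0.3, "corpus_factor": 0.9},
}
print(best_inventory(candidates, weights))
```

Shifting weight towards the corpus-dependent factor favours inventories tuned to the domain, which is in the spirit of the domain-dependence argued for in the abstract.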
Quantified expressions present an interesting case for understanding pronominal reference because the quantifier which appears in the expression affects how the pronoun will be understood. The quantifier determines which sets will be placed in focus and predicts the relation between the sentence containing the quantifier and the following sentence (Moxey and Sanford 1993a). This information is essential to understanding the pronominal reference. In this paper, we discuss how a natural language processing system can take advantage of this information to understand pronominal references to quantified expressions.
This paper presents a number of linguistic and computational issues identified during the implementation of a general use grammar checker for contemporary Brazilian Portuguese, ReGra, that has been incorporated in the word processor REDATOR by Itautec/Philco (Brazil). Two main strategies were employed in the implementation of correction rules: an error-driven, localist approach based on the identification of patterns indicative of grammatical mistakes; and a more generic approach that requires automatic syntactic analysis. In this discussion, particular emphasis is given to the development of a parser based on a phrase structure grammar comprising over 600 production rules. As for the computational performance, ReGra permits texts to be revised at a rate of ca. 200 words per second.
Most existing Natural Language Database Interfaces (NLDB) were designed to be used with database systems that provide very limited facilities for manipulating time-dependent data, and they do not adequately support temporal linguistic mechanisms (verb tenses, temporal adverbials, temporal subordinate clauses, etc.). The database community is becoming increasingly interested in temporal database systems, which are intended to store and manipulate, in a principled manner, information not only about the present but also about the past and future. When interfacing to temporal databases, supporting temporal linguistic mechanisms becomes crucial.
We present a framework for constructing Natural Language Interfaces for Temporal Databases (NLTDB), which draws on research in tense and aspect theories, temporal logics and temporal databases. The framework consists of a temporal intermediate representation language, called TOP, an HPSG grammar that maps a wide range of questions involving temporal mechanisms to appropriate TOP expressions, and a provably correct method for translating from TOP to TSQL2, TSQL2 being a recently proposed temporal extension of the SQL database language. This framework was employed to implement a prototype NLTDB.
This paper describes the work carried out at the Center for Sprogteknologi in Copenhagen to validate the LE evaluation methodology developed by the LRE project TEMAA. TEMAA has developed a framework for the evaluation of LE products, implemented in a Parameterisable Testbed (PTB). The framework allows for a modular, formal and flexible description of user requirements and objects of evaluation; it accommodates test methods of various kinds and provides a methodology for assessing test results in the light of the requirements expressed by different user types. While the fundamentals of the TEMAA framework are meant to apply to adequacy evaluation of LE products in general, a detailed methodology has been worked out for the evaluation of spelling and grammar checkers, and applied to the concrete evaluation of Danish and Italian spelling checkers. The main focus of this paper is on showing that the general methodology provides a valid model for designing and carrying out a concrete evaluation, as in the case study on Danish spelling checkers.
The statistical induction of stochastic context-free grammars from bracketed corpora with the Inside-Outside algorithm is an appealing method for grammar learning, but the computational complexity of this algorithm has made it impossible to generate a large-scale grammar. Researchers from natural language processing and speech recognition have suggested various methods to reduce the computational complexity and, at the same time, guide the learning algorithm towards a solution by, for example, placing constraints on the grammar. We suggest a method that strongly reduces the computational cost of the algorithm without placing constraints on the grammar. This method can in principle be combined with any of the constraints on grammars that have been suggested in earlier studies. We show that it is feasible to achieve results equivalent to earlier research, but with much lower computational effort. After creating a small grammar, the grammar is incrementally increased while rules that have become obsolete are removed at the same time. We explain the modifications to the algorithm, give results of experiments and compare these to results reported in other publications.
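To make the cost of the Inside-Outside algorithm more concrete, the sketch below computes only the inside pass for a stochastic context-free grammar in Chomsky normal form. It is the textbook computation, not the modified algorithm this abstract proposes, and the bracket handling (skipping spans that cross a given bracket) is one common way of using bracketed data, assumed here for illustration. The function name, the dictionary encodings of the rules, and the toy grammar are likewise assumptions.

```python
from collections import defaultdict


def inside_probability(words, lexical, binary, start="S", brackets=None):
    """Inside pass for a stochastic CFG in Chomsky normal form.

    lexical:  {(A, word): prob} for rules A -> word
    binary:   {(A, B, C): prob} for rules A -> B C
    brackets: optional set of (i, j) spans, end-exclusive; any span that
              crosses one of them contributes zero probability.
    """
    n = len(words)
    brackets = brackets or set()

    def crosses(i, j):
        # Two spans cross when they overlap without one containing the other.
        return any(i < a < j < b or a < i < b < j for a, b in brackets)

    beta = defaultdict(float)  # beta[(i, j, A)] = P(A derives words[i:j])
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, A)] += p

    # The nested loops over spans, split points and rules are what make the
    # algorithm cubic in sentence length and linear in the number of rules.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            if crosses(i, j):
                continue
            for (A, B, C), p in binary.items():
                for k in range(i + 1, j):
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta[(0, n, start)]


# Toy grammar: S -> S S (0.5), S -> 'a' (0.5).
lexical = {("S", "a"): 0.5}
binary = {("S", "S", "S"): 0.5}
print(inside_probability(["a", "a", "a"], lexical, binary))
print(inside_probability(["a", "a", "a"], lexical, binary, brackets={(0, 2)}))
```

With the bracket over the first two words, the right-branching analysis is excluded, so only one of the two parses contributes and the second value is half the first. Because every binary rule is tried at every span and split point, the work grows with the size of the rule set, which is why a large grammar is costly.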