Search results for Artificial Intelligence and Natural Language Processing

Combining n-grams and deep convolutional features for language variety classification
Matej Martinc, Senja Pollak
Journal:

Natural Language Engineering / Volume 25 / Issue 5 / September 2019

Published online by Cambridge University Press:

18 July 2019, pp. 607-632
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community.

Neural morphosyntactic tagging for Rusyn
Yves Scherrer, Achim Rabus
Journal:

Natural Language Engineering / Volume 25 / Issue 5 / September 2019

Published online by Cambridge University Press:

18 July 2019, pp. 633-650
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 as compared to previous research.

Designing a virtual patient dialogue system based on terminology-rich resources: Challenges and evaluation
Leonardo Campillos-Llanos, Catherine Thomas, Éric Bilinski, Pierre Zweigenbaum, Sophie Rosset
Journal:

Natural Language Engineering / Volume 26 / Issue 2 / March 2020

Published online by Cambridge University Press:

15 July 2019, pp. 183-220
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Virtual patient software allows health professionals to practise their skills by interacting with tools simulating clinical scenarios. A natural language dialogue system can provide natural interaction for medical history-taking. However, the large number of concepts and terms in the medical domain makes the creation of such a system a demanding task. We designed a dialogue system that stands out from current research by its ability to handle a wide variety of medical specialties and clinical cases. To address the task, we designed a patient record model, a knowledge model for the task and a termino-ontological model that hosts structured thesauri with linguistic, terminological and ontological knowledge. We used a frame- and rule-based approach and terminology-rich resources to handle the medical dialogue. This work focuses on the termino-ontological model, the challenges involved and how the system manages resources for the French language. We adopted a comprehensive approach to collect terms and ontological knowledge, and dictionaries of affixes, synonyms and derivational variants. Resources include domain lists containing over 161,000 terms, and dictionaries with over 959,000 word/concept entries. We assessed our approach by having 71 participants (39 medical doctors and 32 non-medical evaluators) interact with the system and use 35 cases from 18 specialities. We conducted a quantitative evaluation of all components by analysing interaction logs (11,834 turns). Natural language understanding achieved an F-measure of 95.8%. Dialogue management provided on average 74.3 (±9.5)% of correct answers. We performed a qualitative evaluation by collecting 171 five-point Likert scale questionnaires. All evaluated aspects obtained mean scores above the Likert mid-scale point. We analysed the vocabulary coverage with regard to unseen cases: the system covered 97.8% of their terms. Evaluations showed that the system achieved high vocabulary coverage on unseen cases and was assessed as relevant for the task.

Detecting light verb constructions across languages
István Nagy T., Anita Rácz, Veronika Vincze
Journal:

Natural Language Engineering / Volume 26 / Issue 3 / May 2020

Published online by Cambridge University Press:

15 July 2019, pp. 319-348
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Light verb constructions (LVCs) are verb and noun combinations in which the verb has lost its meaning to some degree and the noun is used in one of its original senses, typically denoting an event or an action. They exhibit special linguistic features, especially when regarded in a multilingual context. In this paper, we focus on the automatic detection of LVCs in raw text in four different languages, namely, English, German, Spanish, and Hungarian. First, we analyze the characteristics of LVCs from a linguistic point of view based on parallel corpus data. Then, we provide a standardized (i.e., language-independent) representation of LVCs that can be used in machine learning experiments. After, we experiment on identifying LVCs in different languages: we exploit language adaptation techniques which demonstrate that data from an additional language can be successfully employed in improving the performance of supervised LVC detection for a given language. As there are several annotated corpora from several domains in the case of English and Hungarian, we also investigate the effect of simple domain adaptation techniques to reduce the gap between domains. Furthermore, we combine domain adaptation techniques with language adaptation techniques for these two languages. Our results show that both out-domain and additional language data can improve performance. We believe that our language adaptation method may have practical implications in several fields of natural language processing, especially in machine translation.

Term evaluation metrics in imbalanced text categorization
Behzad Naderalvojoud, Ebru Akcapinar Sezer
Journal:

Natural Language Engineering / Volume 26 / Issue 1 / January 2020

Published online by Cambridge University Press:

12 July 2019, pp. 31-47
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
This paper proposes four novel term evaluation metrics to represent documents in the text categorization where class distribution is imbalanced. These metrics are achieved from the revision of the four common term evaluation metrics: chi-square, information gain, odds ratio, and relevance frequency. While the common metrics require a balanced class distribution, our proposed metrics evaluate the document terms under an imbalanced distribution. They calculate the degree of relatedness of terms with respect to minor and major classes by considering their imbalanced distribution. Using these metrics in the document representation makes a better distinction between the documents of the minor and major classes and improves the performance of machine learning algorithms. The proposed metrics are assessed over three popular benchmarks (two subsets of Reuters-21578 and WebKB) by using four classification algorithms: support vector machines, naive Bayes, decision trees, and centroid-based classifiers. Our empirical results indicate that the proposed metrics outperform the common metrics in the imbalanced text categorization.

NLP commercialisation in the last 25 years
Robert Dale
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

15 May 2019, pp. 419-426
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
The Journal of Natural Language Engineering is now in its 25th year. The editorial preface to the first issue emphasised that the focus of the journal was to be on the practical application of natural language processing (NLP) technologies: the time was ripe for a serious publication that helped encourage research ideas to find their way into real products. The commercialisation of NLP technologies had already started by that point, but things have advanced tremendously over the last quarter-century. So, to celebrate the journal’s anniversary, we look at how commercial NLP products have developed over the last 25 years.

NLE volume 25 issue 3 Cover and Back matter
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

15 May 2019, pp. b1-b2
- Article
- - You have access
- PDF
- Export citation

Annotation projection for temporal information extraction
Chris R. Giannella, Ransom K. Winder, Joseph P. Jubinski
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

15 May 2019, pp. 385-403
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Approaches to building temporal information extraction systems typically rely on large, manually annotated corpora. Thus, porting these systems to new languages requires acquiring large corpora of manually annotated documents in the new languages. Acquiring such corpora is difficult owing to the complexity of temporal information extraction annotation. One strategy for addressing this difficulty is to reduce or eliminate the need for manually annotated corpora through annotation projection. This technique utilizes a temporal information extraction system for a source language (typically English) to automatically annotate the source language side of a parallel corpus. It then uses automatically generated word alignments to project the annotations, thereby creating noisily annotated target language training data. We developed an annotation projection technique for producing target language temporal information extraction systems. We carried out an English (source) to French (target) case study wherein we compared a French temporal information extraction system built using annotation projection with one built using a manually annotated French corpus. While annotation projection has been applied to building other kinds of Natural Language Processing tools (e.g., Named Entity Recognizers), to our knowledge, this is the first paper examining annotation projection as applied to temporal information extraction where no manual corrections of the target language annotations were made. We found that, even using manually annotated data to build a temporal information extraction system, F-scores were relatively low (<0.35), which suggests that the problem is challenging even with manually annotated data. Our annotation projection approach performed well (relative to the system built from manually annotated data) on some aspects of temporal information extraction (e.g., event–document creation time temporal relation prediction), but it performed poorly on the other kinds of temporal relation prediction (e.g., event–event and event–time).

NLE volume 25 issue 3 Cover and Front matter
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

15 May 2019, pp. f1-f2
- Article
- - You have access
- PDF
- Export citation

Anniversary article: Then and now: 25 years of progress in natural language engineering
John Tait, Yorick Wilks
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

15 May 2019, pp. 405-418
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
The paper reviews the state of the art of natural language engineering (NLE) around 1995, when this journal first appeared, and makes a critical comparison with the current state of the art in 2018, as we prepare the 25th Volume. Specifically the then state of the art in parsing, information extraction, chatbots, and dialogue systems, speech processing and machine translation are briefly reviewed. The emergence in the 1980s and 1990s of machine learning (ML) and statistical methods (SM) is noted. Important trends and areas of progress in the subsequent years are identified. In particular, the move to the use of n-grams or skip grams and/or chunking with part of speech tagging and away from whole sentence parsing is noted, as is the increasing dominance of SM and ML. Some outstanding issues which merit further research are briefly pointed out, including metaphor processing and the ethical implications of NLE.

Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus
Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Francesca Delogu, Kepa J. Rodriguez, Massimo Poesio
Journal:

Natural Language Engineering / Volume 26 / Issue 1 / January 2020

Published online by Cambridge University Press:

07 May 2019, pp. 95-128
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
This paper presents the second release of arrau, a multigenre corpus of anaphoric information created over 10 years to provide data for the next generation of coreference/anaphora resolution systems combining different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of arrau include the following: treating all NPs as markables, including non-referring NPs, and annotating their (non-) referentiality status; distinguishing between several categories of non-referentiality and annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. The current version of the dataset contains 350K tokens and is publicly available from LDC. In this paper, we discuss in detail all the distinguishing features of the corpus, so far only partially presented in a number of conference and workshop papers, and we also discuss the development between the first release of arrau in 2008 and this second one.

A new approach for textual feature selection based on N-composite isolated labels
Samir Elloumi
Journal:

Natural Language Engineering / Volume 26 / Issue 2 / March 2020

Published online by Cambridge University Press:

29 April 2019, pp. 221-243
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Textual Feature Selection (TFS) aims to extract relevant parts or segments from text as being the most relevant ones w.r.t. the information it expresses. The selected features are useful for automatic indexing, summarization, document categorization, knowledge discovery, so on. Regarding the huge amount of electronic textual data daily published, many challenges related to the semantic aspect as well as the processing efficiency are addressed. In this paper, we propose a new approach for TFS based on Formal Concept Analysis background. Mainly, we propose to extract textual features by exploring the regularities in a formal context where isolated points exist. We introduce the notion of N-composite isolated points as a set of N words to be considered as a unique textual feature. We show that a reduced value of N (between 1 and 3) allows extracting significant textual features compared with existing approaches even for non-completely covering an initial formal context.

Query-based summarization of discussion threads
Suzan Verberne, Emiel Krahmer, Sander Wubben, Antal van den Bosch
Journal:

Natural Language Engineering / Volume 26 / Issue 1 / January 2020

Published online by Cambridge University Press:

16 April 2019, pp. 3-29
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
In this paper, we address query-based summarization of discussion threads. New users can profit from the information shared in the forum, Please check if the inserted city and country names in the affiliations are correct. if they can find back the previously posted information. However, discussion threads on a single topic can easily comprise dozens or hundreds of individual posts. Our aim is to summarize forum threads given real web search queries. We created a data set with search queries from a discussion forum’s search engine log and the discussion threads that were clicked by the user who entered the query. For 120 thread–query combinations, a reference summary was made by five different human raters. We compared two methods for automatic summarization of the threads: a query-independent method based on post features, and Maximum Marginal Relevance (MMR), a method that takes the query into account. We also compared four different word embeddings representations as alternative for standard word vectors in extractive summarization. We find (1) that the agreement between human summarizers does not improve when a query is provided that: (2) the query-independent post features as well as a centroid-based baseline outperform MMR by a large margin; (3) combining the post features with query similarity gives a small improvement over the use of post features alone; and (4) for the word embeddings, a match in domain appears to be more important than corpus size and dimensionality. However, the differences between the models were not reflected by differences in quality of the summaries created with help of these models. We conclude that query-based summarization with web queries is challenging because the queries are short, and a click on a result is not a direct indicator for the relevance of the result.

Robust stylometric analysis and author attribution based on tones and rimes
Renkui Hou, Chu-Ren Huang
Journal:

Natural Language Engineering / Volume 26 / Issue 1 / January 2020

Published online by Cambridge University Press:

10 April 2019, pp. 49-71
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
In this article, we propose an innovative and robust approach to stylometric analysis without annotation and leveraging lexical and sub-lexical information. In particular, we propose to leverage the phonological information of tones and rimes in Mandarin Chinese automatically extracted from unannotated texts. The texts from different authors were represented by tones, tone motifs, and word length motifs as well as rimes and rime motifs. Support vector machines and random forests were used to establish the text classification model for authorship attribution. From the results of the experiments, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes can discriminate the texts from different authors effectively when using random forests to establish the classification model. This robust approach can in principle be applied to other languages with established phonological inventory of onset and rimes.

Integrating LSA-based hierarchical conceptual space and machine learning methods for leveling the readability of domain-specific texts
Hou-Chiang Tseng, Berlin Chen, Tao-Hsing Chang, Yao-Ting Sung
Journal:

Natural Language Engineering / Volume 25 / Issue 3 / May 2019

Published online by Cambridge University Press:

05 April 2019, pp. 331-361
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Text readability assessment is a challenging interdisciplinary endeavor with rich practical implications. It has long drawn the attention of researchers internationally, and the readability models since developed have been widely applied to various fields. Previous readability models have only made use of linguistic features employed for general text analysis and have not been sufficiently accurate when used to gauge domain-specific texts. In view of this, this study proposes a latent-semantic-analysis (LSA)-constructed hierarchical conceptual space that can be used to train a readability model to accurately assess domain-specific texts. Compared with a baseline reference using a traditional model, the new model improves by 13.88% to achieve 68.98% of accuracy when leveling social science texts, and by 24.61% to achieve 73.96% of accuracy when assessing natural science texts. We then combine the readability features developed for the current study with general linguistic features, and the accuracy of leveling social science texts improves by an even higher degree of 31.58% to achieve 86.68%, and that of natural science texts by 26.56% to achieve 75.91%. These results indicate that the readability features developed in this study can be used both to train a readability model for leveling domain-specific texts and also in combination with the more common linguistic features to enhance the efficacy of the model. Future research can expand the generalizability of the model by assessing texts from different fields and grade levels using the proposed method, thus enhancing the practical applications of this new method.

NLE volume 25 issue 2 Cover and Front matter
Journal:

Natural Language Engineering / Volume 25 / Issue 2 / March 2019

Published online by Cambridge University Press:

01 April 2019, pp. f1-f2
- Article
- - You have access
- PDF
- Export citation

GANs vs. good enough
Kenneth Ward Church
Journal:

Natural Language Engineering / Volume 25 / Issue 2 / March 2019

Published online by Cambridge University Press:

01 April 2019, pp. 323-329
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
General Adversarial Networks are hot. Given Murphy’s Law, it is prudent to be paranoid. Best not to design for the average case. There is a long tradition of designing for the hundred-year flood (and five 9s reliability). What is good enough? Historically, the market hasn’t been willing to pay for five 9s. Hard to justify upfront costs for future benefits that will only payoff under unlikely scenarios, and might not work when needed. If the market isn’t willing to pay for five 9s, can we afford to design for the worst case?

Effectiveness of data-driven induction of semantic spaces and traditional classifiers for sarcasm detection
Mattia Antonino Di Gangi, Giosué Lo Bosco, Giovanni Pilato
Journal:

Natural Language Engineering / Volume 25 / Issue 2 / March 2019

Published online by Cambridge University Press:

01 April 2019, pp. 257-285
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
Irony and sarcasm are two complex linguistic phenomena that are widely used in everyday language and especially over the social media, but they represent two serious issues for automated text understanding. Many labeled corpora have been extracted from several sources to accomplish this task, and it seems that sarcasm is conveyed in different ways for different domains. Nonetheless, very little work has been done for comparing different methods among the available corpora. Furthermore, usually, each author collects and uses their own datasets to evaluate his own method. In this paper, we show that sarcasm detection can be tackled by applying classical machine-learning algorithms to input texts sub-symbolically represented in a Latent Semantic space. The main consequence is that our studies establish both reference datasets and baselines for the sarcasm detection problem that could serve the scientific community to test newly proposed methods.

NLE volume 25 issue 2 Cover and Back matter
Journal:

Natural Language Engineering / Volume 25 / Issue 2 / March 2019

Published online by Cambridge University Press:

01 April 2019, pp. b1-b2
- Article
- - You have access
- PDF
- Export citation

Weighted finite-state transducers for normalization of historical texts
Izaskun Etxeberria, Iñaki Alegria, Larraitz Uria
Journal:

Natural Language Engineering / Volume 25 / Issue 2 / March 2019

Published online by Cambridge University Press:

01 April 2019, pp. 307-321
- Article
- - Get access
    
    Check if you have access via personal or institutional login
    
    Log in Register
- Export citation
This paper presents a study about methods for normalization of historical texts. The aim of these methods is learning relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we have tried to read the results related to the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared to previously published ones. Our method learns to map phonological changes using a noisy channel model; it is a simple solution that can use a limited amount of supervision in order to achieve adequate performance. The compiled corpora are ready to be used for other researchers in order to compare results. Concerning the amount of supervision for the task, we investigate how the size of training corpus affects the results and identify some interesting factors to anticipate the difficulty of the task.

Artificial Intelligence and Natural Language Processing

Refine search

Refine search

Actions for selected content:

3241 results in Artificial Intelligence and Natural Language Processing

Combining n-grams and deep convolutional features for language variety classification

Neural morphosyntactic tagging for Rusyn

Designing a virtual patient dialogue system based on terminology-rich resources: Challenges and evaluation

Detecting light verb constructions across languages

Term evaluation metrics in imbalanced text categorization

NLP commercialisation in the last 25 years

NLE volume 25 issue 3 Cover and Back matter

Annotation projection for temporal information extraction

NLE volume 25 issue 3 Cover and Front matter

Anniversary article: Then and now: 25 years of progress in natural language engineering

Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus

A new approach for textual feature selection based on N-composite isolated labels

Query-based summarization of discussion threads

Robust stylometric analysis and author attribution based on tones and rimes

Integrating LSA-based hierarchical conceptual space and machine learning methods for leveling the readability of domain-specific texts

NLE volume 25 issue 2 Cover and Front matter

GANs vs. good enough

Effectiveness of data-driven induction of semantic spaces and traditional classifiers for sarcasm detection

NLE volume 25 issue 2 Cover and Back matter

Weighted finite-state transducers for normalization of historical texts

Artificial Intelligence and Natural Language Processing

Refine search

Refine search

Actions for selected content:

Save Search

3241 results in Artificial Intelligence and Natural Language Processing