Annotation projection for temporal information extraction
- Chris R. Giannella, Ransom K. Winder, Joseph P. Jubinski
-
- Journal:
- Natural Language Engineering / Volume 25 / Issue 3 / May 2019
- Published online by Cambridge University Press:
- 15 May 2019, pp. 385-403
-
Approaches to building temporal information extraction systems typically rely on large, manually annotated corpora. Thus, porting these systems to new languages requires acquiring large corpora of manually annotated documents in the new languages. Acquiring such corpora is difficult owing to the complexity of temporal information extraction annotation. One strategy for addressing this difficulty is to reduce or eliminate the need for manually annotated corpora through annotation projection. This technique utilizes a temporal information extraction system for a source language (typically English) to automatically annotate the source language side of a parallel corpus. It then uses automatically generated word alignments to project the annotations, thereby creating noisily annotated target language training data. We developed an annotation projection technique for producing target language temporal information extraction systems. We carried out an English (source) to French (target) case study wherein we compared a French temporal information extraction system built using annotation projection with one built using a manually annotated French corpus. While annotation projection has been applied to building other kinds of Natural Language Processing tools (e.g., Named Entity Recognizers), to our knowledge, this is the first paper examining annotation projection as applied to temporal information extraction where no manual corrections of the target language annotations were made. We found that, even when the temporal information extraction system was built from manually annotated data, F-scores were relatively low (<0.35), suggesting that the problem itself is challenging.
Our annotation projection approach performed well (relative to the system built from manually annotated data) on some aspects of temporal information extraction (e.g., event–document creation time temporal relation prediction), but it performed poorly on the other kinds of temporal relation prediction (e.g., event–event and event–time).
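The core projection step described in the abstract can be illustrated with a minimal sketch: source-side annotation spans are mapped through word-alignment pairs onto target-side token indices. The function name, span format, and example sentences below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of annotation projection via word alignments.
# Source-side token spans carry labels; alignment pairs map source
# token indices to target token indices; a projected span covers the
# aligned target tokens.

def project_annotations(spans, alignments):
    """spans: list of (start, end, label) over source tokens (end exclusive).
    alignments: list of (src_idx, tgt_idx) word-alignment pairs.
    Returns projected (start, end, label) spans over target tokens."""
    projected = []
    for start, end, label in spans:
        tgt = sorted(t for s, t in alignments if start <= s < end)
        if tgt:  # spans with no aligned target tokens are dropped
            projected.append((tgt[0], tgt[-1] + 1, label))
    return projected

# Toy example: English "met yesterday" aligned to French "s'est réuni hier".
spans = [(0, 1, "EVENT"), (1, 2, "TIMEX3")]
alignments = [(0, 1), (0, 2), (1, 3)]
print(project_annotations(spans, alignments))  # [(1, 3, 'EVENT'), (3, 4, 'TIMEX3')]
```

Because the alignments are produced automatically, unaligned or misaligned source spans yield dropped or shifted target annotations, which is the source of the noise the abstract mentions.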
Dropped personal pronoun recovery in Chinese SMS
- Chris Giannella, Ransom Winder, Stacy Petersen
-
- Journal:
- Natural Language Engineering / Volume 23 / Issue 6 / November 2017
- Published online by Cambridge University Press:
- 30 May 2017, pp. 905-927
-
In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres like Short Message Service messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of restoring dropped personal pronouns wherein only the person numbers are identified. After applying a word segmenter, we used a linear-chain conditional random field to predict which words were at the start of an independent clause. Then, using the independent clause start information, as well as lexical and syntactic information, we applied a conditional random field or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese Short Message Service messages. Our approaches substantially outperformed a rule-based approach based partially on rules developed by Chung and Gildea (2010, Effects of Empty Categories on Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 636–45). Our approaches also outperformed (though by a considerably smaller margin) a machine-learning approach based closely on work by Yang, Liu, and Xue (2015, Recovering Dropped Pronouns from Chinese Text Messages. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. pp. 309–13). Features derived from parsing largely did not help our approaches.
We conclude that, given independent clause start information, the parse information we used was largely superfluous for identifying dropped personal pronouns.
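The two-stage flow in the abstract, where stage-1 clause-start predictions feed the per-word pronoun classifier as features, can be sketched as follows. The feature names and the toy sentence are illustrative assumptions, not the paper's actual feature set.

```python
# Hypothetical sketch of the stage-2 feature extraction: clause-start
# predictions from the stage-1 CRF become features for the classifier
# that decides whether a dropped pronoun precedes each word and, if so,
# its person number.

def pronoun_features(words, clause_starts, i):
    """Features for word position i.
    words: segmented tokens; clause_starts: booleans from stage 1."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "clause_start": clause_starts[i],  # stage-1 CRF output
        "sent_initial": i == 0,
    }

words = ["去", "看", "电影", "吗"]           # "(Are you) going to see a movie?"
clause_starts = [True, False, False, False]  # from the stage-1 predictor
print(pronoun_features(words, clause_starts, 0))
```

Feature dictionaries of this shape are the standard input format for linear-chain CRF and maximum-entropy toolkits, so either stage-2 model the abstract mentions could consume them directly.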
(Un/Semi-)supervised SMS text message SPAM detection
- Chris R. Giannella, Ransom Winder, Brandon Wilson
-
- Journal:
- Natural Language Engineering / Volume 21 / Issue 4 / August 2015
- Published online by Cambridge University Press:
- 15 October 2014, pp. 553-567
-
We address the problem of unsupervised and semi-supervised SMS (Short Message Service) text message SPAM detection. We develop a content-based Bayesian classification approach which is a modest extension of the technique discussed by Resnik and Hardisty (2010). The approach assumes that the bodies of the SMS messages arise from a probabilistic generative model and estimates the model parameters by Gibbs sampling using an unlabeled, or partially labeled, SMS training message corpus. The approach classifies new SMS messages as SPAM or HAM (non-SPAM) by zero-thresholding their logit estimates. We tested the approach on a publicly available SMS corpus collected from the UK. Used in semi-supervised fashion, the approach clearly outperformed a competing algorithm, Semi-Boost. Used in unsupervised fashion, the approach outperformed a fully supervised classifier, an SVM (Support Vector Machine), when the number of training messages used by the SVM was small, and performed comparably otherwise. We believe the approach is a useful tool for SMS SPAM detection.
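The final decision rule in the abstract, zero-thresholding a logit estimate, can be sketched in a few lines. The per-word likelihoods and the prior below are stand-ins for the parameters the paper estimates by Gibbs sampling; all values and names here are illustrative assumptions.

```python
import math

# Hypothetical sketch of the decision rule only: score a message by its
# logit under toy per-word SPAM/HAM likelihoods and a class prior
# (stand-ins for the Gibbs-sampled model parameters), then label it
# SPAM when the logit exceeds zero (i.e., P(SPAM | message) > 0.5).

def spam_logit(tokens, p_word_spam, p_word_ham, prior_spam=0.5):
    logit = math.log(prior_spam / (1.0 - prior_spam))
    for w in tokens:
        # Small floor for out-of-vocabulary words (illustrative smoothing).
        logit += math.log(p_word_spam.get(w, 1e-6)) - math.log(p_word_ham.get(w, 1e-6))
    return logit

def classify(tokens, p_word_spam, p_word_ham):
    return "SPAM" if spam_logit(tokens, p_word_spam, p_word_ham) > 0 else "HAM"

# Toy parameters (illustrative only).
p_spam = {"free": 0.05, "prize": 0.04, "meet": 0.001}
p_ham = {"free": 0.002, "prize": 0.001, "meet": 0.03}
print(classify(["free", "prize"], p_spam, p_ham))  # SPAM
print(classify(["meet"], p_spam, p_ham))           # HAM
```

Zero-thresholding the logit is equivalent to the usual maximum-a-posteriori decision between the two classes; the substance of the paper's method lies in how the parameters are estimated from unlabeled or partially labeled data, which this sketch does not attempt.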