
A Semi-automatic and low-cost method to learn patterns for named entity recognition*

M. MARRERO and J. URBANO

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.
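The idea of aligning the symbols of discovered entities and generalizing them into a regular expression can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract does not specify the character features, the alignment procedure, or the clustering step, so the three character classes, the helper names `to_symbols` and `align_to_regex`, and the use of `difflib.SequenceMatcher` as the aligner are all assumptions made only for illustration.

```python
import re
from difflib import SequenceMatcher


def char_class(ch):
    """Map a character to a coarse feature symbol (an assumed feature set)."""
    if ch.isupper():
        return "[A-Z]"
    if ch.islower():
        return "[a-z]"
    if ch.isdigit():
        return "[0-9]"
    return re.escape(ch)  # any other character is kept as a literal


def to_symbols(entity):
    """Turn an entity string into a sequence of symbols, collapsing repeats."""
    symbols = []
    for ch in entity:
        cls = char_class(ch)
        if not symbols or symbols[-1] != cls:
            symbols.append(cls)
    return symbols


def align_to_regex(a, b):
    """Align two symbol sequences and emit one regex covering both.

    Symbols shared by both sequences are required (one or more times);
    symbols present in only one sequence become optional groups.
    """
    parts = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(sym + "+" for sym in a[i1:i2])
        else:
            for segment in (a[i1:i2], b[j1:j2]):
                if segment:
                    body = "".join(sym + "+" for sym in segment)
                    parts.append("(?:" + body + ")?")
    return "".join(parts)


# Two hypothetical entity mentions, e.g. product codes found during annotation.
pattern = align_to_regex(to_symbols("AB-123"), to_symbols("AB123"))
print(pattern)
print(bool(re.fullmatch(pattern, "XY-7")))
```

In this sketch the hyphen appears in only one of the two examples, so it ends up optional in the generated pattern, while the uppercase and digit runs are required. A real system would also need the clustering and active learning steps the abstract mentions, which are not modelled here.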


This work was partially supported by the Spanish Government through a Juan de la Cierva fellowship and project MDM-2015-0502. We especially thank Jorge Morato and Sonia Sánchez for their advice, as well as the anonymous reviewers for their suggestions.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering