A semi-automatic and low-cost method to learn patterns for named entity recognition*

M. MARRERO and J. URBANO

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.
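
As a rough illustration of what aligning symbols across discovered entities can look like, the following Python sketch maps each character to a feature symbol and aligns two example entities to produce a regular expression that covers both. It is not the authors' algorithm: the feature set (U, L, D), the function names, the example strings, and the use of Python's difflib.SequenceMatcher for the alignment are all assumptions made for illustration, and the clustering and active learning steps are omitted.

```python
import re
from difflib import SequenceMatcher

# Hypothetical character features (an assumption, not taken from the paper):
# each symbol stands for a regex character class.
CLASSES = {"U": "[A-Z]", "L": "[a-z]", "D": "[0-9]"}


def to_symbols(text):
    """Map each character to a feature symbol; other characters stay literal."""
    symbols = []
    for ch in text:
        if "A" <= ch <= "Z":
            symbols.append("U")
        elif "a" <= ch <= "z":
            symbols.append("L")
        elif "0" <= ch <= "9":
            symbols.append("D")
        else:
            symbols.append(ch)  # punctuation, spaces, non-ASCII characters
    return symbols


def symbol_regex(symbol):
    """Translate one symbol into a regex fragment."""
    return CLASSES.get(symbol, re.escape(symbol))


def align_to_regex(a, b):
    """Align the symbol sequences of two entities and emit one regex.

    Aligned symbols are kept, substitutions become alternations, and
    symbols present in only one entity become optional groups.
    """
    sa, sb = to_symbols(a), to_symbols(b)
    matcher = SequenceMatcher(None, sa, sb, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = "".join(symbol_regex(s) for s in sa[i1:i2])
        right = "".join(symbol_regex(s) for s in sb[j1:j2])
        if tag == "equal":
            parts.append(left)
        elif tag == "replace":
            parts.append(f"(?:{left}|{right})")
        else:  # "insert" or "delete": only one side has symbols
            parts.append(f"(?:{left or right})?")
    return "".join(parts)


# Two made-up date-like entities, used only for illustration.
pattern = align_to_regex("12/05/2016", "3/7/2016")
print(pattern)
```

By construction the resulting pattern fully matches both input entities; how far it generalizes to unseen entities depends on the alignment found, which is the part a real pattern learner has to control.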

* This work was partially supported by the Spanish Government through a Juan de la Cierva fellowship and project MDM-2015-0502. We especially thank Jorge Morato and Sonia Sánchez for their advice, as well as the anonymous reviewers for their suggestions.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering