
A semi-automatic and low-cost method to learn patterns for named entity recognition*

Published online by Cambridge University Press: 15 June 2017

Barcelona Supercomputing Center, Carrer de Jordi Girona, 29-31, 08034 Barcelona, Spain
Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands


Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full-text documents. The patterns used to recognize entities can be rule-based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet little research has been devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and patterns in a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show that the method can generate effective patterns at a low annotation cost, and that it can successfully help in the annotation of brand-new corpora.
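
To make the idea of aligning symbols concrete, the following is a minimal sketch and not the method of the paper. It assumes three illustrative character features (upper-case letter, lower-case letter, digit), uses a plain longest-common-subsequence alignment as a stand-in for the alignment algorithm, and leaves out clustering and active learning altogether. The function names `symbols`, `align` and `generalize` are hypothetical.

```python
import re


def symbols(entity):
    """Map an entity to a list of regex symbols, collapsing runs of one class.

    The three character classes are an assumption made for this sketch.
    """
    out = []
    for ch in entity:
        if "A" <= ch <= "Z":
            sym, is_class = "[A-Z]+", True
        elif "a" <= ch <= "z":
            sym, is_class = "[a-z]+", True
        elif "0" <= ch <= "9":
            sym, is_class = "[0-9]+", True
        else:
            # Any other character is kept as an escaped literal.
            sym, is_class = re.escape(ch), False
        if is_class and out and out[-1] == sym:
            continue  # same class as the previous character: extend the run
        out.append(sym)
    return out


def align(a, b):
    """Align two symbol lists with a longest-common-subsequence table.

    Returns (symbol, shared) pairs in merged order; shared is True when
    the symbol occurs at that aligned position in both lists.
    """
    n, m = len(a), len(b)
    # lcs[i][j] holds the LCS length of a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    merged, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            merged.append((a[i], True))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            merged.append((a[i], False))
            i += 1
        else:
            merged.append((b[j], False))
            j += 1
    merged.extend((s, False) for s in a[i:])
    merged.extend((s, False) for s in b[j:])
    return merged


def generalize(entity_a, entity_b):
    """Build one regex that matches both entities.

    Shared symbols stay mandatory; unshared symbols become optional.
    """
    pairs = align(symbols(entity_a), symbols(entity_b))
    return "".join(s if shared else f"(?:{s})?" for s, shared in pairs)


# Hypothetical flight-code entities, used only for illustration.
pattern = generalize("IB-3402", "BA 117")
print(bool(re.fullmatch(pattern, "LH 2045")))    # True
print(bool(re.fullmatch(pattern, "flight 12")))  # False
```

The pattern learned from two examples also accepts unseen strings of the same shape, such as `LH 2045`, while rejecting strings that lack a shared symbol. Making every unshared symbol optional is a deliberately crude generalization; it over-accepts in some cases, for example by allowing both separators in a row.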

Copyright © Cambridge University Press 2017 




This work was partially supported by the Spanish Government through a Juan de la Cierva fellowship and project MDM-2015-0502. We especially thank Jorge Morato and Sonia Sánchez for their advice, as well as the anonymous reviewers for their suggestions.

