The effect of morphology in named entity recognition with sequence tagging

  • ONUR GÜNGÖR (a1), TUNGA GÜNGÖR (a2) and SUZAN ÜSKÜDARLI (a2)
Abstract

This work proposes a sequential tagger for named entity recognition in morphologically rich languages. Several schemes for representing the morphological analysis of a word in the context of named entity recognition are examined. Word representations are formed by concatenating word and character embeddings with morphological embeddings built according to these schemes. The impact of these representations is measured by training and evaluating a sequential tagger composed of a conditional random field layer on top of a bidirectional long short-term memory layer. Experiments with Turkish, Czech, Hungarian, Finnish and Spanish yield state-of-the-art results for all five languages, indicating that representing morphological information improves performance.
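The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding tables hold random values, the averaging encoders stand in for learned character and morphology encoders, the bidirectional LSTM is omitted entirely, and the emission and transition scores passed to the decoder are hand-picked for illustration. All function names, tag names and dimensions here are assumptions. The sketch shows only two ideas: a word representation built by concatenating word, character and morphological embeddings, and Viterbi decoding of the kind a conditional random field layer performs over per-token label scores.

```python
import random


def make_table(keys, dim, seed):
    """Build a toy embedding table with random vectors (illustrative values only)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in keys}


def average(vectors, dim):
    """Average a list of vectors; return zeros when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def represent(word, tags, word_table, char_table, tag_table, dims):
    """Concatenate word, character and morphological embeddings for one token.

    Averaging is an assumed stand-in for learned character/morphology encoders.
    """
    word_dim, char_dim, tag_dim = dims
    w = word_table.get(word, [0.0] * word_dim)
    c = average([char_table[ch] for ch in word if ch in char_table], char_dim)
    m = average([tag_table[t] for t in tags if t in tag_table], tag_dim)
    return w + c + m  # list concatenation


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions: one dict per token mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score; missing pairs score 0.
    """
    labels = list(emissions[0])
    best = {y: emissions[0][y] for y in labels}
    back = []
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = best[prev] + transitions.get((prev, y), 0.0) + em[y]
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical token with a hypothetical morphological tag set.
dims = (4, 3, 2)
word_table = make_table(["ankara"], dims[0], seed=0)
char_table = make_table("ankara", dims[1], seed=1)
tag_table = make_table(["Noun", "Prop", "Loc"], dims[2], seed=2)
vector = represent("ankara", ["Noun", "Prop", "Loc"], word_table, char_table, tag_table, dims)

# Hand-picked scores; in a real tagger these would come from the BiLSTM and CRF parameters.
emissions = [
    {"B-PER": 2.0, "I-PER": 0.0, "O": 1.0},
    {"B-PER": 0.0, "I-PER": 1.5, "O": 1.0},
    {"B-PER": 0.0, "I-PER": 0.0, "O": 2.0},
]
transitions = {("O", "I-PER"): -10.0, ("B-PER", "I-PER"): 1.0}
decoded = viterbi(emissions, transitions)
```

In this sketch the transition scores are what let the decoder prefer a well-formed label sequence over independent per-token choices, which is the usual motivation for placing a conditional random field layer on top of the recurrent layer.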

Footnotes

†This research was supported by Boğaziçi University Research Fund (BAP) under Grant 13083.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering