Improving mention detection for Basque based on a deep error analysis

  • ANDER SORALUZE, OLATZ ARREGI, XABIER ARREGI and ARANTZA DÍAZ DE ILARRAZA
Abstract

This paper describes the process of improving a mention detector for Basque. The system is rule-based and takes into account the characteristics of mentions in Basque. We propose a classification of error types based on the errors that occur during mention detection, present a deep error analysis that distinguishes error types and their causes, and propose improvements. At the final stage, the system obtains an F-measure of 74.57% under the Exact Matching protocol and of 80.57% under Lenient Matching. We also report the performance of the mention detector with gold standard data as input, in order to exclude errors introduced by the earlier stages of linguistic processing. In this scenario, we obtain an F-measure of 85.89% with Exact Matching and of 89.06% with Lenient Matching, i.e., a difference of 11.32 and 8.49 percentage points, respectively, over the results obtained with automatically processed input. Finally, we analyse how improvements in mention detection affect coreference resolution.
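The two evaluation protocols named in the abstract can be illustrated with a minimal sketch. The abstract does not define Lenient Matching, so the code below assumes, purely for illustration, that a predicted mention counts as a lenient match when it shares at least one token with a gold mention; the paper's actual criterion may differ. The `Span` type, the `overlaps` and `f_measure` helpers, and the toy spans are all hypothetical. Only the four F-measure figures at the end come from the abstract.

```python
from typing import Set, Tuple

# A mention is a token span (start, end), both inclusive. Hypothetical representation.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def f_measure(gold: Set[Span], predicted: Set[Span], lenient: bool = False) -> float:
    """F-measure of predicted mentions against gold mentions.

    Exact matching: boundaries must be identical.
    Lenient matching (ASSUMPTION, not the paper's definition): any token overlap.
    """
    if not gold or not predicted:
        return 0.0
    if lenient:
        matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
        matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    else:
        matched_pred = matched_gold = len(gold & predicted)
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: one exact hit, one boundary error, one spurious and one missed mention.
gold = {(0, 2), (5, 5), (8, 10)}
predicted = {(0, 2), (5, 6), (12, 13)}

exact_f = f_measure(gold, predicted)
lenient_f = f_measure(gold, predicted, lenient=True)

# Gap between gold-standard and automatic input, using the figures in the abstract.
exact_gap = round(85.89 - 74.57, 2)
lenient_gap = round(89.06 - 80.57, 2)
```

In this toy case the boundary error `(5, 6)` is penalised under exact matching but credited under the assumed lenient criterion, which is why lenient scores are never lower than exact ones.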

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering