Tagging Icelandic text: A linguistic rule-based approach

Published online by Cambridge University Press: 29 April 2008

Hrafn Loftsson*
Affiliation:
School of Computer Science, Reykjavik University, IS-103 Reykjavik, Iceland. hrafn@ru.is

Abstract

Icelandic is a morphologically complex language for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging of Icelandic text. The system has two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a global heuristics component. The heuristics guess the functional roles of the words in a sentence, mark prepositional phrases, and use the acquired knowledge to force feature agreement where appropriate. IceMorphy guesses the tag profile of unknown words and automatically fills tag profile gaps in the lexicon. Evaluation shows that IceTagger achieves 91.54% accuracy, a substantial improvement on the highest accuracy, 90.44%, obtained using three state-of-the-art data-driven taggers. Furthermore, accuracy increases to 92.95% when IceTagger is combined with two data-driven taggers in a simple voting scheme. The tagging system took only seven man-months to develop, which can be considered a short development period for a linguistic rule-based system.
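
The abstract does not spell out how the "simple voting scheme" works, so the following is only a minimal sketch of one common form of it: per-token majority voting over the outputs of three taggers. The function name `vote`, the `fallback` parameter, the tie-breaking rule (defer to one designated tagger when no tag has a majority), and the placeholder tag strings are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def vote(tag_sequences: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several taggers by majority vote.

    tag_sequences[i][j] is the tag that tagger i proposes for token j.
    If no tag wins a strict majority for a token, the tag proposed by the
    tagger at index `fallback` is used (an assumed tie-breaking rule).
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for proposals in zip(*tag_sequences):
        tag, count = Counter(proposals).most_common(1)[0]
        # Accept the most frequent tag only if more than half the taggers agree.
        combined.append(tag if count > len(proposals) / 2 else proposals[fallback])
    return combined


# Hypothetical outputs for a three-token sentence (placeholder tags).
rule_based = ["noun", "verb", "adj"]
data_driven_1 = ["noun", "adv", "adj"]
data_driven_2 = ["noun", "adv", "noun"]
print(vote([rule_based, data_driven_1, data_driven_2]))
```

With three taggers, a strict majority means that at least two agree. When all three disagree, this sketch defers to the tagger listed first, which would typically be the one with the highest standalone accuracy.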

Type: Research Article
Copyright © Cambridge University Press 2008

Loftsson, Hrafn. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31.1 ▪▪–▪▪.