Tagging Icelandic text: A linguistic rule-based approach

Hrafn Loftsson

doi:10.1017/S0332586508001820

Published online by Cambridge University Press: 29 April 2008

Affiliation: School of Computer Science, Reykjavik University, IS-103 Reykjavik, Iceland. Email: hrafn@ru.is

Abstract

Icelandic is a morphologically complex language for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging of Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a global heuristics component. The heuristics guess the functional roles of the words in a sentence, mark prepositional phrases, and use the acquired knowledge to force feature agreement where appropriate. IceMorphy guesses the tag profile for unknown words and automatically fills tag profile gaps in the lexicon. Evaluation shows that IceTagger achieves 91.54% accuracy, a substantial improvement on the highest accuracy, 90.44%, obtained using three state-of-the-art data-driven taggers. Furthermore, accuracy increases to 92.95% when IceTagger is combined with two data-driven taggers in a simple voting scheme. The tagging system took only seven man-months to develop, a short development period for a linguistic rule-based system.
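
The simple voting scheme mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `vote`, the toy tags, and the tie-breaking rule (fall back to one designated tagger when no tag wins a majority) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def vote(tags_per_tagger, fallback_index=0):
    """Combine per-token tags from several taggers by majority vote.

    tags_per_tagger: list of tag sequences, one per tagger, all of equal length.
    fallback_index: which tagger to trust when no tag has a strict majority
                    (an assumed tie-breaking rule, not taken from the paper).
    """
    combined = []
    # zip(*...) groups the candidate tags proposed for each token position.
    for candidates in zip(*tags_per_tagger):
        tag, count = Counter(candidates).most_common(1)[0]
        if count * 2 > len(candidates):
            combined.append(tag)  # strict majority agrees on this tag
        else:
            combined.append(candidates[fallback_index])  # no majority
    return combined


# Toy single-letter tags (hypothetical, not the Icelandic tagset).
rule_based = ["N", "V", "A"]
tagger_b = ["N", "A", "P"]
tagger_c = ["V", "A", "D"]

print(vote([rule_based, tagger_b, tagger_c]))  # ['N', 'A', 'A']
```

In this sketch the third token has three different candidate tags, so the output falls back to the first tagger in the list. Which tagger should serve as the fallback is a design choice that a real evaluation would have to settle.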

Keywords

data-driven tagging; disambiguator; linguistic rule-based tagging; simple voting; unknown word guesser

Type: Research Article
Information: Nordic Journal of Linguistics, Volume 31, Issue 1, June 2008, pp. 47–72

DOI: https://doi.org/10.1017/S0332586508001820
Copyright: © Cambridge University Press 2008
