
Modernising historical Slovene words

Published online by Cambridge University Press:  03 August 2015

LATL-CUI, Université de Genève, 7 route de Drize, 1227 Carouge, Switzerland
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia


We propose a language-independent word normalisation method and exemplify it on modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from them, which contain pairs of historical and modern words. The two lexicons are disjoint: one serves as the training set, with 40,000 entries, and the other as the test set, with 20,000 entries. The data span the years 1750–1900, and the lexicons are split into fifty-year slices, with all experiments carried out separately on the three time periods. We perform two sets of experiments. In the first, a supervised setting, we build a CSMT system using the lexicon of word pairs as training data. In the second, an unsupervised setting, we simulate a scenario in which word pairs are not available: we propose a two-step method that first extracts a noisy list of word pairs by matching historical words with cognate modern words, and then trains a CSMT system on these pairs. In both sets of experiments, we also optionally use a lexicon of modern words to filter the modernisation hypotheses. Both methods produce significantly better results than the baselines, but their accuracy, and which of the two works best, correlate strongly with the age of the texts, so the choice of method will depend on the properties of the historical language to be modernised. As an extrinsic evaluation, we also compare the quality of part-of-speech tagging and lemmatisation applied directly to historical text with that obtained on its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.
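To make the character-level setting more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that words are presented to the translation system as space-separated character sequences (a common convention in character-level SMT) and that lexicon filtering means picking the highest-ranked hypothesis attested in a modern word list. The function names, the example pair `zhlovek` → `človek`, and the tiny lexicon are all hypothetical illustrations; none of them are taken from the paper's data.

```python
from typing import Iterable, List, Set


def to_char_sequence(word: str) -> str:
    """Split a word into space-separated characters so that a phrase-based
    SMT system treats each character as a 'token' (assumed input format)."""
    return " ".join(word)


def from_char_sequence(sequence: str) -> str:
    """Join a space-separated character sequence back into a word."""
    return sequence.replace(" ", "")


def filter_with_lexicon(hypotheses: List[str], modern_lexicon: Set[str]) -> str:
    """Return the highest-ranked hypothesis that occurs in the modern lexicon.

    `hypotheses` is assumed to be an n-best list ordered from best to worst.
    If no hypothesis is attested, fall back to the top-ranked one
    (an assumed policy, chosen here for illustration).
    """
    if not hypotheses:
        raise ValueError("expected at least one hypothesis")
    for candidate in hypotheses:
        if candidate in modern_lexicon:
            return candidate
    return hypotheses[0]


def prepare_training_pairs(pairs: Iterable[tuple]) -> List[tuple]:
    """Convert (historical, modern) word pairs into character-level
    'sentence' pairs suitable as parallel training data."""
    return [(to_char_sequence(h), to_char_sequence(m)) for h, m in pairs]


if __name__ == "__main__":
    # Hypothetical word pair in an older orthography (illustration only).
    print(prepare_training_pairs([("zhlovek", "človek")]))
    # Hypothetical n-best output of a CSMT system, already joined into words.
    n_best = ["clovek", "človek", "zlovek"]
    print(filter_with_lexicon(n_best, {"človek", "beseda"}))
```

In this sketch the lexicon acts only as a re-ranking filter over an existing n-best list; how the filter interacts with the decoder's scores in the actual system is not stated in the abstract.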

Copyright © Cambridge University Press 2015 


