Modernising historical Slovene words

YVES SCHERRER and TOMAŽ ERJAVEC

Abstract

We propose a language-independent word normalisation method and exemplify it on modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from them, which contain pairs of historical and modern words. The two lexicons are disjoint: one serves as the training set, with 40,000 entries, and the other as the test set, with 20,000 entries. The data span the years 1750–1900, and the lexicons are split into fifty-year slices, with all experiments carried out separately on the three time periods. We perform two sets of experiments. In the first, a supervised setting, we build a CSMT system using the lexicon of word pairs as training data. In the second, an unsupervised setting, we simulate a scenario in which word pairs are not available: we propose a two-step method in which we first extract a noisy list of word pairs by matching historical words with cognate modern words, and then train a CSMT system on these pairs. In both sets of experiments, we also optionally use a lexicon of modern words to filter the modernisation hypotheses. We show that both methods produce significantly better results than the baselines, but their accuracy, and which of the two works best, correlate strongly with the age of the texts, so the choice of the best method depends on the properties of the historical language to be modernised. As an extrinsic evaluation, we also compare the quality of part-of-speech tagging and lemmatisation applied directly to historical text and to its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.
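
The three shallow steps named in the abstract (character-level formatting of words, cognate matching to obtain noisy word pairs, and lexicon filtering of hypotheses) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.7, the first-match filtering rule, and all function names are hypothetical choices, and the word forms are toy examples rather than entries from the paper's lexicons. The CSMT system itself, which would be trained on the character-tokenised pairs, is not shown.

```python
from difflib import SequenceMatcher


def to_char_tokens(word):
    """Write a word as space-separated characters, so that an SMT
    toolkit treats each character as one token."""
    return " ".join(word)


def extract_noisy_pairs(historical_words, modern_words, threshold=0.7):
    """Pair each historical word with its most similar modern word.

    Assumption: similarity is difflib's ratio; the abstract does not
    say which cognate-matching measure the paper uses.
    """
    pairs = []
    for hist in historical_words:
        best, best_score = None, 0.0
        for mod in modern_words:
            score = SequenceMatcher(None, hist, mod).ratio()
            if score > best_score:
                best, best_score = mod, score
        # Keep the pair only if the best candidate is similar enough.
        if best is not None and best_score >= threshold:
            pairs.append((hist, best))
    return pairs


def filter_hypotheses(hypotheses, modern_lexicon):
    """Return the first ranked hypothesis attested in the modern lexicon,
    falling back to the top-ranked hypothesis (assumed filtering rule)."""
    for hyp in hypotheses:
        if hyp in modern_lexicon:
            return hyp
    return hypotheses[0]


# Toy usage with illustrative word forms.
modern = ["človek", "beseda", "jezik"]
noisy_pairs = extract_noisy_pairs(["zhlovek"], modern)
training_lines = [(to_char_tokens(h), to_char_tokens(m)) for h, m in noisy_pairs]
print(training_lines)
print(filter_hypotheses(["zlovek", "človek"], set(modern)))
```

In the supervised setting the noisy-pair extraction step would be skipped and the manually annotated word pairs would be character-tokenised directly.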

