Modernising historical Slovene words

YVES SCHERRER; TOMAŽ ERJAVEC

doi:10.1017/S1351324915000236

Published online by Cambridge University Press: 03 August 2015

YVES SCHERRER: LATL-CUI, Université de Genève, 7 route de Drize, 1227 Carouge, Switzerland. E-mail: yves.scherrer@unige.ch
TOMAŽ ERJAVEC: Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia. E-mail: tomaz.erjavec@ijs.si

Abstract

We propose a language-independent word normalisation method and illustrate it by modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from them, which contain historical word–modern word pairs. The two lexicons are disjoint: one serves as the training set, with 40,000 entries, and the other as the test set, with 20,000 entries. The data spans the years 1750–1900, and the lexicons are split into fifty-year slices, with all experiments carried out separately on the three time periods. We perform two sets of experiments. In the first, a supervised setting, we build a CSMT system using the lexicon of word pairs as training data. In the second, an unsupervised setting, we simulate a scenario in which word pairs are not available. For this setting we propose a two-step method: we first extract a noisy list of word pairs by matching historical words with cognate modern words, and then train a CSMT system on these pairs. In both sets of experiments, we also optionally use a lexicon of modern words to filter the modernisation hypotheses. We show that both methods produce significantly better results than the baselines. However, their accuracy, and which of the two works best, correlate strongly with the age of the texts, so the choice of method depends on the properties of the historical language to be modernised. As an extrinsic evaluation, we also compare the quality of part-of-speech tagging and lemmatisation applied directly to historical text with the quality obtained on its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.
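The three ingredients named in the abstract (character-level input for an SMT system, a noisy pair list obtained by cognate matching, and filtering of hypotheses against a modern lexicon) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.7 and the fallback to the top-ranked hypothesis are all hypothetical choices, and the example strings are toy data chosen for illustration rather than entries from the lexicons described above.

```python
from difflib import SequenceMatcher


def to_char_tokens(word):
    """Split a word into space-separated characters, so that a word-based
    SMT toolkit would treat each character as a token."""
    return " ".join(word)


def from_char_tokens(tokens):
    """Undo to_char_tokens: join the characters back into one word."""
    return tokens.replace(" ", "")


def similarity(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a cognate measure."""
    return SequenceMatcher(None, a, b).ratio()


def noisy_pairs(historical_words, modern_words, threshold=0.7):
    """Pair each historical word with its most similar modern word,
    keeping the pair only if the similarity reaches the threshold.
    The threshold value is an assumption, not taken from the article."""
    pairs = []
    for hist in historical_words:
        best = max(modern_words, key=lambda modern: similarity(hist, modern))
        if similarity(hist, best) >= threshold:
            pairs.append((hist, best))
    return pairs


def filter_hypotheses(hypotheses, modern_lexicon):
    """Return the best-ranked hypothesis found in the modern lexicon.
    Falling back to the top hypothesis is an assumed policy.
    Expects a non-empty list ordered from best to worst."""
    for hyp in hypotheses:
        if hyp in modern_lexicon:
            return hyp
    return hypotheses[0]


if __name__ == "__main__":
    # Toy data for illustration only.
    modern = ["človek", "voda"]
    print(to_char_tokens("zhlovek"))
    print(noisy_pairs(["zhlovek", "xyz"], modern))
    print(filter_hypotheses(["clovek", "človek"], set(modern)))
```

In a pipeline of this shape, the output of `noisy_pairs` (or a manually annotated lexicon in the supervised case) would be converted with `to_char_tokens` and passed to whichever SMT toolkit is used for training; the n-best output of that system would then go through `from_char_tokens` and `filter_hypotheses`.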

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 6, November 2016, pp. 881–905

DOI: https://doi.org/10.1017/S1351324915000236
Copyright: © Cambridge University Press 2015
