The automatic identification of lexical variation between language varieties


Languages are not uniform. Speakers of different language varieties use certain words differently – more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for the investigation of such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of two different language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. Second, we address the problem of automatically identifying words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models are able to identify more lectal markers than traditional keyword methods. Distributional models also have a bias towards a different type of variation. In summary, our results demonstrate how distributional semantics can help research in variational linguistics, with possible future applications in lexicography or terminology extraction.
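
As a rough illustration of the first research question, the sketch below retrieves the nearest Netherlandic Dutch neighbour of a Belgian Dutch word by comparing simple co-occurrence vectors with cosine similarity. It is a minimal toy example and not the models used in the article: the function names, the window size, the raw-count weighting and the invented sentences (built around the pair camion / vrachtwagen) are all assumptions made for illustration. It relies on the two varieties sharing most of their context words, so that vectors built from separate corpora can be compared directly.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_in_other_variety(target, source_vecs, candidate_vecs, top_n=3):
    """Rank words of the other variety by similarity to `target`'s context vector."""
    target_vec = source_vecs[target]
    scored = [
        (cosine(target_vec, vec), word)
        for word, vec in candidate_vecs.items()
        if word != target
    ]
    return sorted(scored, reverse=True)[:top_n]


# Invented toy corpora (hypothetical data, for illustration only).
belgian = [
    ["de", "camion", "rijdt", "op", "de", "snelweg"],
    ["de", "chauffeur", "laadt", "de", "camion"],
]
netherlandic = [
    ["de", "vrachtwagen", "rijdt", "op", "de", "snelweg"],
    ["de", "chauffeur", "laadt", "de", "vrachtwagen"],
    ["de", "fiets", "staat", "in", "de", "schuur"],
]

belgian_vecs = context_vectors(belgian)
netherlandic_vecs = context_vectors(netherlandic)
neighbours = nearest_in_other_variety("camion", belgian_vecs, netherlandic_vecs)
print(neighbours[0][1])
```

On real corpora one would typically replace the raw counts with an association weighting and use far larger context vocabularies; those choices are left open here.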


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering