Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue

  • Nikola Ljubešić, Maja Miličević Petrović and Tanja Samardžić

Abstract

In this paper we deal with the spatial distribution of 16 linguistic features known to vary between Bosnian, Croatian, Montenegrin, and Serbian. We perform our analyses on a dataset of geo-encoded Twitter status messages collected in the period from mid-2013 to the end of 2016. We perform two types of analyses. The first one finds boundaries in the spatial distribution of the linguistic variable levels through the kernel density estimation smoothing technique. These boundaries are then plotted over the state borders for a visual comparison. The second analysis deals with linguistic distance between the states. The groupings of linguistic variables and countries are calculated given the state borders and the Jensen-Shannon divergence between distributions of the 16 variables within each state. This analysis is completed with a measure of variable consistency for each country. These analyses are intended to show the extent to which current state borders correspond to linguistic boundaries. They suggest that Croatia and Serbia still represent the two extremes, reflecting a history of normative divergences, while Bosnia-Herzegovina and Montenegro, depending on the variable, lean to one or the other side.
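As a rough illustration of the second analysis, the Jensen-Shannon divergence between the distributions of one variable's levels in two states can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the base-2 logarithm, and the counts are assumptions made for the example, and the abstract does not specify how the divergences for the 16 variables are combined.

```python
import math


def normalize(counts):
    """Turn raw counts of variable levels into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of levels")
    # Mixture distribution: the midpoint of p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical counts of two levels of a single variable in two states.
state_a = normalize([90, 10])
state_b = normalize([20, 80])
print(js_divergence(state_a, state_b))
```

With a base-2 logarithm the divergence is bounded between 0 (identical distributions) and 1 (distributions with no shared levels), which makes values comparable across variables with different numbers of levels.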

Corresponding author

*Address for correspondence: Nikola Ljubešić, Jožef Stefan Institute, Ljubljana, Slovenia; University of Zagreb, Zagreb, Croatia, nikola.ljubesic@ijs.si
