
A statistical method for the identification and aggregation of regional linguistic variation

Jack Grieve, Dirk Speelman, and Dirk Geeraerts


This paper introduces a method for the analysis of regional linguistic variation. The method identifies individual and common patterns of spatial clustering in a set of linguistic variables measured over a set of locations based on a combination of three statistical techniques: spatial autocorrelation, factor analysis, and cluster analysis. To demonstrate how to apply this method, it is used to analyze regional variation in the values of 40 continuously measured, high-frequency lexical alternation variables in a 26-million-word corpus of letters to the editor representing 206 cities from across the United States.
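
As a rough illustration of the first of the three techniques, the sketch below computes global Moran's I, one widely used measure of spatial autocorrelation. The abstract does not say which statistic or weighting scheme the method uses, so the choice of Moran's I, the function name `morans_i`, the binary adjacency weights, and the four-location toy data are all assumptions made for this example and are not taken from the paper.

```python
def morans_i(values, weights):
    """Global Moran's I for one variable measured at n locations.

    values  : list of n measurements, one per location
    weights : n x n list of lists; weights[i][j] > 0 when locations
              i and j count as neighbours, with a zero diagonal
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                 # deviations from the mean
    w_total = sum(sum(row) for row in weights)     # sum of all weights
    # Weighted cross-products of deviations over all pairs of locations.
    num = sum(weights[i][j] * z[i] * z[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in z)
    return (n / w_total) * (num / den)


# Toy data invented for illustration: four locations on a line,
# where each location neighbours only the locations adjacent to it.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, 5.0, 5.0]     # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]   # neighbouring values always differ

i_clustered = morans_i(clustered, adjacency)
i_alternating = morans_i(alternating, adjacency)
print(i_clustered, i_alternating)
```

With this toy data the clustered variable gives a positive value (about 0.33) and the alternating variable gives -1, which matches the usual reading of the statistic: positive values indicate that nearby locations tend to have similar values, negative values that they tend to differ. In a setting like the one the abstract describes, the same calculation would presumably be repeated for each linguistic variable over the full set of locations, with weights derived from the distances between them.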



Supplementary materials: Grieve supplementary material, PDF (42.2 MB)

