Abstract The paper illustrates the results of a correlation study focusing on linguistic variation in an Italian region, Tuscany. By exploiting a multilevel representation scheme of dialectal data, the study analyses attested patterns of phonetic and morpho-lexical variation with the aim of testing the degree of correlation between a) phonetic and morpho-lexical variation, and b) linguistic variation and geographic distance. The correlation analysis was performed by combining two complementary approaches proposed in the dialectometric literature, namely by computing both global and place-specific correlation measures and by inspecting their spatial distribution. The results show that phonetic and morpho-lexical variation in Tuscany follows a different pattern from the one encountered in previous studies.
It is a well-known fact that different types of features contribute to the linguistic distance between any two locations, which can differ for instance with respect to the word used to denote the same object or the phonetic realisation of a particular word. Yet the correlation between different feature types in defining patterns of dialectal variation remains a largely unexplored area of research. In traditional dialectology, there is no obvious way to approach this matter beyond fairly superficial and impressionistic observations. The situation changes if the same research question is addressed in the framework of dialectometric studies, where it is possible to measure dialectal distances with respect to distinct linguistic levels and to compute whether and to what extent observed distances correlate.
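The operation described here — testing whether distances measured at two linguistic levels co-vary — reduces, in its simplest form, to computing a correlation coefficient over paired distance values. A minimal stdlib-only sketch; the function name and the toy data are our own illustration, not taken from the study (and note that significance testing over pairwise distances usually requires a permutation procedure such as the Mantel test, since the pairs are not independent):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of
    pairwise distances (e.g. phonetic vs. morpho-lexical)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# Toy example: two perfectly co-varying distance series.
print(pearson_r([0.1, 0.2, 0.4], [0.2, 0.4, 0.8]))  # ≈ 1.0
```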
Abstract In the present investigation we aim to determine to which degree various linguistic factors contribute to the intelligibility of Swedish words among Danes. We correlated the results of an experiment on word intelligibility with eleven linguistic factors and carried out logistic regression analyses. In the experiment, the intelligibility of 384 frequent Swedish words was tested among Danish listeners via the Internet. The choice of eleven linguistic factors was motivated by their contribution to intelligibility in earlier studies. The strongest effect was the negative correlation between word intelligibility and phonetic distance. Word length, differences in syllable number, foreign sounds, neighbourhood density, word frequency, orthography, and the absence in Swedish of the prosodic phenomenon of ‘stød’ also contribute significantly to intelligibility. Although the results thus show that linguistic factors contribute to the intelligibility of single words, the amount of explained variance was not very large (R2 (Cox and Snell) = .16, R2 (Nagelkerke) = .21) when compared with earlier studies which were based on aggregate intelligibility. The lower scores result partly from the logistic regression model used; logistic regression was necessary in our study because the intelligibility scores were coded as a binary variable. Additionally, we attribute the lower correlation to the higher number of idiosyncrasies of single words compared with the aggregate intelligibility and linguistic distance used in earlier studies. Based on observations in the actual data from the intelligibility experiment, we suggest further steps to be taken to improve the predictability of word intelligibility.
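The two pseudo-R2 statistics reported above can be computed from the log-likelihoods of the fitted logistic model and a null (intercept-only) model, using the standard Cox & Snell and Nagelkerke definitions. A small sketch; the log-likelihood values in the example are invented for illustration and are not taken from the study:

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Cox & Snell and Nagelkerke pseudo-R^2 from the log-likelihoods
    of an intercept-only model and the full logistic model on n cases."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    max_cs = 1.0 - math.exp((2.0 / n) * ll_null)  # upper bound of Cox & Snell
    return cox_snell, cox_snell / max_cs          # Nagelkerke rescales to [0, 1]

# Hypothetical log-likelihoods for a binary outcome over 384 items.
cs, nag = pseudo_r2(ll_null=-250.0, ll_model=-215.0, n=384)
```

Nagelkerke's statistic is always at least as large as Cox & Snell's, which matches the ordering of the two values reported in the abstract.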
Abstract In this paper the role of concept characteristics in lexical dialectometric research is examined in three consecutive logical steps. First, a regression analysis of data taken from a large lexical database of Limburgish dialects in Belgium and The Netherlands is conducted to illustrate that concept characteristics such as concept salience, concept vagueness and negative affect contribute to the lexical heterogeneity in the dialect data. Next, it is shown that the relationship between concept characteristics and lexical heterogeneity influences the results of conventional lexical dialectometric measurements. Finally, a dialectometric procedure is proposed which downplays this undesired influence, thus making it possible to obtain a clearer picture of the ‘truly’ regional variation. More specifically, a lexical dialectometric method is proposed in which concept characteristics form the basis of a weighting schema that determines to which extent concept specific dissimilarities can contribute to the aggregate dissimilarities between locations.
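The weighting idea in the final step can be sketched as follows: each concept contributes its dissimilarity between two locations in proportion to a weight derived from its characteristics, so that vague or low-salience concepts are downplayed. The function, the weights and the values below are hypothetical illustrations, not the paper's actual schema:

```python
def aggregate_dissimilarity(concept_dissims, concept_weights):
    """Weighted aggregate dissimilarity between two locations.
    concept_dissims: per-concept dissimilarity values in [0, 1];
    concept_weights: per-concept weights, e.g. downweighting concepts
    that are vague, non-salient, or carry negative affect."""
    total = sum(concept_weights)
    return sum(d * w for d, w in zip(concept_dissims, concept_weights)) / total

# Two concepts: a salient one (weight 1.0) and a vague one (weight 0.2).
print(aggregate_dissimilarity([0.8, 0.1], [1.0, 0.2]))  # ≈ 0.68
```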
BACKGROUND AND RESEARCH QUESTIONS
An important assumption underlying most if not all methods of dialectometry is that the automated analysis of the differences in language use between different locations, as they are recorded by dialectologists in large scale surveys, can reveal patterns which directly reflect regional variation. In this paper, in which we focus on lexical variation, we want to address one factor, viz. concept characteristics, which we will claim complicates this picture.
The argumentation which underlies our claim consists of three consecutive logical steps. As a first step, we analyse data taken from a large lexical database of Limburgish dialects in Belgium and The Netherlands, in which we more particularly zoom in on the names for concepts in the field of ‘the human body’.
Abstract In this experimental study, we aim to arrive at a global picture of the mutual intelligibility of various Dutch language varieties by carrying out a computer-controlled lexical decision task in which ten target varieties are evaluated – the Belgian and Netherlandic Dutch standard language as well as four regional varieties of both countries. We auditorily presented real as well as pseudo-words in various varieties of Dutch to Netherlandic and Belgian test subjects, who were asked to decide as quickly as possible whether the items were existing Dutch words or not. The experiment's working assumption is that the faster the subjects react, the better the intelligibility of (the language variety of) the word concerned.
When speakers of different languages or language varieties communicate with each other, one group (generally the economically and culturally weaker one) often switches to the language or language variety of the other, or both groups of speakers adopt a third, common lingua franca. However, if the languages or language varieties are so much alike that the degree of mutual comprehension is sufficiently high, both groups of speakers might opt for communicating in their own language variety.
This type of interaction between closely related language varieties, which Haugen (1966) coins semicommunication and Braunmüller and Zeevaert (2001) refer to as receptive multilingualism, has been investigated between speakers of native Indian languages in the United States (Pierce, 1952), between Spaniards and Portuguese (Jensen, 1989), between speakers of Scandinavian languages (Zeevaert, 2004; Gooskens, 2006; Delsing, 2007) and between Slovaks and Czechs (Budovičová, 1987).
Abstract The present paper investigates to what extent subjects base their judgments of linguistic distances on actual dialect data presented in a listening experiment and to what extent they make use of previous knowledge of the dialects when making their judgments. The point of departure for our investigation was the distances between 15 Norwegian dialects as perceived by Norwegian listeners. We correlated these perceptual distances with objective phonetic distances measured on the basis of the transcriptions of the recordings used in the perception experiment. In addition, we correlated the perceptual distances with objective distances based on other datasets. On the basis of the correlation results and multiple regression analyses we conclude that the listeners did not base their judgments solely on information that they heard during the experiments but also on their general knowledge of the dialects. This conclusion is confirmed by the fact that the effect is stronger for the group of listeners who recognised the dialects than for listeners who did not recognise the dialects on the tape.
To what extent do subjects base their judgment of linguistic distances between dialects on what they really hear, i.e. on the linguistic phenomena available in the speech signal, and to what degree do they generalise from the knowledge that they have from previous confrontations with the dialects? This is the central question of the investigation described in this paper. The answer to this question is important to scholars who want to understand how dialect speakers perceive dialect pronunciation differences, and may give more insight into the mechanisms behind the way in which linguistic variation is experienced.
This is the report of a panel discussion held in connection with the special session on computational methods in dialectology at Methods XIII: Methods in Dialectology on 5 August, 2008 at the University of Leeds. We scheduled this panel discussion in order to reflect on what the introduction of computational methods has meant to our subfield of linguistics, dialectology (in alternative divisions of linguistic subfields also known as variationist linguistics), and whether the dialectologists' experience is typical of such introductions in other humanities studies. Let us emphasise that we approach the question as working scientists and scholars in the humanities rather than as methodology experts or as historians or philosophers of science. That is, we wished to reflect on how the introduction of computational methods has gone in our own field in order to conduct our own future research more effectively, or alternatively, to suggest to colleagues in neighbouring disciplines which aspects of computational studies have been successful, which have not been, and which might have been introduced more effectively. Since we explicitly wished to reflect not only on how things have gone in dialectology, but also to compare our experiences to others, we invited panellists with broad experience in linguistics and other fields.
We introduce the chair and panellists briefly.
John Nerbonne chaired the panel discussion. He works on dialectology, but also on grammar, and on applications such as language learning and information extraction and information access. He works in Groningen, and is past president of the Association for Computational Linguistics (2002).
Abstract The research reported in this paper departs from most previous work in dialectometry in several ways. Empirically, it draws on frequency vectors derived from naturalistic corpus data and not on discrete atlas classifications. Linguistically, it is concerned with morphosyntactic (as opposed to lexical or pronunciational) variability. Methodologically, it marries the careful analysis of dialect phenomena in authentic, naturalistic texts to aggregational-dialectometrical techniques. Two research questions guide the investigation: First, on methodological grounds, is corpus-based dialectometry viable at all? Second, to what extent is morphosyntactic variation in nonstandard British dialects patterned geographically? By way of validation, findings will be matched against previous work on the dialect geography of Great Britain.
The overarching aim in this study is to provide a methodological sketch of how to blend philologically responsible corpus-based research with aggregational-dialectometrical analysis techniques. The bulk of previous research in dialectometry has focussed on phonology and lexis (however, for work on Dutch dialect syntax see Spruit, 2005, 2006, 2008; Spruit et al., t.a.). Moreover, orthodox dialectometry draws on linguistic atlas classifications as its primary data source. The present study departs from these traditions in several ways. It endeavours, first, to measure aggregate morphosyntactic distances and similarities between traditional dialects in the British Isles. Second, the present study does not rely on atlas data but on frequency information deriving from a careful analysis of language use in authentic, naturalistic texts. This is another way of saying that the aggregate analysis in this paper is frequency-based, an approach that contrasts with atlas-based dialectometry, which essentially relies on categorical input data.
Abstract Component models such as factor analysis can be used to analyse spatial distributions of a large number of different features – for instance the isogloss data in a dialect atlas, or the distributions of ethnological or archaeological phenomena – with the goal of finding dialects or similar cultural aggregates. However, there are several such methods, and it is not obvious how their differences affect their usability for computational dialectology. We attempt to tackle this question by comparing five such methods using two different dialectological data sets. There are some fundamental differences between these methods, and some of these have implications that affect the dialectological interpretation of the results.
Languages are traditionally subdivided into geographically distinct dialects, although any such division is just a coarse approximation of a more fine-grained variation. This underlying variation is usually visualised in the form of maps, where the distribution of various features is shown as isoglosses. It is possible to view dialectal regions, in this paper also called simply dialects, as combinations of the distribution areas of these features, where the features have been weighted in such a way that the differences between the resulting dialects are as sharp as possible. Ideally, dialect borders are drawn where several isoglosses overlap.
As more and more dialectological data is available in electronic form, it is becoming increasingly attractive to apply computational methods to this problem. One way to do this is to use clustering methods (e.g. Kaufman and Rousseeuw, 1990), especially as such methods have been used in dialectometric studies (e.g. Heeringa and Nerbonne, 2002; Moisl and Jones, 2005).
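As a concrete illustration of the clustering approach mentioned above, here is a minimal agglomerative clusterer with single linkage over a precomputed distance matrix. This is a teaching sketch under our own simplifying assumptions; real dialectometric work would use a tested library implementation and compare several linkage criteria:

```python
def single_linkage(dist, k):
    """Agglomerative clustering with single linkage over a symmetric
    distance matrix (list of lists); merges until k clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        best = None  # (distance, cluster index i, cluster index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters

# Four sites: 0/1 are close to each other, as are 2/3.
d = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
print(single_linkage(d, 2))  # two clusters: {0, 1} and {2, 3}
```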
Abstract This paper describes a new method for quantifying the similarity of the lexical distribution of phonemes in different varieties of a language (in this case English). In addition to introducing the method, it discusses phonological problems which must be addressed if any comparison of this sort is to be attempted, and applies the method to a limited data set of varieties of English. Since the method assesses their structural similarity, it will be useful for analysing the historical development of varieties of English and the relationships (either as a result of common origin or of contact) that hold between them.
In recent years considerable progress has been made in assessing the relationships between linguistic varieties by measuring the similarity between strictly comparable sets of phonetic data. In particular, measurement of Levenshtein Distance (see, for example, Nerbonne, Heeringa, and Kleiweg, 1999; Nerbonne and Heeringa, 2001; Heeringa, 2004) has proved useful for determining the relationships between closely related varieties, and the ‘Sound Comparisons’ method for assessing the distance between varieties provides a very promising alternative technique for looking into the changing relationships between closely-related and not so closely-related varieties (Heggarty, McMahon and McMahon, 2005; McMahon, Heggarty, McMahon and Maguire, 2007).
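The Levenshtein Distance referred to here is the minimum number of insertions, deletions and substitutions needed to transform one transcription into another. A compact dynamic-programming sketch follows; in dialectometric practice the algorithm is typically applied to sequences of phonetic segments rather than orthographic characters, and the operations may carry graded weights:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    transforming sequence a into sequence b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```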
Phonetic comparison algorithms of this sort are not, however, without their problems. Firstly, they often depend upon auditory phonetic transcriptions of one degree of fineness or another, with all the associated issues of transcriber isoglosses, inaccuracies and realism that this method brings (see Milroy and Gordon, 2003: 144–152 for a discussion of the issues).
Two basic questions
In this paper we ask two questions, which superficially seem to ask the same thing but in actual fact do not. First, we ask to what degree two languages (or language varieties) A and B resemble each other. The second question is how well a listener of variety B understands a speaker of variety A.
When we ask to what degree two language varieties resemble one another, or how different they are (which is basically the same question), it should be clear that the answer cannot be expressed in a single number. Languages differ from each other not in just one dimension but in a great many respects. They may differ in their sound inventories, in the details of the sounds in the inventory, in their stress, tone and intonation systems, in their vocabularies, and in the way they build words from morphemes and sentences from words. Last, but not least, they may differ in the meanings they attach to the forms in the language, in so far as the forms in two languages may be related to each other. In order to express the distance between two languages, we need a weighted average of the component distances along each of the dimensions identified (and probably many more). So, linguistic distance is a multidimensional phenomenon and we have no a priori way of weighting the dimensions.
The answer to the question how well listener B understands speaker A can be expressed as a single number. If listener B does not understand speaker A at all, the number would be zero. If listener B gets every detail of speaker A's intentions, the score would be maximal.
Abstract In this study 91 local Swedish dialects were analysed based on vowel pronunciation. Acoustic measurements of vowel quality were made for 18 vowels of 1,014 speakers by means of principal component analysis of vowel spectra. Two principal components were extracted explaining more than ¾ of the total variance in the vowel spectra. Plotting vowels in the PC1-PC2 plane showed a solution with strong resemblance to vowels in a formant plane. Per-location averages of all speakers were calculated and factor analysis was run with the 91 locations as data cases and the two acoustic components of the 18 words as variables. Nine factors were extracted corresponding to distinct geographic distribution patterns. The factor scores of the analysis revealed co-occurrence of a number of linguistic features.
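The statement that two components explain more than ¾ of the total variance corresponds to the ratio of the two largest eigenvalues of the spectral covariance matrix to the sum of all eigenvalues. A small sketch; the eigenvalues in the example are invented purely for illustration:

```python
def explained_variance_ratio(eigenvalues, n_components=2):
    """Fraction of total variance captured by the n_components
    principal components with the largest eigenvalues."""
    top = sorted(eigenvalues, reverse=True)[:n_components]
    return sum(top) / sum(eigenvalues)

# Invented eigenvalues of a covariance matrix, for illustration only.
print(explained_variance_ratio([6.1, 2.2, 0.8, 0.5, 0.4]))  # ≈ 0.83
```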
The traditional method of identifying dialect areas has been the so-called isogloss method, where researchers choose some linguistic features that they find representative for the dialect areas and draw lines on maps based on different realisations of these features. One problem with the isogloss method is that isoglosses rarely coincide, and a second is that the choice of linguistic features is subjective and depends on what the researcher chooses to emphasise. Dialectometric research has been trying to avoid these problems by aggregating over large data sets and using more objective data-driven methods when determining dialect areas (Séguy, 1973; Goebl, 1982; Heeringa, 2004; Nerbonne, 2009).
The volume we are introducing here contains a selection of the papers presented at a special track on computational techniques for studying language variation held at The Thirteenth International Conference on Methods in Dialectology in Leeds on 4–5 August 2008. We are grateful to the conference organisers, Nigel Armstrong, Joan Beal, Fiona Douglas, Barry Heselwood, Susan Lacey, Ann Thompson, and Clive Upton for their cooperation in our organisation of the event. We likewise owe thanks to the referees of the present volume, who we are pleased to acknowledge explicitly: Agnes de Bie, Roberto Bolognesi, David Britain, Cynthia Clopper, Ken Decker, Anders Eriksson, Hans Goebl, Stefan Grondelaers, Carlos Gussenhoven, Nynke de Haas, Frans Hinskens, Angela Kluge, Gitte Kristiansen, Alexandra Lenz, Maarten Mous, Hermann Niebaum, Lisa Lena Opas-Hänninen, Petya Osenova, John Paolillo, Louis Pols, Helge Sandøy, Bob Shackleton, Felix Schaeffler, Marco Spruit, Rint Sybesma, Nathan Vaillette, Gunther de Vogelaer, and Esteve Valls.
The conference track consisted of 24 papers and posters, including a keynote address by Vincent van Heuven on phonetic techniques for studying variation and comprehensibility. Fourteen contributions were selected for publication in this special issue of the International Journal for Humanities and Arts Computing, including van Heuven's. In addition the conference track featured a panel session reflecting on the introduction of computational techniques to the study of language variation and more generally, on computing and the humanities. We have prepared a report on the panel session for publication here as well.
Abstract In this paper we relate linguistic, geographic and social distances to each other in order to get a better understanding of the impact the Dutch-German state border has had on the linguistic characteristics of a sub-area of the Kleverlandish dialect area. This area used to be a perfect dialect continuum. We test three models for explaining today's pattern of linguistic variation in the area. In each model another variable is used as the determinant of linguistic variation: geographic distance (continuum model), the state border (gap model) and social distance (social model). For the social model we use perceptual data for friends, relatives and shopping locations. Testing the three models reveals that nowadays the dialect variation in the research area is closely related to the existence of the state border and to the social structure of the area. The geographic spatial configuration hardly plays a role anymore.
The Dutch-German state border south of the river Rhine was established in 1830. Before that time, the administrative borders in this region frequently changed. The Kleverlandish dialect area, which extends from Duisburg in Germany to Nijmegen in The Netherlands, crosses the state border south of the Rhine. The area is demarcated by the Uerdingen line in the south, the diphthongisation line of the West Germanic ‘i’ in the west, and the border with the Low Saxon dialects of the Achterhoek area in the north-east. The geographic details of the area can be found in Figure 1 (the state border is depicted with a dashed-dotted line).
Abstract In the present investigation, the intelligibility of 17 Scandinavian language varieties and standard Danish was assessed among young Danes from Copenhagen. In addition, distances between standard Danish and each of the 17 varieties were measured at the lexical level and at different phonetic levels. In order to determine how well these linguistic levels can predict intelligibility, we correlated the intelligibility scores with the linguistic distances and we carried out a number of regression analyses. The results show that for this particular set of closely related language varieties phonetic distance is a better predictor of intelligibility than lexical distance. Consonant substitutions, vowel insertions and vowel shortenings contribute significantly to the prediction of intelligibility.
Gooskens (2007) correlated lexical and phonetic distances with mutual intelligibility scores for the Mainland Scandinavian standard languages, Danish, Norwegian and Swedish. Subjects from different places in Denmark, Norway and Sweden listened to the two standard languages spoken in the neighbouring countries and linguistic distances were measured between the language varieties of the listeners and the test languages. In total there were 18 mean intelligibility scores and 18 corresponding linguistic distances. The distances were measured at the two linguistic levels that are generally taken to be most important for mutual intelligibility in Scandinavia, namely the lexical and the phonetic level (Delsing and Lundin Åkesson, 2005; Torp, 1998). The results showed a high correlation between intelligibility scores and phonetic distances (r = −.80, p < .01).
We are pleased to launch the first of several special issues designed to highlight cutting-edge research, methods, applications, literature, and websites in key fields of humanities and arts computing. The current double issue on variationist linguistics and computational humanities is an exemplar of what we hope to accomplish, especially in shortening the time it takes for important papers to move from initial presentation to publication. It was prepared under the guest editorship of John Nerbonne, Professor of Humanities Computing, and Charlotte Gooskens, Associate Professor of Scandinavian Languages and Literature, both at the University of Groningen, The Netherlands; Sebastian Kürschner, who holds a tenure-track position (‘Juniorprofessur’) in variationist linguistics and language contact at the University of Erlangen-Nürnberg, Germany; and Renée van Bezooijen, Researcher at the University of Groningen, The Netherlands. This issue also introduces a roundtable discussion that we intend to become a regular feature of these special editions. The aim of the forum is to assess contributions to the field and link them to the broader interests of humanities and arts computing, as well as to highlight opportunities for connection and research within and among disciplines.
Over the next year, we will publish two additional thematic issues. Volume 3.1 will focus on humanities GIS. The past decade has witnessed an explosion of interest in the application of geo-spatial technologies to history, literature, and other arts and humanities disciplines. The special issue will highlight leading presentations from an August 2008 conference at the University of Essex and will include two new features – book reviews and website/tool reviews.
Abstract In this paper we apply various clustering algorithms to dialect pronunciation data. At the same time, we propose several evaluation techniques that should be used in order to deal with the instability of clustering techniques. The results show that three hierarchical clustering algorithms are not suitable for the data we are working with. The remaining tested algorithms successfully detected a two-way split of the data into Eastern and Western dialects. At the aggregate level used in this research, no further division of sites can be asserted with high confidence.
Dialectometry is a multidisciplinary field that uses various quantitative methods in the analysis of dialect data. Very often those techniques include classification algorithms, such as hierarchical clustering algorithms used to detect groups within a given dialect area. Although known for their instability (Jain and Dubes, 1988), clustering algorithms are often applied without evaluation (Goebl, 2007; Nerbonne and Siedle, 2005) or with only partial evaluation (Moisl and Jones, 2005). Very small differences in the input data can produce substantially different groupings of dialects (Nerbonne et al., 2008). Without proper evaluation, it is very hard to determine whether the results of the applied clustering technique are an artifact of the algorithm or the detection of real groups in the data.
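One simple way to quantify whether two clusterings of the same sites agree — for instance, a clustering of the original data against one obtained from slightly perturbed data — is a pairwise agreement score such as the Rand index. A minimal stdlib-only sketch of that idea; the evaluation techniques actually used in the paper may differ:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of site pairs on which two clusterings agree:
    both place the pair in the same cluster, or both place it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions up to label renaming score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

A clustering that changes substantially under small perturbations of the input (low agreement scores across perturbed runs) should not be trusted as evidence of real groups in the data.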
The aim of this paper is to evaluate algorithms used to detect groups among language dialect varieties measured at the aggregate level. The data used in this research are dialect pronunciation data consisting of various pronunciations of 156 words collected all over Bulgaria.