
Extracting hierarchical features of cultural variation using network-based clustering

Published online by Cambridge University Press:  02 May 2022

Xiran Liu
Affiliation:
Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California, USA
Noah A. Rosenberg*
Affiliation:
Department of Biology, Stanford University, Stanford, California, USA
Gili Greenbaum*
Affiliation:
Department of Ecology, Evolution and Behavior, The Hebrew University of Jerusalem, Jerusalem, Israel
*Corresponding author. E-mail: noahr@stanford.edu; gil.g@mail.huji.ac.il

Abstract

High-dimensional datasets on cultural characters contribute to uncovering insights about factors that influence cultural evolution. Because cultural variation in part reflects descent processes with a hierarchical structure – including the descent of populations and vertical transmission of cultural traits – methods designed for hierarchically structured data have potential to find applications in the analysis of cultural variation. We adapt a network-based hierarchical clustering method for use in analysing cultural variation. Given a set of entities, the method constructs a similarity network, hierarchically depicting community structure among them. We illustrate the approach using four datasets: pronunciation variation in the US mid-Atlantic region, folklore variation in worldwide cultures, phonemic variation across worldwide languages and temporal variation in first names in the US. In these examples, the method provides insights into processes that affect cultural variation, uncovering geographic and other influences on observed patterns and cultural characters that make important contributions to them.
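The abstract describes the method only at a high level: build a similarity network over the entities, then read off a hierarchy of communities. As a rough illustration of that idea, and not the authors' algorithm, the toy sketch below assumes that the hierarchy is obtained by pruning edges at increasing similarity thresholds, with connected components standing in for a proper community-detection step. The function names, the threshold list and the nested-dictionary output are all hypothetical.

```python
def components(nodes, sim, threshold):
    """Connected components among `nodes`, keeping only edges whose
    similarity exceeds `threshold` (a stand-in for community detection)."""
    nodes = list(nodes)
    seen = set()
    comps = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        comp = []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in nodes:
                if v not in seen and sim[u][v] > threshold:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps


def build_hierarchy(nodes, sim, thresholds):
    """Recursively split `nodes` at the first threshold that separates them.

    Returns a nested dict: {"members": [...], "children": [...]}.
    `thresholds` is assumed to be sorted in increasing order.
    """
    tree = {"members": sorted(nodes), "children": []}
    for i, t in enumerate(thresholds):
        comps = components(nodes, sim, t)
        if len(comps) > 1:
            # Each component becomes a child, refined with the remaining thresholds.
            tree["children"] = [
                build_hierarchy(c, sim, thresholds[i + 1:]) for c in comps
            ]
            break
    return tree


# Toy similarity matrix (made-up values): entities 0,1 are close, as are 2,3.
SIM = [
    [1.0, 0.9, 0.2, 0.2],
    [0.9, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
TREE = build_hierarchy(range(4), SIM, [0.1, 0.5, 0.85, 0.95])
```

In this toy example the first split separates entities {0, 1} from {2, 3}, and higher thresholds then split each pair. A real analysis would replace the connected-components step with a community-detection algorithm and would need a principled choice of thresholds.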

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Figure 1. Hierarchical features of variation in English pronunciation in the middle and south Atlantic region of the US. (a) Hierarchical tree of the pronunciation similarity network. Informants are marked on the map by the colour of the finest-scale cluster to which they belong. (b, c) Two major clusters detected at the first level of the hierarchy in (a), each re-coloured with the full colour interval. (d, e) Two finer-scale clusters of the hierarchy in (c). In these panels, colours are assigned based on placement in the area of the hierarchy circled in (c), with all descendants of a child in the circled area assigned the same colour. The colours in (d) correspond to 1/4 and 3/4 on the unit interval, and the colours in (e) correspond to 1/14, 3/14, 5/14, 7/14, 9/14, 11/14 and 13/14. For convenience, the child clusters associated with specific internal nodes in the tree diagrams are numbered. Birth-year distributions of informants in these child clusters appear on the right.
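The colour positions quoted in the caption (1/4 and 3/4 for two children; 1/14, 3/14, ..., 13/14 for seven) are consistent with placing each of k child clusters at the midpoint of one of k equal sub-intervals of the unit interval. The caption gives only the values, so the general rule below is an inference, and the function name is hypothetical.

```python
from fractions import Fraction


def child_colour_positions(k):
    """Midpoints of k equal sub-intervals of [0, 1]: (2i + 1) / (2k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [Fraction(2 * i + 1, 2 * k) for i in range(k)]


TWO = child_colour_positions(2)    # positions for the two children in (d)
SEVEN = child_colour_positions(7)  # positions for the seven children in (e)
```

Exact fractions are used so the positions can be compared directly with the values in the caption; converting with `float()` gives coordinates that could be passed to a colour map.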


Figure 2. Hierarchical features of variation in folklore motifs across cultures. (a) Hierarchical tree of the motif similarity network. Regions are marked on the map by the colour of the finest-scale cluster to which they belong. (b) Distributions of normalized mutual information (NMI) between hierarchies extracted from sampled subsets of motifs and from all motifs, with 100 subsets of 20, 50, 100 and 500 motifs each. (c) Geographic distributions of five motifs that occur most frequently in the 200 of 5,000 subsets of 20 motifs that produce hierarchies with highest NMI to the hierarchy produced by all motifs. These motifs drive the hierarchy at higher levels, separating regions into major clusters. (d) Distributions of NMI between the leaves of hierarchies extracted from subsets of motifs (those from b) and the leaves extracted from all motifs. (e) Geographic distributions of three motifs that occur most frequently in the 200 subsets that produce hierarchies whose leaf clusters produce highest NMI to those produced by all motifs. These motifs are more specific to the hierarchy in lower levels and potentially capture fine-scale regional differences.
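For readers unfamiliar with NMI, the sketch below computes it for two flat partitions of the same items, such as the leaf clusters compared in panel (d). It assumes the common normalization by the arithmetic mean of the two entropies; the caption does not say which normalization the article uses, and comparing full hierarchies, as in panel (b), may require a hierarchical variant that is not shown here. The function name is hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items.

    Uses NMI = 2 * I(A; B) / (H(A) + H(B)), one of several normalizations.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Entropies of each partition.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())

    # Mutual information from the joint cluster counts.
    mi = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )

    if h_a + h_b == 0:
        # Both partitions put everything in one cluster: treat as identical.
        return 1.0
    return 2 * mi / (h_a + h_b)
```

NMI is invariant to relabeling of clusters, so two partitions that group the items identically score 1 even if their cluster labels differ, while statistically independent partitions score 0.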


Figure 3. Hierarchical features of phonemic variation. (a) Hierarchical tree of the phoneme similarity network. Major branches that contain most of the languages are assigned distinct colours, and other branches are coloured grey. (b) Language map. Languages are marked by the colour of the finest-scale cluster to which they belong. Three regions are magnified: (c) Northeast Asia; (d) northeastern Siberia; and (e) East Africa.


Figure 4. Hierarchical features of time series for frequencies of female names (a–d) and male names (e–h). (a, e) Hierarchical tree of similarity in time series for name frequencies. Major branches are assigned distinct colours. Time series of annual national frequencies appear below the trees, with two names selected from each major branch highlighted. The node area is proportional to the number of names in a cluster, except that clusters containing greater than 25 names are set to a fixed size and are coloured half-transparently. (b, f) Recoding of the hierarchies in (a, e) by states of highest frequency. Each cluster shows a pie chart tabulating the states in which names in the cluster have the highest frequency. Time series of name frequencies appear below the hierarchies. (c, g) Recoding of the hierarchies in (a, e) by regions of highest normalized frequency. The states are grouped into four regions: West (AK, AZ, CA, CO, HI, ID, MT, NM, NV, OR, UT, WA, WY), Midwest (IA, IL, IN, KS, MI, MN, MO, ND, NE, OH, SD, WI), South (AL, AR, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV) and Northeast (CT, MA, ME, NH, NJ, NY, PA, RI, VT). The normalized frequency of a name in a region is the count of the name in the region normalized by the total number of individuals in the region. The steps to obtain the states of highest frequency and regions of highest normalized frequency are described in the Methods section. (d, h) Recoding of the hierarchies in (a, e) by mean number of syllables of names in clusters.
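The normalized frequency defined in the caption can be illustrated with a small sketch. The state-to-region grouping is taken from the caption; the input layout (dictionaries of per-state counts), the function names and the toy numbers are assumptions made for illustration, and the article's actual procedure is the one described in its Methods section.

```python
REGIONS = {
    "West": "AK AZ CA CO HI ID MT NM NV OR UT WA WY".split(),
    "Midwest": "IA IL IN KS MI MN MO ND NE OH SD WI".split(),
    "South": "AL AR DC DE FL GA KY LA MD MS NC OK SC TN TX VA WV".split(),
    "Northeast": "CT MA ME NH NJ NY PA RI VT".split(),
}
STATE_TO_REGION = {s: r for r, states in REGIONS.items() for s in states}


def region_normalized_frequency(name_counts, total_counts):
    """Count of a name in each region divided by the region's total count.

    Both arguments map state codes to counts (hypothetical layout).
    Regions with a zero total are omitted.
    """
    name_by_region = dict.fromkeys(REGIONS, 0)
    total_by_region = dict.fromkeys(REGIONS, 0)
    for state, count in name_counts.items():
        name_by_region[STATE_TO_REGION[state]] += count
    for state, count in total_counts.items():
        total_by_region[STATE_TO_REGION[state]] += count
    return {
        region: name_by_region[region] / total_by_region[region]
        for region in REGIONS
        if total_by_region[region] > 0
    }


def region_of_highest_normalized_frequency(name_counts, total_counts):
    """Region in which the name has the highest normalized frequency."""
    freqs = region_normalized_frequency(name_counts, total_counts)
    return max(freqs, key=freqs.get)


# Made-up toy counts: the name is most common in CA by raw count,
# but the Northeast has the highest frequency after normalization.
NAME = {"CA": 30, "NY": 10, "TX": 5}
TOTAL = {"CA": 1000, "NY": 200, "TX": 500, "OH": 300}
TOP = region_of_highest_normalized_frequency(NAME, TOTAL)
```

Normalizing by regional totals keeps populous regions from dominating simply because they record more births overall.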

Supplementary material

Liu et al. supplementary material 1 (PDF, 923.8 KB)
Liu et al. supplementary material 2 (File, 351 KB)
Liu et al. supplementary material 3 (File, 9 MB)