Comparing the semantic structures of lexicon of Mandarin and English

Published online by Cambridge University Press:  09 January 2025

Yi Yang*
Affiliation:
Quantitative Linguistics Lab, Seminar für Sprachwissenschaft, Eberhard Karl University of Tübingen, Tübingen, Germany
R. Harald Baayen
Affiliation:
Quantitative Linguistics Lab, Seminar für Sprachwissenschaft, Eberhard Karl University of Tübingen, Tübingen, Germany
*Corresponding author: Yi Yang; Email: heyangyihe@163.com

Abstract

This paper presents a cross-language study of lexical semantics within the framework of distributional semantics. We used a wide range of predefined semantic categories in Mandarin and English and compared the clusterings of these categories using FastText word embeddings. Three dimensionality reduction techniques were applied to map the 300-dimensional FastText vectors onto two-dimensional planes: multidimensional scaling (MDS), principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). The results show that t-SNE provides the clearest clustering of semantic categories, improving markedly on PCA and MDS. In both languages, we observed a similar differentiation between verbs, adjectives, and nouns, as well as between concrete and abstract words. In addition, the methods applied in this study, especially Procrustes analysis, make it possible to trace subtle differences in the structure of the semantic lexicons of Mandarin and English.
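As a rough illustration of the projection step described in the abstract, the sketch below maps 300-dimensional vectors to two dimensions with PCA and with classical (Torgerson) MDS, using only NumPy. The vectors are synthetic stand-ins for FastText embeddings, and the function names `pca_2d` and `classical_mds_2d` are hypothetical; nothing here reproduces the authors' own pipeline, and the distance measure they used for MDS may differ from the Euclidean distance assumed here. t-SNE is omitted because it requires an iterative optimizer that is normally taken from a library.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :2] * S[:2]                      # scores on PC1 and PC2


def classical_mds_2d(X):
    """Classical (Torgerson) MDS on Euclidean distances between rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:2]             # two largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


# Synthetic stand-in data (an assumption): 3 "categories" of 20 words each.
rng = np.random.default_rng(0)
centers = 3.0 * rng.normal(size=(3, 300))
X = np.vstack([c + rng.normal(size=(20, 300)) for c in centers])
labels = np.repeat(np.arange(3), 20)

pca_xy = pca_2d(X)
mds_xy = classical_mds_2d(X)
```

With Euclidean distances, classical MDS and PCA yield the same coordinates up to a sign flip per axis, so the two methods only diverge when MDS is run on a different dissimilarity measure.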

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Number of words for the partly culture-specific semantic categories used for Mandarin and English

Table 2. Classification models and prediction accuracy for held-out data

Figure 1. Clustering of Mandarin words belonging to 21 categories using MDS (dimensions 2 and 3). An interactive plot is available at https://quantling.org/plots/yangyi2024/mds23.cn.html.

Figure 2. Clustering of Mandarin words belonging to 21 categories using PCA. An interactive plot is available at https://quantling.org/plots/yangyi2024/pca13.cn.html.

Figure 3. Clustering of Mandarin words belonging to 21 categories using t-SNE. An interactive plot is available at https://quantling.org/plots/yangyi2024/tsne12.cn.html.

Figure 4. Clustering of English words belonging to 21 categories using MDS. An interactive plot is available at https://quantling.org/plots/yangyi2024/mds12.en.html.

Figure 5. Clustering of English words belonging to 21 categories using PCA. An interactive plot is available at https://quantling.org/plots/yangyi2024/pca12.en.html.

Figure 6. Clustering of English words belonging to 21 categories using t-SNE. An interactive plot is available at https://quantling.org/plots/yangyi2024/tsne12.en.html.

Figure 7. Scatterplots of the MDS, PCA, and t-SNE category centroids in Figures 1–6 for Mandarin (left panels) and English (right panels). MDS1 and PC2 are not shown because these dimensions reflect the distinction between traditional and simplified Chinese, as explained in Section 3.1.

Figure 8. Analyses of category centroids for Mandarin (left) and English (right). Upper panels: scatterplots for distance, based on MDS. Lower panels: networks for angle. Two vertices are connected when their cosine similarity exceeds the 7th decile of the distribution of cosine similarities.
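The edge rule in the Figure 8 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the centroids are random stand-ins rather than the paper's category centroids, the function name `cosine_network` is hypothetical, and the distribution of cosine similarities is assumed to be taken over all unordered pairs of distinct categories.

```python
import numpy as np


def cosine_network(centroids, decile=7):
    """Connect pairs of centroids whose cosine similarity exceeds a decile.

    Returns the similarity matrix, the threshold, and a list of edges (i, j)
    with i < j. The decile is computed over off-diagonal pairs (assumption).
    """
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T                          # pairwise cosine similarities
    iu = np.triu_indices(len(centroids), k=1)    # each unordered pair once
    threshold = np.quantile(sim[iu], decile / 10)
    edges = [(int(i), int(j)) for i, j in zip(*iu) if sim[i, j] > threshold]
    return sim, threshold, edges


# Random stand-ins for 21 category centroids in 300 dimensions (assumption).
rng = np.random.default_rng(1)
centroids = rng.normal(size=(21, 300))
sim, threshold, edges = cosine_network(centroids)
```

With 21 categories there are 210 pairs, so a 7th-decile cutoff keeps roughly the top 30% of pairs as edges.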

Figure 9. By-category residuals of Procrustes analyses of Mandarin and English, using distances (top panel) and cosine similarities (bottom panel). The dashed lines denote the first and third quartiles.
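By-category residuals of the kind plotted in Figure 9 could be obtained along the following lines. The sketch implements ordinary orthogonal Procrustes alignment (centering, unit-norm scaling, optimal rotation and scale) with NumPy and reports one residual per row. The matrices `A` and `B` are synthetic stand-ins, `procrustes_residuals` is a hypothetical name, and the authors' exact inputs and software are not specified here.

```python
import numpy as np


def procrustes_residuals(A, B):
    """Align B to A by orthogonal Procrustes; return per-row residual norms."""
    A0 = A - A.mean(axis=0)
    B0 = B - B.mean(axis=0)
    A0 = A0 / np.linalg.norm(A0)                 # scale both to unit norm
    B0 = B0 / np.linalg.norm(B0)
    U, S, Vt = np.linalg.svd(B0.T @ A0)
    R = U @ Vt                                   # optimal orthogonal map
    scale = S.sum()                              # optimal scaling for unit-norm B0
    B_fit = scale * (B0 @ R)
    return np.linalg.norm(A0 - B_fit, axis=1)    # one residual per category


# Synthetic stand-ins (assumption): 21 categories in a 5-dimensional space.
rng = np.random.default_rng(2)
A = rng.normal(size=(21, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # random orthogonal matrix

# B_exact is A rotated, rescaled, and shifted: residuals should vanish.
B_exact = 2.5 * (A @ Q) + 1.0
res_exact = procrustes_residuals(A, B_exact)

# A noisy copy gives non-zero residuals whose quartiles can be inspected.
B_noisy = B_exact + 0.3 * rng.normal(size=B_exact.shape)
res_noisy = procrustes_residuals(A, B_noisy)
q1, q3 = np.quantile(res_noisy, [0.25, 0.75])
```

Categories whose residual lies above the third quartile are the ones whose position differs most between the two configurations after alignment.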

Figure 10. t-SNE scatterplots of original English and rotated Mandarin semantic vectors. Upper panel: color coding by semantic category; lower panel: color coding by language.

Figure 11. Zoomed-in t-SNE plot highlighting the distribution of person nouns.

Figure 12. Zoomed-in t-SNE plot highlighting the distribution of food nouns.

Figure 13. Zoomed-in t-SNE plot highlighting the distribution of body nouns.

Figure B1. A comparison of t-SNE (top panel), UMAP (center panel), and PaCMAP (bottom panel) clusterings of Mandarin words.