
Morphosyntactic probing of multilingual BERT models

Published online by Cambridge University Press:  25 May 2023

Judit Acs*
Affiliation:
Informatics Laboratory, ELKH Institute for Computer Science and Control (SZTAKI), Budapest, Hungary; Department of Automation and Applied Informatics, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Endre Hamerlik
Affiliation:
Informatics Laboratory, ELKH Institute for Computer Science and Control (SZTAKI), Budapest, Hungary; Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
Roy Schwartz
Affiliation:
School of Computer Science and Engineering, Hebrew University of Jerusalem, Jerusalem, Israel
Noah A. Smith
Affiliation:
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA; Allen Institute for Artificial Intelligence, Seattle, WA, USA
Andras Kornai
Affiliation:
Informatics Laboratory, ELKH Institute for Computer Science and Control (SZTAKI), Budapest, Hungary; Department of Algebra, Faculty of Natural Sciences, Budapest University of Technology and Economics, Budapest, Hungary
*Corresponding author: Judit Acs; Email: acsjudit@sztaki.hu

Abstract

We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), derived from the Universal Dependencies treebanks. Each task pairs a sentence containing a target word with a morphological tag as the desired label. We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain strong performance across these tasks. We then apply two methods to locate, for each probing task, where the disambiguating information resides in the input: the first is a new perturbation method that “masks” various parts of the context; the second is the classical method of Shapley values. The most intriguing finding is a strong tendency for the preceding context to hold more information relevant to the prediction than the following context.
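To make the task format concrete, the following is a minimal sketch of one example from a probing task such as $\langle$English, NOUN, Number$\rangle$. The class and field names are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical representation of a single probing example; the schema
# is an assumption for illustration only.
from dataclasses import dataclass

@dataclass
class ProbingExample:
    sentence: list[str]  # tokenized sentence from a UD treebank
    target_index: int    # position of the word being probed
    label: str           # gold value of the morphological tag

# One example for the task <English, NOUN, Number>:
example = ProbingExample(
    sentence=["The", "dogs", "bark", "."],
    target_index=1,   # the target word "dogs"
    label="Plur",     # Number=Plur in the UD annotation
)
```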

Information

Type: Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Table 1. List of languages and the number of tasks in each language.

Figure 1. Number of tasks by language family.

Figure 2. Probing architecture. Input is tokenized into wordpieces, and a weighted sum of the mBERT layers taken on the last wordpiece of the target word is used for classification by an MLP. Only the MLP parameters and the layer weights $w_i$ are trained. $\mathbf{x}_i$ is the output vector of the $i$th layer, $w_i$ is the learned layer weight. The example task here is $\langle$English, NOUN, Number$\rangle$.
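Read as code, the architecture in Figure 2 amounts to a learned scalar mix over the frozen encoder's layer outputs, followed by a small classifier. The PyTorch sketch below is an assumption-laden illustration: the softmax normalization of the layer weights and the MLP hidden size are guesses, not the paper's exact parameterization.

```python
# Minimal PyTorch sketch of the probe in Figure 2. Only `layer_weights`
# (the w_i) and `mlp` are trained; the mBERT encoder itself stays frozen.
import torch
import torch.nn as nn

class MorphProbe(nn.Module):
    def __init__(self, num_layers: int, hidden_size: int, num_classes: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # w_i
        self.mlp = nn.Sequential(          # hidden size is illustrative
            nn.Linear(hidden_size, 50),
            nn.ReLU(),
            nn.Linear(50, num_classes),
        )

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, hidden_size), the vectors x_i taken
        # at the last wordpiece of the target word.
        w = torch.softmax(self.layer_weights, dim=0)  # normalization assumed
        mixed = (w.unsqueeze(1) * layer_outputs).sum(dim=0)
        return self.mlp(mixed)
```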

Table 2. Average test accuracy over all languages by task and model.

Figure 3. Difference in accuracy between mBERT and chLSTM (left) and between XLM-RoBERTa and chLSTM (right), grouped by language family and morphological category. Gray cells represent missing tasks.

Figure 4. Difference in accuracy between mBERT and chLSTM (left) and between XLM-RoBERTa and chLSTM (right), grouped by language family and POS. Gray cells represent missing tasks.

Figure 5. Task-by-task difference between the MLMs and chLSTM in Slavic languages. Gray cells represent missing tasks.

Figure 6. Comparison of mBERT and XLM-RoBERTa by tag and by POS.

Figure 7. Comparison of mBERT and XLM-RoBERTa by language family.

Table 3. 10 hardest tasks.

Table 4. List of perturbation methods with examples.
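As a rough illustration of how such perturbations can be implemented, the sketch below masks or shuffles words around the target. The interpretation of each method (masking $n$ left-side, right-side, or both-side neighbors, masking the target itself, permuting the context) is an assumption based on the figure captions; Table 4 gives the authors' exact definitions.

```python
# Hedged sketch of context perturbations in the spirit of Table 4.
# The method semantics (left/right/both ~ l_n/r_n/b_n, targ, permute)
# are assumptions, not the authors' exact definitions.
import random

MASK = "[MASK]"

def perturb(tokens: list[str], target: int, method: str, n: int = 2) -> list[str]:
    out = list(tokens)
    if method == "targ":                       # mask the target word itself
        out[target] = MASK
    elif method in ("left", "right", "both"):  # mask n context words
        if method in ("left", "both"):
            for i in range(max(0, target - n), target):
                out[i] = MASK
        if method in ("right", "both"):
            for i in range(target + 1, min(len(out), target + 1 + n)):
                out[i] = MASK
    elif method == "permute":                  # shuffle the context words
        context = out[:target] + out[target + 1:]
        random.shuffle(context)
        out = context[:target] + [out[target]] + context[target:]
    return out
```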

Table 5. Perturbation results by model averaged over 247 tasks.

Figure 8. Test accuracy of the perturbed probes grouped by POS. The first group is the average of all 247 tasks. The first two bars in each group are the unperturbed probes’ accuracy.

Figure 9. The effect of context masking perturbations by tag. Error bars indicate the standard deviation.

Figure 10. The effect of context masking on case tasks grouped by language family. Error bars indicate the standard deviation.

Figure 11. The effect of targ and permute. Error bars indicate the standard deviation.

Figure 12. The effect of targ and permute by language family. Error bars indicate the standard deviation.

Figure 13. The pairwise Pearson correlation of perturbation effects between the two models.

Figure 14. The pairwise Pearson correlation of perturbation effects by model.

Figure 15. Co-occurrence counts for each language pair over 100 clustering runs. Languages are sorted by family and a line is added between families.

Figure 16. Shapley values by relative position to the probed target word. The values are averaged over the 247 tasks.
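For orientation, per-position Shapley values of this kind can be estimated with the standard Monte Carlo permutation estimator sketched below. Here `score` is a placeholder for the probe's probability of the gold label given a partially masked sentence; the authors' exact estimator may differ.

```python
# Generic Monte Carlo estimator of per-position Shapley values via
# permutation sampling. `score(tokens)` is a placeholder callable
# returning the probe's probability of the gold label.
import random

def shapley_by_position(tokens, score, num_samples=200):
    n = len(tokens)
    values = [0.0] * n
    for _ in range(num_samples):
        order = random.sample(range(n), n)   # random coalition order
        revealed = ["[MASK]"] * n
        prev = score(revealed)               # value of the empty coalition
        for pos in order:
            revealed[pos] = tokens[pos]      # add this position to the coalition
            cur = score(revealed)
            values[pos] += (cur - prev) / num_samples  # marginal contribution
            prev = cur
    return values
```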

Table 6. Summary of the Shapley values.

Figure 17. Least and most anomalous Shapley distributions. The first row shows the mean Shapley values of the 247 tasks and the 5 tasks closest to the mean distribution, that is, the least anomalous as measured by their distance from the mean (dfm) of the average Shapley values. The remaining rows show the most anomalous Shapley distributions in descending order. For each task, the dfm is listed in parentheses above the graph.

Figure 18. Shapley values in Indic tasks.

Figure 19. The average probing accuracy using different MLP variations. We indicate the size(s) of hidden layer(s) in square brackets.

Figure 20. The difference between probing a single layer and probing the weighted sum of layers. concat is the concatenation of all layers. 0 is the embedding layer. Large negative values on the y-axis mean that probing the particular layer on the x-axis is much worse than probing the weighted sum of all layers.

Table 7. Comparison of fine-tuned and frozen (feature extraction) models.

Table 8. Probing accuracy on the randomly initialized mBERT and XLM-RoBERTa models.

Figure 21. Random mBERT (light color) and random XLM-RoBERTa (darker color) performance comparison with different perturbation setups and the unperturbed trained model variants (orange bars). Left-to-right: Blue: Accuracy of the embedding and first layers’ probes; Green: Random models with pre-trained embedding layer: no perturbation, b$_{2}$, l$_{2}$, r$_{2}$, permute; Red: Random models where the embedding layer is random as well: no perturbation, b$_{2}$, l$_{2}$, r$_{2}$, permute; Orange: Unperturbed trained models.

Figure 22. Probing accuracy with reduced training data.

Figure 23. Layer weight outliers. Layer 0 is the embedding layer.

Figure 24. Shapley values by POS and model.

Figure 25. Shapley values by POS and model.

Figure 26. Shapley values in German tasks.