Emerging trends: Subwords, seriously?

Published online by Cambridge University Press: 07 April 2020

Kenneth Ward Church*
Affiliation: Baidu, USA

Abstract

Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information-theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, "electroneutral" can be parsed as electron-eu-tral or electro-neutral, and "bidirectional" can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. The prefix bi- has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
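
The proposal in the abstract, preferring the parse with the fewest word pieces, can be implemented as a short dynamic program over split points. The sketch below is illustrative only: the toy vocabulary is hypothetical, and it ignores the "##" continuation markers used by real word-piece dictionaries.

def min_pieces_parse(word, vocab):
    """Return a parse of `word` into the fewest word pieces, or None.

    best[i] holds (piece_count, parse) for the prefix word[:i].
    """
    n = len(word)
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if best[j] is not None and piece in vocab:
                candidate = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or candidate[0] < best[i][0]:
                    best[i] = candidate
    return best[n][1] if best[n] is not None else None

# Toy vocabulary (hypothetical): both parses of "bidirectional" are
# available, but the two-piece parse wins under the fewest-pieces criterion.
vocab = {"bi", "directional", "bid", "ire", "ction", "al"}
print(min_pieces_parse("bidirectional", vocab))  # ['bi', 'directional']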

Information

Type: Emerging Trends
Licence: Creative Commons Attribution (CC BY 4.0)
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s) 2020

Table 1. Some examples of the BERT/ERNIE tokenizer
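
Examples like those in Table 1 can be reproduced in a few lines of Python. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the exact pieces depend on that checkpoint's vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for word in ["electroneutral", "bidirectional"]:
    # The article reports parses such as electron-eu-tral and
    # bid-ire-ction-al, i.e., more pieces than the natural split.
    print(word, "->", tokenizer.tokenize(word))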


Table 2. The analysis of x + s and x + ed reflects frequency. The more frequent form is more likely to be in the dictionary. Regular inflection is relatively safe, but every split is risky, as illustrated by the surprising analysis for "mediates"


Table 3. The analysis of hypo-x should be similar to hyper-x. Hypertension and hypotension, for example, mean high blood pressure and low blood pressure, respectively. Unfortunately, the BERT/ERNIE tokenizer splits many of these words into too many pieces, making it difficult to see the similarity


Table 4. Rare words are split more than frequent words, but too many words are split more than necessary


Table 5. Since every split is risky, it is better to use as few word pieces as possible, especially for frequent words. Most words are in the dictionary (83% by token), but 5% are split into three or more pieces, and 2% are split into five or more pieces
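
Token-level percentages like those in Table 5 can be estimated by tokenizing a corpus and counting how many pieces each token becomes. The sketch below assumes the same bert-base-uncased tokenizer as above; corpus_words is a hypothetical iterable of whitespace-separated tokens.

from collections import Counter
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def piece_histogram(corpus_words):
    """Fractions of tokens left whole, split into 3+ pieces, and 5+ pieces."""
    counts = Counter(len(tokenizer.tokenize(w)) for w in corpus_words)
    total = sum(counts.values())  # assumes a non-empty corpus
    whole = counts[1] / total     # token is in the dictionary
    three_plus = sum(v for k, v in counts.items() if k >= 3) / total
    five_plus = sum(v for k, v in counts.items() if k >= 5) / total
    return whole, three_plus, five_plus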


Table 6. All splits are risky, but splits in the middle (compounding) are particularly risky