Emerging trends: Subwords, seriously?

Abstract Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, “electroneutral” can be parsed as electron-eu-tral or as electro-neutral, and “bidirectional” can be parsed as bid-ire-ction-al or as bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify this proposal, we consider a number of criteria: sound, meaning, etc. The prefix bi- has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).


Desiderata
Subwords/word pieces have become quite popular recently, especially for deep nets. They are used in the front end of BERT (Devlin et al. 2018) and ERNIE (Sun et al. 2019), two very successful deep nets for language applications. BERT provides the following motivation for word pieces: "Using wordpieces gives a good balance between the flexibility of single characters and the efficiency of full words for decoding, and also sidesteps the need for special treatment of unknown words." (Devlin et al. 2018) Subwords are based on byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016), which borrows ideas from information theory to learn a dictionary of word pieces from a training corpus. Word pieces are being used for a variety of applications: speech (Schuster and Nakajima 2012), translation (Wu et al. 2016), as well as tasks in the GLUE benchmark (Wang et al. 2018), such as sentiment, paraphrase, and coreference. Many of these papers are massively cited (more than one thousand citations in Google Scholar).
Some examples of the BERT/ERNIE tokenizer are shown in Tables 1, 2, and 3. These tokenizers are intended to be used on text similar to what they were trained on (often Wikipedia and newswire), but many of the examples in this paper are selected from something very different, to challenge tokenization with lots of out of vocabulary (OOV) words. We collected a small sample of 10k medical abstracts (1.9M words) from PubMed abstracts. More than 30M abstracts are available for download. Medical abstracts are rich in technical terminology (OOVs). Tables 1, 2, and 3 are surprising. Consider "electron-eu-tral" and "electro-neutral." BPE is more about training (how to learn a dictionary of word pieces) than inference (how to parse an OOV into a sequence of word pieces). In this case, the parse is ambiguous. How do we choose between "electron-eu-tral" and "electro-neutral"? We suggest minimizing the number of word pieces.
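The proposed criterion is easy to state operationally: among all ways of covering an OOV with dictionary word pieces, prefer the parse with the fewest pieces. A minimal sketch, using dynamic programming over a small hypothetical vocabulary (not the real BERT lexicon):

```python
# Sketch: pick the parse with the fewest word pieces, via dynamic
# programming. The vocabulary below is a toy illustration, not the
# actual BERT/ERNIE word-piece dictionary.

def min_pieces_parse(word, vocab):
    """Return a parse of `word` using the fewest pieces from `vocab`,
    or None if no parse exists."""
    n = len(word)
    # best[i] = shortest list of pieces covering word[:i]
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in vocab and best[j] is not None:
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"electro", "neutral", "electron", "eu", "tral",
         "bi", "directional", "bid", "ire", "ction", "al"}
print(min_pieces_parse("electroneutral", vocab))  # ['electro', 'neutral']
print(min_pieces_parse("bidirectional", vocab))   # ['bi', 'directional']
```

Even though the toy vocabulary contains both analyses, minimizing the number of pieces selects "electro-neutral" (2 pieces) over "electron-eu-tral" (3 pieces).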

The examples in Tables 1, 2, and 3 raise a number of engineering and linguistic issues. BPE considers letter statistics, but not risk (variance), sound, meaning, etymology, etc. Many of these other factors are considered important for morphological analysis by various communities for various purposes.

1. Engineering considerations:
(a) Maximize coverage and minimize splits: Since every split is risky, it is better to use as few word pieces as necessary, especially for frequent words. It should be possible to represent most (frequent) words with one or two word pieces, and almost no word should require more than three.
(b) Avoid risky splits: Infixes (word pieces in the middle) are more risky than prefixes and suffixes (word pieces at the ends). Short word pieces are more risky than long word pieces. (Thirty-five percent of the PubMed corpus, by token, makes use of a one- or two-letter word piece; these one- and two-letter pieces cover most of the possibilities: all 26 one-letter sequences and 421 of the 26² = 676 possible two-letter sequences.) Splits near the middle of words are more risky than splits near the ends. Overlapping splits such as "telephone − phone + phony" are safer than simple concatenation (especially for carefully chosen pairs of affixes like "phone" and "phony").
(c) Stability: Similar words should share similar analyses. Small changes should not change the results much.
2. Linguistic considerations:
(a) Capturing relevant generalizations: Morphological analyses should make it easy to identify related words: "bidirectional" and "bidimensional" share a common prefix, with similar sound and meaning (and history); "bidirectional" and "unidirectional" share all but the prefix.
(b) Sound: Word pieces should support grapheme to phoneme conversion.
(i) "bidirectional" and "bidimensional" start with the prefix "bi-" with a long vowel (not "bid-" with a short vowel).
(ii) "unidirectional" starts with the prefix "uni-" (not "un-"); again, the two prefixes have different vowels.
(iii) "ction" is unlikely to be a morpheme because English syllables do not start with "ct."
(iv) Avoid splitting digraphs like "ph" across different word pieces (as in "telep-hony").
(c) Meaning: "bi" is from the Latin word for two, unlike "bid," which means something else ("offer") and has a different etymology (Germanic; see https://www.etymonline.com/word/bid). Similarly, "uni" is from the Latin word for one, unlike the Germanic "un," which means something else ("not" for adjectives, or "to do in the reverse direction" for verbs; see https://www.etymonline.com/search?q=un-).

Table 2 (caption): The analysis of x + s and x + ed reflects frequency. The more frequent form is more likely to be in the dictionary. Regular inflection is relatively safe, but every split is risky, as illustrated by the surprising analysis for "mediates."
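The sound desideratum above lends itself to a simple mechanical check: reject (or penalize) any parse whose piece boundary cuts a digraph. A minimal sketch, with an illustrative digraph list that is our assumption, not part of BPE or BERT:

```python
# Sketch: flag parses that split a digraph such as "ph" across two
# word pieces (desideratum 2(b) above). The digraph list is a small
# illustrative set, not an exhaustive inventory of English digraphs.

DIGRAPHS = {"ph", "th", "ch", "sh", "ck"}

def splits_digraph(pieces):
    """True if any boundary between consecutive pieces cuts a digraph."""
    for left, right in zip(pieces, pieces[1:]):
        if left and right and (left[-1] + right[0]).lower() in DIGRAPHS:
            return True
    return False

print(splits_digraph(["telep", "hony"]))  # True: the boundary cuts "ph"
print(splits_digraph(["tele", "phony"]))  # False
```

A tokenizer could use such a check as a veto, or as one term in a scoring function alongside the piece-count criterion.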

Maximize coverage and minimize splits
As suggested above, it should be possible to represent frequent words with one or perhaps two word pieces. Almost no word should require more than three word pieces. Table 1 shows a number of examples such as "neurotransmitter" where the BERT/ERNIE tokenizer violates this limit of three word pieces. When this happens, we believe there is almost always a better alternative analysis. Tables 4 and 5 report coverage by type and by token. Rare words are split more than frequent words, but too many words are split more than necessary. Compare the top line in Table 4 (more frequent words) to the other lines (less frequent words): the top line has relatively more mass in the first few columns, indicating that more frequent words are split into fewer pieces. That said, there are far too many splits. Hardly any words should require more than three pieces, but 30% (by type) and 5% (by token) have more than three word pieces.
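The type/token distinction behind Tables 4 and 5 is easy to compute: count each word once for the type distribution, and weight it by corpus frequency for the token distribution. A minimal sketch, with a toy corpus and a fake fixed-width tokenizer standing in for the PubMed sample and the BERT tokenizer:

```python
# Sketch: tabulate how many words need 1, 2, 3, ... pieces, by type
# and by token (cf. Tables 4 and 5). The corpus counts and tokenizer
# below are toy stand-ins, not the PubMed data or the BERT tokenizer.
from collections import Counter

def coverage(corpus_counts, tokenize):
    """corpus_counts: {word: frequency}. Returns (by_type, by_token)
    Counters keyed by number of word pieces."""
    by_type, by_token = Counter(), Counter()
    for word, freq in corpus_counts.items():
        n = len(tokenize(word))
        by_type[n] += 1       # each distinct word counts once
        by_token[n] += freq   # weighted by corpus frequency
    return by_type, by_token

# Fake tokenizer: split every 4 characters (illustration only).
toy_tokenize = lambda w: [w[i:i + 4] for i in range(0, len(w), 4)]

by_type, by_token = coverage(
    {"cell": 100, "membrane": 10, "electroneutral": 1}, toy_tokenize)
print(dict(by_type))   # {1: 1, 2: 1, 4: 1}
print(dict(by_token))  # {1: 100, 2: 10, 4: 1}
```

With real data, the claim in the text corresponds to checking how much mass falls above n = 3 in each distribution.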

Risky business
Every split is risky, but some splits are more risky than others. Table 2 shows a number of examples of regular inflection. This is one of the safer splits, but even in this case, "media-tes" is surprising.
In Coker, Church, and Liberman (1991), we evaluated splitting processes for use in the Bell Labs speech synthesizer. We found that splits near the middle of a word are more risky than splits toward the end. (Among other things, splits in the middle are more likely to split digraphs such as "ph," as in "telep-hony.") Many of the PubMed terms entered the language starting with the scientific enlightenment (at least 500 years after the Norman Invasion), when it was fashionable to coin new terms based on a "revival" of Greek and Latin. The word potassium entered the language relatively recently (1807). These new words tend to separate Greek and Latin, but not always. My first employer, AT&T, underwent a number of reorganizations over the 20 years that I was there. One of them introduced an interesting new word, "trivest," when AT&T split itself into three parts, soon after "divestment." This is a misanalysis of "divestment," where "di-" is from Latin (meaning "away from") and not the Greek "two." BERT's analyses of these words are surprising: dive-st, tri-ves-t, dive-st-ment, and tri-ves-tment.
AT&T used to be called American Telephone and Telegraph, but they changed their name to AT&T because the telegraph technology (and even the word) does not have much of a future. Interestingly, though, all three words (American, Telephone and Telegraph) are in the BERT lexicon. One might have expected the BERT lexicon to include frequent words with a future, and exclude infrequent words, especially those without a future.

Conclusions
Subwords are extremely popular. Many of the papers mentioned here are massively cited. BPE provides a simple information theoretic method for sidestepping OOVs. The method is currently being used for a wide range of applications in speech, translation, GLUE, etc.
That said, it is easy to find surprising analyses such as "electron-eu-tral." If we introduce an additional constraint, minimize the number of word pieces, then we produce the more natural analysis: "electro-neutral." While the information theoretic BPE criterion sounds attractive to engineers, our field should make room for additional perspectives. Linguists are taught that sound and meaning are better sources of evidence than spelling. This is not an unreasonable position. We should be concerned by the fact that BPE often produces analyses with the wrong meaning and the wrong sound (wrong vowel, splitting digraphs). Such analyses have obvious implications for grapheme to phoneme conversion. For other applications, modern deep nets are so powerful that they can often overcome such issues in preprocessing, but even so, if we can avoid such issues with simple suggestions such as minimizing the number of word pieces, we should do so.