
Benchmarks and goals

Published online by Cambridge University Press:  10 August 2020

Kenneth Ward Church*
Affiliation:
Baidu, Sunnyvale, CA 94089, USA

Abstract

Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark, and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC). Many KGC algorithms have been proposed, such as Trans[DEHRM] (TransD, TransE, TransH, TransR, and TransM), but it remains to be seen how this work improves WordNet coverage. Given how much work is based on these benchmarks, the literature should have more to say than it does about the connection between benchmarks and goals. Is optimizing P@10 on WN18RR likely to produce more complete knowledge graphs? Is MUSE likely to improve Machine Translation?
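To make the metric in the last question concrete, P@k (often called hits@k in the KGC literature) is simply the fraction of test queries whose gold answer appears among the top-k ranked candidates. A minimal sketch; the function name and data layout below are hypothetical, not taken from any of the benchmarks above:

```python
def precision_at_k(ranked_candidates, gold, k=10):
    """Fraction of queries whose gold answer appears among the top-k
    ranked candidates (hits@k).  ranked_candidates maps each query to
    a list of candidates sorted best-first; gold maps each query to
    the single correct answer."""
    if not ranked_candidates:
        return 0.0
    hits = sum(1 for q, cands in ranked_candidates.items()
               if gold.get(q) in cands[:k])
    return hits / len(ranked_candidates)

# Toy example: one hit in the top 2 out of two queries -> 0.5.
ranked = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold = {"q1": "b", "q2": "w"}
p_at_2 = precision_at_k(ranked, gold, k=2)
```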

Information

Type
Emerging Trends
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2020. Published by Cambridge University Press

Table 1. Fan-out for 168k English words in MUSE, most of which (115k) are disconnected islands that back-translate to themselves (and nothing else). The majority of the rest back-translate to five or more words
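The fan-out and island counts in Table 1 can be computed by composing a dictionary with its reverse: a word is a disconnected island if it back-translates only to itself. A minimal sketch, assuming toy MUSE-style en→fr and fr→en dictionaries (the entries below are invented for illustration):

```python
def back_translations(en2fr, fr2en):
    """For each English word, the set of English words reached by
    translating to French and back (including the word itself)."""
    result = {}
    for en, frs in en2fr.items():
        back = set()
        for fr in frs:
            back.update(fr2en.get(fr, ()))
        result[en] = back
    return result

# Hypothetical dictionary fragments.
en2fr = {"dog": {"chien"}, "cat": {"chat"},
         "sofa": {"canape"}, "couch": {"canape"}}
fr2en = {"chien": {"dog"}, "chat": {"cat"},
         "canape": {"sofa", "couch"}}

bt = back_translations(en2fr, fr2en)
# Islands back-translate to themselves and nothing else.
islands = [w for w, back in bt.items() if back == {w}]
```

Here "dog" and "cat" are islands, while "sofa" and "couch" are connected through "canape" and have fan-out 2.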

Table 2. Five synsets in two (of 29) languages

Table 3. Vocabulary sizes (excluding disconnected islands). Numbers are larger for WordNet than MUSE, suggesting WordNet has better coverage. Numbers are also larger for English (en) than other languages, suggesting both WordNet and MUSE have better coverage of English than other languages

Table 4. WordNet (WN) has more French glosses than MUSE

Table 5. WordNet has more French glosses than MUSE. Each cell, i, j, counts the number of words with i glosses in WordNet and j glosses in MUSE
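A contingency table like the one described in Table 5 can be built with a small counter over the vocabulary. A sketch under the assumption that glosses are stored as word → list-of-glosses maps (the entries below are made up for illustration):

```python
from collections import Counter

def gloss_count_table(wn_glosses, muse_glosses, vocab):
    """Cell (i, j) counts the words with i glosses in WordNet
    and j glosses in MUSE."""
    table = Counter()
    for w in vocab:
        i = len(wn_glosses.get(w, ()))
        j = len(muse_glosses.get(w, ()))
        table[(i, j)] += 1
    return table

# Hypothetical gloss inventories for three English words.
wn = {"bank": ["banque", "rive"], "dog": ["chien"]}
muse = {"bank": ["banque"], "dog": ["chien"]}
table = gloss_count_table(wn, muse, ["bank", "dog", "cat"])
```

Mass above the diagonal (i > j) corresponds to words where WordNet offers more glosses than MUSE.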

Table 6. Interaction between word sense (English) and glosses (French)

Table 7. Sizes of WordNet tables

Table 8. MLI (monolingual lexicon induction) results for a challenging test set of 10k PubMed terms (ranks 50k–60k). P@1 is disappointing when embeddings are trained on relevant data (and worse when trained on irrelevant data)

Figure 1. BLI technology is more effective for high-frequency words. Accuracy is better for high rank (top), large $score_1$ (middle) and large gap between $score_1$ and $score_2$ (bottom).
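The confidence signals in Figure 1 ($score_1$ and the gap $score_1 - score_2$) can be read off the top of a candidate ranking. A minimal sketch, assuming each source word comes with a dict mapping candidate translations to similarity scores (the scores below are invented):

```python
def top_two_gap(scores):
    """Given candidate -> similarity score for one source word,
    return (best candidate, score_1, score_1 - score_2).  A large
    gap between the top two scores is a rough signal that the
    top-ranked translation is correct."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best, s1, s1 - s2

best, s1, gap = top_two_gap({"chien": 0.9, "chat": 0.4, "loup": 0.3})
```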