
Word2Vec

Published online by Cambridge University Press:  16 December 2016

KENNETH WARD CHURCH*
Affiliation:
IBM, Yorktown Heights, NY, USA e-mail: kwchurch@us.ibm.com

Abstract

My last column ended with some comments about Kuhn and word2vec. Word2vec has racked up plenty of citations because it satisfies both of Kuhn’s conditions for emerging trends: (1) a few initial (promising, if not convincing) successes that motivate early adopters (students) to do more, as well as (2) leaving plenty of room for early adopters to contribute and benefit by doing so. The fact that Google has so much to say on ‘How does word2vec work’ makes it clear that the definitive answer to that question has yet to be written. It also helps citation counts to distribute code and data to make it that much easier for the next generation to take advantage of the opportunities (and cite your work in the process).

Information

Type
Emerging Trends
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © Cambridge University Press 2016

Table 1. Top ten choices for x in man is to woman as king is to x. Although the top candidate, queen, is an impressive choice, many of the other top ten candidates are less impressive, especially the eight of ten candidates with incorrect gender/number. Candidates with larger hor similarities are more likely to inherit the desired gender and number features from woman. The overall score is close to hor + vert − diag, but not exactly, because vector length normalization doesn’t distribute over vector addition and subtraction
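The decomposition in the caption can be illustrated with a small sketch. The toy 4-d vectors below are made up for illustration (real word2vec embeddings have hundreds of dimensions); the point is only that the analogy score cos(x, king − man + woman) is close to, but not exactly, hor + vert − diag, since normalizing the combined vector changes its length.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length, as word2vec does before comparing."""
    return v / np.linalg.norm(v)

# Toy stand-ins for word2vec embeddings (illustration only, not real vectors)
man   = unit(np.array([1.0, 0.2, 0.1, 0.0]))
woman = unit(np.array([1.0, 0.9, 0.1, 0.0]))
king  = unit(np.array([0.2, 0.1, 1.0, 0.3]))
queen = unit(np.array([0.2, 0.8, 1.0, 0.3]))

# Overall analogy score for a candidate x: cosine with king - man + woman
x = queen
score = np.dot(x, unit(king - man + woman))

# The three component similarities from Table 1
hor  = np.dot(x, woman)  # horizontal: shares gender/number with woman
vert = np.dot(x, king)   # vertical: shares royalty with king
diag = np.dot(x, man)    # diagonal: penalized overlap with man

# score tracks hor + vert - diag, but is not equal to it, because
# length normalization does not distribute over + and -
print(score, hor + vert - diag)
```

With these toy vectors, queen also outscores man, woman and king as a candidate, mirroring the table's top choice.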


Table 2. Some types of analogies are easier than others, as indicated by accuracies for top choice (A1), as well as top 2 (A2), top 10 (A10) and top 20 (A20). The rows are sorted by A1. These analogies and the type classification come from the questions-words test set, except for the last row, SAT questions. SAT questions are harder than questions-words
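The A1/A2/A10/A20 accuracies in the table are top-k accuracies: the fraction of questions whose correct answer appears among the system's top k candidates. A minimal sketch, with made-up ranked candidate lists standing in for real system output:

```python
def accuracy_at_k(ranked_lists, answers, k):
    """Fraction of questions whose answer is among the top k candidates."""
    hits = sum(ans in ranked[:k] for ranked, ans in zip(ranked_lists, answers))
    return hits / len(answers)

# Hypothetical ranked candidates for three analogy questions (toy data)
ranked = [["queen", "monarch", "princess"],
          ["paris", "lyon", "london"],
          ["slower", "slowest", "slow"]]
answers = ["queen", "paris", "slowest"]

print(accuracy_at_k(ranked, answers, 1))  # 2/3: 'slowest' is only ranked 2nd
print(accuracy_at_k(ranked, answers, 2))  # 1.0: all answers within top 2
```

Loosening k can only raise the score, which is why A20 ≥ A10 ≥ A2 ≥ A1 in every row.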


Table 3. The two test sets have very different Venn diagrams. The 178 means that SAT has 178 words that appear in position a, but not in b, c or d. The 14 means that there are 14 words in the overlap between b and d (and not a and c). SAT is more like what I was expecting, with small overlaps. Since the vocabulary is much larger than the test set, it is unlikely that the same word appears in multiple positions
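The Venn counts in the caption can be computed with basic set operations over the four word positions of each analogy question a:b::c:d. A sketch with two toy questions (the real test sets are much larger):

```python
# Toy analogy questions of the form a : b :: c : d (illustration only)
questions = [("man", "woman", "king", "queen"),
             ("paris", "france", "london", "england")]

# Collect the vocabulary of each position across all questions
a, b, c, d = (set(pos) for pos in zip(*questions))

# Words occurring in position a but never in b, c or d (the '178' cell)
only_a = a - (b | c | d)

# Words in the overlap of b and d, but not a or c (the '14' cell)
b_and_d = (b & d) - (a | c)

print(len(only_a), len(b_and_d))
```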


Figure 1. Six boxplots comparing Vert, Hor and Diag. The three columns use three measures of similarity: (1) word2vec distance, (2) domain space and (3) function space. The top row uses SAT questions, and the bottom row uses questions-words. This plot is based on just $\overline{x}$ similarities, though the plot would not change much if we replaced $\overline{x}$ similarities with x similarities.