
Evaluating word embedding models: methods and experimental results

Published online by Cambridge University Press: 08 July 2019

Bin Wang*
Affiliation:
University of Southern California, Los Angeles, CA 90089, USA
Angela Wang
Affiliation:
University of California, Berkeley, Berkeley, CA 94720, USA
Fenxiao Chen
Affiliation:
University of Southern California, Los Angeles, CA 90089, USA
Yuncheng Wang
Affiliation:
University of Southern California, Los Angeles, CA 90089, USA
C.-C. Jay Kuo
Affiliation:
University of Southern California, Los Angeles, CA 90089, USA
* Corresponding author: Bin Wang, Email: bwang28c@gmail.com

Abstract

Extensive evaluation of a large number of word embedding models for language processing applications is conducted in this work. First, we introduce popular word embedding models and discuss desired properties of word models and evaluation methods (or evaluators). Then, we categorize evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independently of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models. It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks than others. Finally, we adopt correlation analysis to study the performance consistency of extrinsic and intrinsic evaluators.
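To make the intrinsic/extrinsic distinction concrete, the sketch below shows the core of a typical intrinsic evaluator, the word similarity test: cosine similarities computed from the embedding model are compared against human ratings via Spearman's rank correlation. This is a minimal illustration of the general technique, not the authors' code; the dict-like `embeddings` interface and the function names are our own assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_similarity_test(embeddings, word_pairs, human_scores):
    """Intrinsic evaluation via word similarity.

    embeddings   : dict mapping a word to its vector (hypothetical interface)
    word_pairs   : list of (word1, word2) tuples from a dataset such as WordSim-353
    human_scores : human-assigned similarity ratings for the same pairs

    Returns Spearman's rho between model and human similarity rankings.
    """
    model_scores = [cosine(embeddings[w1], embeddings[w2])
                    for (w1, w2) in word_pairs]
    rho, _p_value = spearmanr(model_scores, human_scores)
    return rho
```

An extrinsic evaluator, by contrast, would feed the same vectors into a downstream model (e.g., a POS tagger or sentiment classifier) and report that task's own metric.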

Information

Type
Overview Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2019
Table 1. Word similarity datasets used in our experiments, where "Pairs" indicates the number of word pairs in each dataset

Table 2. Performance comparison (×100) of six word embedding baseline models on 13 word similarity datasets

Table 3. Performance comparison (×100) of six word embedding baseline models on word analogy datasets

Table 4. Performance comparison (×100) of six word embedding baseline models on three concept categorization datasets

Table 5. Performance comparison of six word embedding baseline models on outlier detection datasets

Table 6. QVEC performance comparison (×100) of six word embedding baseline models

Table 7. Datasets for POS tagging, chunking, and NER

Table 8. Sentiment analysis datasets

Table 9. Extrinsic evaluation results

Fig. 1. Pearson's correlation between intrinsic and extrinsic evaluators, where the x-axis shows extrinsic evaluators and the y-axis shows intrinsic evaluators. Warm colors indicate positive correlation while cool colors indicate negative correlation.
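The correlation analysis behind Fig. 1 reduces, per heatmap cell, to Pearson's r between one intrinsic evaluator's scores and one extrinsic evaluator's scores across the six embedding models. A minimal sketch of that computation follows; the score vectors are placeholders of our own, not values from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder scores for the six embedding models under one intrinsic
# evaluator (e.g., a word similarity dataset) and one extrinsic
# evaluator (e.g., an NER F1 score); not values from the paper.
intrinsic_scores = np.array([0.65, 0.71, 0.58, 0.69, 0.73, 0.62])
extrinsic_scores = np.array([0.88, 0.90, 0.85, 0.89, 0.91, 0.86])

r, p = pearsonr(intrinsic_scores, extrinsic_scores)
print(f"Pearson's r = {r:.2f} (p = {p:.3f})")  # one cell of the Fig. 1 heatmap
```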