1. Introduction
Models for the distributed representation of words (word embeddings) have drawn great interest in recent years because of their ability to acquire syntactic and semantic information from large unannotated corpora (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a; Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014; Sun et al. Reference Sun, Guo, Lan, Xu and Cheng2016). Likewise, more and more ontologies have been compiled with high-quality lexical knowledge, including WordNet (Miller Reference Miller1998), Roget’s 21st Century Thesaurus (Roget) (Kipfer Reference Kipfer1993), and the paraphrase database (PPDB) (Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). Based on lexical knowledge, early linguistic approaches such as the Leacock–Chodorow similarity measure (Leacock and Chodorow Reference Leacock and Chodorow1998), the Lin similarity measure (Lin Reference Lin1998), and the Wu–Palmer similarity measure (Wu and Palmer Reference Wu and Palmer1994) have been proposed to compute semantic similarity. Although these linguistic resource-based approaches are somewhat logical and interpretable, they do not scale easily (in terms of vocabulary size). Furthermore, approaches based on modern neural networks outperform most linguistic resource-based approaches and better capture linear regularities between words.
While the recently proposed contextualized word representation models (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018; Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019; Radford et al. Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019) can produce different representations of a target word depending on its context, evidence shows that contextualized word representations may perform worse than static word embeddings on some semantic relatedness datasets (Ethayarajh Reference Ethayarajh2019). Moreover, these models do not readily incorporate the knowledge encoded in ontologies. In contrast, several studies have proposed models that combine word embeddings with lexical ontologies, using either joint training or post-processing (Yu and Dredze Reference Yu and Dredze2014; Faruqui et al. Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015). However, these word embedding models use only one vector to represent a word, which is problematic in natural language processing applications that require sense-level representations (e.g., word sense disambiguation and semantic relation identification). One way to take such polysemy and homonymy into account is to introduce sense-level embeddings, via either pre-processing (Iacobacci, Pilehvar, and Navigli Reference Iacobacci, Pilehvar and Navigli2015) or post-processing (Jauhar, Dyer, and Hovy Reference Jauhar, Dyer and Hovy2015).
In this work, we focus on a post-processing sense retrofitting model, GenSense (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018), which is a generalized sense embedding learning framework that retrofits a pre-trained word embedding (e.g., Word2Vec Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a, GloVe Pennington et al. Reference Pennington, Socher and Manning2014) with semantic relations between the senses, the relation strength, and the semantic strength.Footnote a GenSense, which generates low-dimensional sense embeddings, is inspired by the Retro sense model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015) but has three major differences. First, it generalizes semantic relations from positive relations (e.g., synonyms, hyponyms, paraphrasing Lin and Pantel Reference Lin and Pantel2001; Dolan, Quirk, and Brockett Reference Dolan, Quirk and Brockett2004; Quirk, Brockett, and Dolan Reference Quirk, Brockett and Dolan2004; Ganitkevitch, Van Durme, and Callison-Burch Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015) to both positive and negative relations (e.g., antonyms). Second, each relation incorporates both semantic strength and relation strength. Within a semantic relation, there should be a weighting for each semantic strength. For example, although jewel has the synonyms gem and rock, the similarity between (jewel, gem) is clearly higher than that between (jewel, rock); thus, a good model should assign a higher weight to (jewel, gem). Last, GenSense assigns different relation strengths to different relations. For example, if the objective is to train a sense embedding that distinguishes between positive and negative senses, then the weight for negative relations (e.g., antonyms) should be higher, and vice versa. Experimental results suggest that relation strengths play an important role in balancing relations and are application dependent. Given an objective function that takes these three parts into consideration, sense vectors can be learned and updated using a belief propagation process on the relation-constrained network. A constraint on the update formula is also considered using a threshold criterion.
Apart from the GenSense framework, some work suggests using a standardization process to improve the quality of vanilla word embeddings (Lee et al. Reference Lee, Ke, Huang and Chen2016). Thus, we propose a standardization process on GenSense’s embedding dimensions with four settings: (1) performing standardization after the entire iterative process (once); (2) performing standardization after every iteration (every time); (3) performing standardization before the sense retrofitting process (once); and (4) performing standardization before each iteration of the sense retrofitting process (every time). We also propose a sense neighbor expansion process from the nearest neighbors; this is added into the sense update formula to improve the quality of the sense embedding. Finally, we combine the standardization process and the neighbor expansion process in four different ways: (1) GenSense with neighbor expansion, followed by standardization (once); (2) GenSense with neighbor expansion, followed by standardization in each iteration (each time); (3) standardization and then retrofitting of the sense vectors with neighbor expansion (once); and (4) in each iteration, standardization and then retrofitting of the sense vectors with neighbor expansion (each time).
Though GenSense can retrofit sense vectors connected within a given ontology, words outside of the ontology are not learned. To address this issue, we introduce a bilingual mapping method (Mikolov, Le, and Sutskever Reference Mikolov, Le and Sutskever2013b; Xing et al. Reference Xing, Wang, Liu and Lin2015; Artetxe, Labaka, and Agirre Reference Artetxe, Labaka and Agirre2016; Smith et al. Reference Smith, Turban, Hamblin and Hammerla2017; Joulin et al. Reference Joulin, Bojanowski, Mikolov, Jégou and Grave2018) that utilizes Procrustes analysis to learn the mapping between the original word embedding and the learned sense embedding. The goal of orthogonal Procrustes analysis is to find a transformation matrix W such that the representations before sense retrofitting are close to the representations after sense retrofitting. After obtaining W, we can apply the transformation matrix to the senses that are not retrofitted.
In the experiments, we show that the GenSense model outperforms previous approaches on four types of datasets: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. In an experiment to evaluate the benefits yielded by the relation strength, we find an $87.7$% performance difference between the worst and the best cases on the WordSim-353 semantic relatedness benchmark dataset (Finkelstein et al. Reference Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin2002). While the generalized model which considers all the relations performs well in the semantic relatedness tasks, we also find that antonym relations perform particularly well in the semantic difference experiment. We also find that the proposed standardization, neighbor expansion, and the combination of these two processes improve performance on the semantic relatedness experiment.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes our proposed generalized sense retrofitting model. In Section 4, we present the Procrustes analysis on GenSense. Section 5 describes the details of the experiments. Results and discussion are presented in Section 6. In Section 7, we discuss the limitations of the model and point out some future directions. Finally, we conclude this research in Section 8.
2. Related work
The study of word representations has a long history. Early approaches include building a term–document co-occurrence matrix from a large corpus and then applying dimension reduction techniques such as singular value decomposition (Deerwester et al. Reference Deerwester, Dumais, Furnas, Landauer and Harshman1990; Bullinaria and Levy Reference Bullinaria and Levy2007). Beyond that, recent unsupervised word embedding approaches (sometimes referred to as corpus-based approaches) based on neural networks (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a; Pennington et al. Reference Pennington, Socher and Manning2014; Dragoni and Petrucci Reference Dragoni and Petrucci2017) have performed well on syntactic and semantic tasks. Among these, Word2Vec word embeddings were released using the continuous bag-of-words (CBOW) model and the skip-gram model (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a). The CBOW model predicts the center word using contextual words, while the skip-gram model predicts contextual words using the center word. GloVe, another widely adopted word embedding model, is a log-bilinear regression model that mitigates the drawbacks of global factorization approaches (such as latent semantic analysis; Deerwester et al. Reference Deerwester, Dumais, Furnas, Landauer and Harshman1990) and local context window approaches (such as the skip-gram model) on the word analogy and semantic relatedness tasks (Pennington et al. Reference Pennington, Socher and Manning2014). The global vectors in GloVe are trained using unsupervised learning on aggregated global word–word co-occurrence statistics from a corpus; this encodes the relationships between words and yields vectorized word representations that capture ratios of word–word co-occurrence probabilities. The objective functions of Word2Vec and GloVe are slightly different: Word2Vec utilizes negative sampling to make words that do not frequently co-occur more dissimilar, whereas GloVe uses a weighting function to adjust the word–word co-occurrence counts. To deal with the out-of-vocabulary (OOV) issue in word representations, FastText (Bojanowski et al. Reference Bojanowski, Grave, Joulin and Mikolov2017) is a more advanced model which leverages subword information. For example, the word asset with tri-grams is represented by the following character tri-grams: $\lt$ as, ass, sse, set, et $\gt$ . This technique not only resolves the OOV issue but also yields better representations for low-frequency words.
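As an illustration of the subword idea, the character n-gram decomposition can be sketched as follows. This is a simplified toy sketch in Python, not FastText's actual implementation (which also hashes n-grams into buckets, uses several n-gram lengths, and includes the full word itself).

```python
def char_ngrams(word, n=3):
    """Decompose a word into character n-grams with boundary markers,
    roughly as FastText does (simplified: no hashing, a single n)."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

# For "asset", this yields ['<as', 'ass', 'sse', 'set', 'et>'];
# the word vector is then built from the vectors of these n-grams.
print(char_ngrams("asset"))
```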
Apart from unsupervised word embedding learning models, there exist ontologies that contain lexical knowledge, such as WordNet (Miller Reference Miller1998), Roget’s 21st Century Thesaurus (Kipfer Reference Kipfer1993), and PPDB (Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). Although these ontologies are useful in some applications, different ontologies have different structures. In Roget, a synonym set contains all the words of the same sense and has its own definition. For example, the word free in Roget has at least two adjective senses: (free, complimentary, comp, unrecompensed) and (free, available, clear). The definition of the first sense (without charge) differs from that of the second sense (not busy; unoccupied). The relevance of the synonyms can differ within each set: the ranking in the first sense of free is (free, complimentary) > (free, comp) > (free, unrecompensed). PPDB is a massive automatically created resource of paraphrases of three types: lexical, phrasal, and syntactic (Ganitkevitch et al. Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). For each type, there are several sizes with different trade-offs between precision and recall. Each pair of words is semantically equivalent to some degree. For example, (automobile, car, auto, wagon, …) is listed in the coarsest size, while the finest size contains only (automobile, car, auto).
As the development of these lexical ontologies and word embedding models has matured, many have attempted to combine them, either with joint training (Bian, Gao, and Liu Reference Bian, Gao and Liu2014; Yu and Dredze Reference Yu and Dredze2014; Bollegala et al. Reference Bollegala, Alsuhaibani, Maehara and Kawarabayashi2016; Liu, Nie, and Sordoni Reference Liu, Nie and Sordoni2016; Mancini et al. Reference Mancini, Camacho-Collados, Iacobacci and Navigli2017) or post-processing (Faruqui et al. Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015; Ettinger, Resnik, and Carpuat Reference Ettinger, Resnik and Carpuat2016; Mrkšic et al. Reference Mrkšic, OSéaghdha, Thomson, Gašic, Rojas-Barahona, Su, Vandyke, Wen and Young2016; Lee et al. Reference Lee, Yen, Huang and Chen2017; Lengerich et al. Reference Lengerich, Maas and Potts2017; Glavaš and Vulić Reference Glavaš and Vulić2018). As the need for sense embeddings has become more apparent, some research has focused on learning sense-level embeddings with lexical ontologies.
Joint training for sense embedding utilizes information contained in the lexical database during the intermediate word embedding generation steps. For example, the SensEmbed model utilizes Babelfy to annotate the Wikipedia corpus and then generates sense-level representations from the annotated corpus (Iacobacci et al. Reference Iacobacci, Pilehvar and Navigli2015). NASARI uses WordNet and Wikipedia to generate word-based and synset-based representations and then linearly combines the two embeddings (Camacho-Collados, Pilehvar, and Navigli Reference Camacho-Collados, Pilehvar and Navigli2015). Mancini et al. (Reference Mancini, Camacho-Collados, Iacobacci and Navigli2017) proposed to learn word and sense embeddings in the same space via a joint neural architecture.
In contrast, this research focuses on the post-processing approach. In retrofitting on word embeddings, a new word embedding model is learned by retrofitting (refining) the pre-trained word embedding with the lexical database’s information. One of the advantages of post-processing is that there is no need to train a word embedding from scratch, which often takes huge amounts of time and computational power. Faruqui et al. (Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015) proposed an objective function for retrofitting which minimizes the Euclidean distance of synonymic or hypernym–hyponym relation words in WordNet while at the same time preserving the original word embedding’s structure. Similar to retrofitting, a counter-fitting model was proposed to not only minimize the distance between vectors of words with synonym relations but also maximize the distance between vectors of words with antonym relations (Mrkšic et al. Reference Mrkšic, OSéaghdha, Thomson, Gašic, Rojas-Barahona, Su, Vandyke, Wen and Young2016). Their qualitative analysis shows that before counter-fitting, a word’s closest words are related but not necessarily similar; after counter-fitting, the closest words are indeed similar.
For post-processing sense models, the Retro model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015) applies graph smoothing with WordNet as a retrofitting step to tease apart the vectors of different senses. Li and Jurafsky (Reference Li and Jurafsky2015) proposed to learn sense embeddings through Chinese restaurant processes and presented a pipelined architecture for incorporating sense embeddings into language understanding. Ettinger et al. (Reference Ettinger, Resnik and Carpuat2016) proposed using a parallel corpus to build a sense graph and then performing retrofitting on the constructed sense graph. Yen et al. (Reference Yen, Lee, Huang and Chen2018) proposed to learn sense embeddings through retrofitting on sense and contextual neighbors jointly; however, negative relations were not considered in their model. Remus and Biemann (Reference Remus and Biemann2018) used an unsupervised sense inventory to perform retrofitting on a word embedding to learn the sense embedding, though the quality of the unsupervised sense inventory is questionable. Although it has been shown that sense embeddings do not improve every natural language processing task (Li and Jurafsky Reference Li and Jurafsky2015), there is still a great need for sense embeddings in tasks that require sense-level representations (e.g., synonym selection, word similarity rating, and word sense induction) (Azzini et al. Reference Azzini, da Costa Pereira, Dragoni and Tettamanzi2011; Ettinger et al. Reference Ettinger, Resnik and Carpuat2016; Qiu, Tu, and Yu Reference Qiu, Tu and Yu2016). A survey on word and sense embeddings can be found in Camacho-Collados and Pilehvar (Reference Camacho-Collados and Pilehvar2018). After the proposal of the transformer model, researchers either utilize the transformer model directly (Wiedemann et al. Reference Wiedemann, Remus, Chawla and Biemann2019) or train it with ontologies to better capture the sense of a word in specific sentences (Shi et al. Reference Shi, Chen, Zhou and Chang2019; Loureiro and Jorge Reference Loureiro and Jorge2019; Scarlini, Pasini, and Navigli Reference Scarlini, Pasini and Navigli2020).
3. Generalized sense retrofitting model
3.1. The GenSense model
The goal of the GenSense model is to learn better sense representations such that each new representation is close to its word form representation, its synonym neighbors, and its positive contextual neighbors, while actively pushing away from its antonym neighbors and its negative contextual neighbors (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018). Let $V=\left\{w_1,...,w_n\right\}$ be the vocabulary of a trained word embedding and $|V|$ be its size. The matrix $\hat{Q}$ is the pre-trained collection of vector representations $\hat{Q}_i\in \mathbb{R}^d$ , where d is the dimensionality of a word vector. Each $w_i\in V$ is learned using a standard word embedding technique (e.g., GloVe Pennington et al. Reference Pennington, Socher and Manning2014 or Word2Vec Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a). Let $\Omega=\left(T,E\right)$ be an ontology that contains the semantic relationships, where $T=\left\{t_1,...,t_m\right\}$ is a set of senses and $|T|$ is the total number of senses. In general, $|T|>|V|$ since one word may have more than one sense. For example, in WordNet the word gay has at least two senses, $gay.n.01$ (the first noun sense; homosexual, homophile, homo, gay) and $gay.a.01$ (the first adjective sense; cheery, gay, sunny). Edge $\left(i,j\right)\in E$ indicates a semantic relationship of interest (e.g., synonym) between $t_i$ and $t_j$ . In our scenario, the edge set E consists of several disjoint subsets of interest (for example, the sets of synonyms and antonyms are disjoint, since no pair of senses can be both synonyms and antonyms at the same time), and thus $E=E_{r_1}\cup E_{r_2}\cup ... \cup E_{r_k}$ . If $r_1$ denotes the synonym relationship, then $\left(i,j\right)\in E_{r_1}$ if and only if $t_j$ is a synonym of $t_i$ . We use $\hat{q}_{t_j}$ to denote the word form vector of $t_j$ .Footnote b The goal is then to learn a new matrix $S=\left(s_1,...,s_m\right)$ such that each new sense vector is close to its word form vertex and its synonym neighbors. The basic form of the objective of the sense retrofitting model, which considers only synonym relations, is
where the $\alpha$ ’s balance the importance of the word form vertex and the synonyms, and the $\beta$ ’s control the strength of the semantic relations. When $\alpha_1=0$ and $\alpha_2=1$ , the model considers only the synonym neighbors and may deviate too much from the original vector. From Equation (1), a learned sense vector approaches its synonyms while constraining its distance from its original word form vector. In addition, this equation can be further generalized to consider all relations as
Apart from the positive sense relation, we now introduce three types of special relations. The first is the positive contextual neighbor relation $r_2$ . $\left(i,j\right)\in E_{r_2}$ if and only if $t_j$ is the synonym of $t_i$ and the surface form of $t_j$ has only one sense. In the model, we use the word form vector to represent the neighbors of the $t_i$ ’s in $E_{r_2}$ . These neighbors are viewed as positive contextual neighbors, as they are learned from the context of a corpus (e.g., Word2Vec trained on the Google News corpus) with positive meaning. The second is the negative sense relation $r_3$ . $\left(i,j\right)\in E_{r_3}$ if and only if $t_j$ is the antonym of $t_i$ . The negative senses are used in a subtractive fashion to push the sense away from the positive meaning. The last is the negative contextual neighbor relation $r_4$ . $\left(i,j\right)\in E_{r_4}$ if and only if $t_j$ is the antonym of $t_i$ and the surface form of $t_j$ has only one sense. As with the positive contextual neighbors, negative contextual neighbors are learned from the context of a corpus, but with negative meaning. Table 1 summarizes the aforementioned relations.
In Figure 1, which contains an example of the relation network, the word gay may have two meanings: (1) bright and pleasant; promoting a feeling of cheer and (2) someone who is sexually attracted to persons of the same sex. If we focus on the first sense, then our model attracts $s_{gay_1}$ to its word form vector $\hat{q}_{gay_1}$ , its synonym $s_{glad_1}$ , and its positive contextual neighbor $\hat{q}_{jolly}$ . At the same time, it pushes $s_{gay_1}$ from its antonym $s_{sad_1}$ and its negative contextual neighbor $\hat{q}_{dull}$ .
Formalizing the above scenario and considering all its parts, Equation (2) becomes
To solve the above convex objective function, we apply an iterative updating method (Bengio, Delalleau, and Le Roux Reference Bengio, Delalleau and Le Roux2006). Initially, the sense vectors are set to their corresponding word form vectors (i.e., $s_i\leftarrow\hat{q}_{t_i}\forall i$ ). Then in the following iterations, the updating formula for $s_i$ is
A formal description of the GenSense method is shown in Algorithm 1 (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018), in which the $\beta$ parameters are retrieved from the ontology and $\varepsilon$ is a threshold for deciding whether to update the sense vector; it serves as a stopping criterion when the difference between the new sense vector and the original sense vector is small. Empirically, 10 iterations are sufficient to minimize the objective function from a set of starting vectors and produce effective sense-retrofitted vectors. Based on the GenSense model, the next three subsections describe three approaches to further improve the sense representations.
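A minimal sketch of this iterative, belief-propagation-style update is given below. It is not the paper's exact Algorithm 1: the data structures, relation names, and the way the word form term is weighted are illustrative assumptions, but the attract/repel behavior and the threshold criterion follow the description above.

```python
import numpy as np

def gensense_update(sense_vecs, word_form_vecs, neighbors, alpha, n_iter=10, eps=0.1):
    """Sketch of the iterative GenSense update.

    sense_vecs:     dict sense -> np.ndarray, initialized to the word form vectors
    word_form_vecs: dict sense -> np.ndarray (the pre-trained q-hat vectors)
    neighbors:      dict sense -> list of (vector, relation, beta) triples, where
                    relation is one of 'syn', 'pos_ctx', 'ant', 'neg_ctx'
    alpha:          dict of relation strengths, e.g. {'word': 1.0, 'syn': 1.0, ...}
    """
    sign = {'syn': +1.0, 'pos_ctx': +1.0, 'ant': -1.0, 'neg_ctx': -1.0}
    for _ in range(n_iter):
        for t, nbrs in neighbors.items():
            # The word form term keeps the sense close to its original vector.
            numer = alpha['word'] * word_form_vecs[t]
            denom = alpha['word']
            for vec, rel, beta in nbrs:
                w = alpha[rel] * beta
                numer += sign[rel] * w * vec   # positive relations attract, negative repel
                denom += w
            new_vec = numer / denom
            # Threshold criterion: only update when the change is large enough.
            if np.linalg.norm(new_vec - sense_vecs[t]) > eps:
                sense_vecs[t] = new_vec
    return sense_vecs
```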
3.2. Standardization on dimensions
Although the original GenSense model considers the semantic relations between the senses, the relation strength, and the semantic strength, the literature indicates that the vanilla word embedding model benefits from standardization on the dimensions (Lee et al. Reference Lee, Ke, Huang and Chen2016). In this approach, let $j$ , $1\leq j\leq d$ , index the d dimensions of the sense embedding. Then for every sense vector $s_i\in\mathbb{R}^d$ , the z-score is computed on each dimension as
where $\mu$ is the mean and $\sigma$ is the standard deviation of dimension j. After this process, the sense vector is divided by its norm so that it has unit norm:
where $\left\|s_i\right\|$ is the norm of the sense vector $s_i$ . As this standardization process can be placed in multiple places, we consider the following four situations:
(1) GenSense-Z: the standardization process is performed after every iteration.
(2) GenSense-Z-last: the standardization process is performed only at the end of the whole algorithm.
(3) Z-GenSense: the standardization process is performed at the beginning of each iteration.
(4) Z-first-GenSense: the standardization process is performed only once, before iteration.
The details of this approach are shown in Algorithms 2 and 3. Note that although further combinations or adjustments of these situations are possible, in the experiments we analyze only these four situations.
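The standardization step itself can be sketched as follows. The per-dimension z-scoring and the norm division follow the description above; the function name and the row-wise matrix layout are assumptions of this sketch.

```python
import numpy as np

def standardize(S):
    """Z-score each dimension across all senses, then normalize each sense vector.

    S: (m, d) matrix whose rows are sense vectors.
    """
    # Per-dimension z-score: subtract the dimension mean, divide by its standard deviation.
    Z = (S - S.mean(axis=0)) / S.std(axis=0)
    # Divide each sense vector by its norm.
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)
```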
3.3. Neighbor expansion from nearest neighbors
In this approach, we utilize the nearest neighbors of the target sense vector to refine GenSense. Intuitively, if the sense vector $s_i$ ’s nearest neighbors are uniformly distributed around $s_i$ , then they may not be helpful. In contrast, if the neighbors are clustered in a distinct direction, then utilizing these neighbors is crucial. Figure 2 contains examples of nearest neighbors that may or may not be helpful. In Figure 2(a), cheap’s neighbors are not helpful, since they are uniformly distributed and their effects cancel out. In Figure 2(b), love’s neighbors are helpful, since they are gathered in the same quadrant and move the new sense vector of love closer to its related senses.
In practice, we pre-build a k-d tree over the sense embedding for rapid lookups of the nearest neighbors of vectors (Maneewongvatana and Mount Reference Maneewongvatana and Mount1999). After building the k-d tree and taking the nearest-neighbor term into consideration, the update formula for $s_i$ becomes
where $NN\left(s_i\right)$ is the set of N nearest neighbors of $s_i$ and $\alpha_6$ is a newly added parameter for weighting the importance of the nearest neighbors. Details of the proposed neighbor expansion approach are shown in Algorithm 4. The main procedure of Algorithm 4 is similar to that of Algorithm 1 (GenSense) with two differences: (1) in line 4 we need to build the k-d tree and (2) in line 7 we need to compute the nearest neighbors for Equation (7).
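A sketch of the nearest-neighbor lookup with a k-d tree is shown below; SciPy's cKDTree is used here for illustration, and the averaging of the neighbors mirrors the nearest-neighbor term in Equation (7) only schematically.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_term(S, i, n_neighbors=5):
    """Return the averaged vector of the N nearest neighbors of sense i.

    S: (m, d) matrix of current sense vectors.
    """
    tree = cKDTree(S)                        # in practice, built once per iteration
    # k = n_neighbors + 1 because the closest point is the sense itself.
    _, idx = tree.query(S[i], k=n_neighbors + 1)
    nn_idx = [j for j in idx if j != i][:n_neighbors]
    return S[nn_idx].mean(axis=0)

# In the update of s_i, this averaged neighbor vector is added with weight alpha_6,
# analogously to the other attracting terms in the GenSense update.
```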
3.4. Combination of standardization and neighbor expansion
With the standardization and neighbor expansion approaches, a straightforward and natural way to further improve the sense embedding’s quality is to combine these two approaches. In this study, we propose four combination situations:
(1) GenSense-NN-Z: in each iteration, GenSense is performed with neighbor expansion, after which the sense embedding is standardized.
(2) GenSense-NN-Z-last: in each iteration, GenSense is performed with neighbor expansion. The standardization process is performed only after the last iteration.
(3) GenSense-Z-NN: in each iteration, the sense embedding is standardized and GenSense is performed with neighbor expansion.
(4) GenSense-Z-NN-first: standardization is performed only once, before the iteration process. After that, GenSense is performed with neighbor expansion.
As with standardization on the dimensions, although different combinations of the standardization and neighbor expansion approaches are possible, we analyze only these four situations in our experiments.
4. Procrustes analysis on GenSense
Although the GenSense model retrofits sense vectors connected within a given ontology, words outside of the ontology are not learned. We address this by introducing a bilingual mapping method (Mikolov et al. Reference Mikolov, Le and Sutskever2013b; Xing et al. Reference Xing, Wang, Liu and Lin2015; Artetxe et al. Reference Artetxe, Labaka and Agirre2016; Smith et al. Reference Smith, Turban, Hamblin and Hammerla2017; Joulin et al. Reference Joulin, Bojanowski, Mikolov, Jégou and Grave2018) for learning the mapping between the original word embedding and the learned sense embedding. Let $\left\{x_i,y_i\right\}_{i=1}^n$ be the pairs of corresponding representations before and after sense retrofitting, where $x_i\in\mathbb{R}^d$ is the representation before sense retrofitting and $y_i\in\mathbb{R}^d$ is that after sense retrofitting.Footnote c The goal of orthogonal Procrustes analysis is to find a transformation matrix W such that $Wx_i$ approximates $y_i$ :
where we select the typical square loss $l_2\!\left(x,y\right)=\left\|x-y\right\|_2^2$ as the loss function. When W is constrained to be orthogonal (i.e., $W^{T}W=I_d$ , where $I_d$ is the d-dimensional identity matrix and the dimensionality of the representations before and after retrofitting is the same), selecting the square loss makes Formula (8) a least-squares problem with a closed-form solution. From Formula (8),
Let $X=\left(x_1,...,x_n\right)$ and $Y=\left(y_1,...,y_n\right)$ . Equation (9) can be expressed as
where $\textrm{Tr}\!\left(\cdot\right)$ is the trace operator $\textrm{Tr}\!\left(A\right)=\sum_{i=1}^{n}a_{ii}$ . We first rearrange the matrices in the trace operator
then, using the singular value decomposition $XY^{T}=U\Sigma V^{T}$ , Formula (11) becomes
Since $V^{T}$ , W, and U are orthogonal matrices, $P=V^{T}WU$ must be an orthogonal matrix. From Equation (12), it follows that
From Equation (13), $\left|p_{ii}\right|\leq 1$ ; the trace is therefore maximized when every diagonal entry $p_{ii}$ equals 1, and, given the orthogonality of P, this implies $P=I$ . As a result, $V^{T}WU=I$ and $W=VU^{T}$ .
4.1. Inference from Procrustes method
After obtaining the transformation matrix W, the next step is to infer the sense representations that cannot be retrofitted from the GenSense model (out-of-ontology senses). For a bilingual word embedding, there are two methods for representing these mappings. The first is finding a corresponding mapping of the word that is to be translated:
for word i, where $t\!\left(i\right)$ denotes the translation and $\left\{1,...,n,n+1,...,N\right\}$ denotes the vocabulary of the target language. Though this process is commonly used in bilingual translation, there is a simpler approach for retrofitting. The second method is simply to apply the transformation matrix to the word that is to be translated. However, in bilingual embedding this method yields only the mapped vector and not the translated word in the target language. One advantage of GenSense is that all corresponding senses (those before and after applying GenSense) are known. As a result, we simply apply the transformation matrix to the senses that are not retrofitted (i.e., $Wx_i$ for sense i). In the experiments, we show only the results of the second method, as it is more natural within the context of GenSense.
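The closed-form solution and the inference step can be sketched in a few lines of NumPy. The matrix orientation (representations as columns) and function name are assumptions of this sketch; the solution itself follows the derivation above, with $XY^{T}=U\Sigma V^{T}$ and $W=VU^{T}$.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Find an orthogonal W such that W @ X approximates Y.

    X, Y: (d, n) matrices whose columns are the paired representations
          before and after sense retrofitting, respectively.
    """
    U, _, Vt = np.linalg.svd(X @ Y.T)   # SVD of X Y^T = U Sigma V^T
    return Vt.T @ U.T                   # W = V U^T

# Inference: a sense not covered by the ontology is simply mapped with W.
# W = orthogonal_procrustes(X, Y); mapped_sense = W @ x_out_of_ontology
```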
5. Experiments
We evaluated GenSense using four types of experiments: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. In the testing phase, if a test dataset had missing words, we used the average of all sense vectors to represent each missing word. Note that the results we report for vanilla sense embeddings may differ slightly from other work due to the handling of missing words and the similarity computation method: some work uses a zero vector to represent missing words, whereas other work removes missing words from the dataset. Thus, reported performance should be compared only under the same missing word processing method and the same similarity computation method.
5.1. Ontology construction
Roget’s 21st Century Thesaurus (Kipfer Reference Kipfer1993) was used to build the ontology in the experiments as it includes strength information for the senses. As Roget does not directly provide an ontology, we used it to manually construct synonym and antonym ontologies. We first fetched all the words with their sense sets. Let Roget’s word vocabulary be $V=\left\{w_1,...,w_n\right\}$ and the initial sense vocabulary be $\big\{w_{11},w_{12}, ..., w_{1m_1}, ...,w_{n1},w_{n2},...,w_{nm_n}\big\}$ , where word $w_i$ has $m_i$ senses, covering all parts of speech (POS) of $w_i$ . For example, love has four senses: (1) noun, adoration; very strong liking; (2) noun, person who is loved by another; (3) verb, adore, like very much; and (4) verb, have sexual relations. The initial sense ontology would be $O=\big\{W_{11},W_{12},...,W_{1m_1},...,W_{n1},W_{n2},...,W_{nm_n}\big\}$ . In the ontology, each word $w_i$ had $m_i$ senses, of which $w_{i1}$ was the default sense. The default sense is the first sense in Roget and is usually the most common sense when people use the word. Each $W_{ij}$ carried an initial word set $\left\{w_k|w_k\in V, w_k\text{ is the synonym of }w_{ij}\right\}$ . For example, bank has two senses: $bank_1$ and $bank_2$ . The initial word set of $bank_1$ is store and treasury (which refers to a financial institution). The initial word set of $bank_2$ is beach and coast (which refers to the ground bounding a body of water). We then attempted to assign a corresponding sense to the words in the initial word set. For each word $w_k$ in the word set of $W_{ij}$ , we first computed all the intersections of $W_{ij}$ with $W_{k1},W_{k2},...,W_{km_k}$ , after which we selected the largest intersection according to the cardinalities. If all the cardinalities were zero, then we assigned the default sense to the target word. The procedure for the construction of the ontology from Roget’s thesaurus is given in Algorithm 5. The construction of the antonym ontology from Roget’s thesaurus was similar to Algorithm 5; it differed in that the initial word set was set to $\left\{w_k|w_k\in V,w_k\text{ is the antonym of } w_{ij}\right\}$ .
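The core sense-assignment step of Algorithm 5 can be sketched as follows. The data structures (Python sets of synonym words per sense) and the fallback to the first entry as the default sense are illustrative assumptions consistent with the description above.

```python
def assign_sense(target_sense_set, sense_sets):
    """Map a synonym word to one of its senses by largest overlap.

    target_sense_set: the synonym word set of the target sense W_ij
    sense_sets:       dict mapping each sense of the word to be disambiguated
                      (e.g. 'bank_1', 'bank_2') to its own synonym word set;
                      the first entry is assumed to be the default sense
    """
    best_sense, best_overlap = None, 0
    for sense, word_set in sense_sets.items():
        overlap = len(target_sense_set & word_set)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    if best_overlap == 0:
        # All intersections are empty: fall back to the default (first) sense.
        best_sense = next(iter(sense_sets))
    return best_sense
```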
The vocabulary from the pre-trained GloVe word embedding was used to fetch and build the ontology from Roget. In Roget, the synonym relations are provided in three relevance levels; we set the $\beta$ ’s to $1.0$ , $0.6$ , and $0.3$ from the most relevant synonyms to the least relevant. The antonym relations were constructed in the same way. For each sense, $\beta_{ii}$ was set to the sum of all its relations’ specific weights. Unless specified otherwise, in the experiments we set the $\alpha$ ’s to 1. Although in this study we show only the Roget results, other ontologies (e.g., PPDB Ganitkevitch et al. Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015 and WordNet Miller Reference Miller1998) could be incorporated into GenSense as well.
5.2. Semantic relatedness
Measuring semantic relatedness is a common way to evaluate the quality of the proposed embedding models. We downloaded four semantic relatedness benchmark datasets from the web.
5.2.1. MEN
MEN (Bruni, Tran, and Baroni Reference Bruni, Tran and Baroni2014) contains 3000 word pairs crowdsourced from Amazon Mechanical Turk. Each word pair has a similarity score ranging from 0 to 50. In the crowdsourcing procedure, the annotators were asked to select the more related of two candidate word pairs. For example, between the candidates $\left(wheels, car\right)$ and $\left(dog, race\right)$ , the annotators were expected to select $\left(wheels, car\right)$ , since every car has wheels but not every dog is involved in a race. We further labeled the POS in MEN: 81% were nouns, 7% were verbs, and 12% were adjectives. The MEN dataset provides two versions of the word pairs: lemma and natural form. We report results on the natural form, but the performance on the two versions is quite similar.
5.2.2. MTurk
MTurk (Radinsky et al. Reference Radinsky, Agichtein, Gabrilovich and Markovitch2011) contains 287 human-labeled examples of word semantic relatedness. Each word pair has a similarity score ranging from 1 to 5 from 10 subjects. A higher score value indicates higher similarity. In MTurk, we labeled the POS: 61% were nouns, 29% were verbs, and 10% were adjectives.
5.2.3. WordSim353 (WS353)
WordSim-353 (Finkelstein et al. Reference Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin2002) contains 353 noun word pairs. Each pair has a human-rated similarity score ranging from 0 to 10. A higher score value indicates higher semantic similarity. For example, the score of $\left(journey, voyage\right)$ is $9.29$ and the score of $\left(king, cabbage\right)$ is $0.23$ .
5.2.4. Rare words
The rare words (RW) dataset (Luong, Socher, and Manning Reference Luong, Socher and Manning2013) contains 2034 word pairs crowdsourced from Amazon Mechanical Turk. Each word pair has a similarity score ranging from 0 to 10. A higher score indicates higher similarity. In RW, the frequencies of some words are very low. Table 2 shows the word frequency statistics of WS353 and RW based on Wikipedia.
In RW, the number of unknown words is 801, and another 41 words appear no more than 100 times in Wikipedia. In WS353, in contrast, all words appear more than 100 times in Wikipedia. As some of the words are challenging even for native English speakers, the annotators were asked whether they knew the first and second words; word pairs unknown to most raters were discarded. We labeled the POS: 47% were nouns, 32% were verbs, 19% were adjectives, and 2% were adverbs.
To measure the semantic relatedness between a word pair $\left(w,w^{\prime}\right)$ in the datasets, we adopted the sense evaluation metrics AvgSim and MaxSim (Reisinger and Mooney Reference Reisinger and Mooney2010)
where $K_w$ and $K_{w^{\prime}}$ denote the number of senses of w and w ′, respectively. AvgSim can be seen as a soft metric, as it takes the average of the similarity scores, whereas MaxSim can be seen as a hard metric, as it selects only the pair of senses with the maximum similarity score. To measure the performance of the sense embeddings, we computed the Spearman correlation between the human-rated scores and the AvgSim/MaxSim scores. Table 3 shows a summary of the benchmark datasets and their relationship with the ontologies. Row 3 shows the number of words that are listed both in the datasets and in the ontology. As some words in Roget were not retrofitted, rows 4 and 5 show the number and ratio of words that were affected by the retrofitting model. The word count for Roget was 63,942.
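The two metrics can be written compactly as below; the cosine helper and the representation of a word as a list of sense vectors are assumptions of this sketch.

```python
import numpy as np
from itertools import product

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_w, senses_v):
    """Average cosine similarity over all sense pairs of the two words."""
    sims = [cos(s, t) for s, t in product(senses_w, senses_v)]
    return sum(sims) / len(sims)

def max_sim(senses_w, senses_v):
    """Maximum cosine similarity over all sense pairs of the two words."""
    return max(cos(s, t) for s, t in product(senses_w, senses_v))
```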
5.3. Contextual word similarity
Although the semantic relatedness datasets have been widely used, one disadvantage is that the words in these word pairs lack context. Therefore, we also conducted experiments with Stanford’s contextual word similarities (SCWS) dataset (Huang et al. Reference Huang, Socher, Manning and Ng2012), which consists of 2003 word pairs (1713 words in total, as some words appear in multiple questions) together with human-rated scores. A higher score indicates higher semantic relatedness. In contrast to the semantic relatedness datasets, SCWS words have contexts and POS tags; that is, the human subjects knew the usage of the word when they rated the similarity. For each word pair, we computed its AvgSimC/MaxSimC scores from the learned sense embedding (Reisinger and Mooney Reference Reisinger and Mooney2010)
where $d\!\left(c,\pi_k\right)$ is the likelihood of context c belonging to cluster $\pi_k$ , and $\hat{\pi}\!\left(w,c\right)$ is the maximum likelihood cluster for w in context c. We used a window size of 5 for words in the word pairs (i.e., 5 words prior to $w/w^{\prime}$ and 5 words after $w/w^{\prime}$ ). Stop words were removed from the context. To measure the performance, we computed the Spearman correlation between the human-rated scores and the AvgSimC/MaxSimC scores.
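A simplified sketch of the contextual variants follows. Here the likelihood of a context under a sense is approximated by the cosine between the sense vector and the centroid of the (stop-word-filtered) context window, which is one common choice rather than the exact definition used in the paper; function names are illustrative.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sense_likelihoods(senses, context_vecs):
    """Approximate d(c, pi_k) by the cosine between each sense vector and the
    centroid of the context window word vectors."""
    centroid = np.mean(context_vecs, axis=0)
    return [cos(s, centroid) for s in senses]

def avg_sim_c(senses_w, ctx_w, senses_v, ctx_v):
    """Context-likelihood-weighted average similarity over all sense pairs."""
    dw = sense_likelihoods(senses_w, ctx_w)
    dv = sense_likelihoods(senses_v, ctx_v)
    total = sum(dw[i] * dv[j] * cos(senses_w[i], senses_v[j])
                for i in range(len(senses_w)) for j in range(len(senses_v)))
    return total / (len(senses_w) * len(senses_v))

def max_sim_c(senses_w, ctx_w, senses_v, ctx_v):
    """Similarity between the maximum-likelihood sense of each word in its context."""
    i = int(np.argmax(sense_likelihoods(senses_w, ctx_w)))
    j = int(np.argmax(sense_likelihoods(senses_v, ctx_v)))
    return cos(senses_w[i], senses_v[j])
```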
5.4. Semantic difference
This task was to determine whether a given word was semantically closer to a given feature than another word (Krebs and Paperno Reference Krebs and Paperno2016). In this dataset, there were 528 concepts, 24,963 word pairs, and 128,515 items. Each word pair came with a feature. For example, in the test $\left(airplane,helicopter\right):\,wings$ , the first word was to be chosen if and only if $\cos\left(airplane,wings\right)>\cos\left(helicopter,wings\right)$ ; otherwise, the second word was chosen. As this dataset did not provide context for disambiguation, we used strategies similar to those of the semantic relatedness task:
In AvgSimD, we chose the first word iff $AvgSimD\!\left(w_1,w^{\prime}\right)> AvgSimD\!\left(w_2,w^{\prime}\right)$ . In MaxSimD, we chose the first word iff $MaxSimD\!\left(w_1,w^{\prime}\right)> MaxSimD\!\left(w_2,w^{\prime}\right)$ . The performance was determined by computing the accuracy.
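This decision rule can be sketched as follows, reusing the avg_sim/max_sim helpers from the earlier sketch; passing one or the other corresponds to the AvgSimD/MaxSimD variants.

```python
def semantic_difference_choice(senses_w1, senses_w2, senses_feature, metric):
    """Choose word 1 if it is closer to the feature than word 2, else word 2.

    metric: avg_sim or max_sim from the earlier sketch (AvgSimD / MaxSimD).
    """
    score1 = metric(senses_w1, senses_feature)
    score2 = metric(senses_w2, senses_feature)
    return 1 if score1 > score2 else 2
```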
5.5. Synonym selection
Finally, we evaluated the proposed GenSense on three benchmark synonym selection datasets: ESL-50 (English as a Second Language) (Turney Reference Turney2001), RD-300 (Reader’s Digest Word Power Game) (Jarmasz and Szpakowicz Reference Jarmasz and Szpakowicz2004), and TOEFL-80 (Test of English as a Foreign Language) (Landauer and Dumais Reference Landauer and Dumais1997). The number in each dataset name is the number of questions in the dataset. Each question consists of a question word and a set of answer words, and the task for each sense embedding was to determine which word in the answer set was most similar to the question word. For example, with brass as the question word and metal, wood, stone, and plastic as the answer words, the correct answer was metal.Footnote d As with the semantic relatedness task, we used AvgSim and MaxSim for the synonym selection task: we first used AvgSim/MaxSim to compute the scores between the question word and the words in the answer set and then selected the answer with the maximum score. Performance was determined by computing the accuracy. Table 4 summarizes the synonym selection benchmark datasets.
5.6. Training models
In the experiments, we use GloVe’s 50d version as the base model unless otherwise specified (Pennington et al. Reference Pennington, Socher and Manning2014). The pre-trained GloVe word embedding was trained on Wikipedia and Gigaword-5 (6B tokens, 400k vocabulary, uncased, 50d vectors). We also test GloVe’s 300d version and two well-known vector representation models from the literature: Word2Vec’s 300d version (trained on part of the Google News dataset, which contains 100 billion words) (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a) and FastText’s 300d version (2 million word vectors trained on Common Crawl, which contains 600B tokens) (Bojanowski et al. Reference Bojanowski, Grave, Joulin and Mikolov2017). Since Word2Vec and FastText do not release a 50d version, we extract the first 50 dimensions from the 300d versions to explore the impact of dimensionality. We also conduct experiments on four contextualized word embedding models: BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), DistilBERT (Sanh et al. Reference Sanh, Debut, Chaumond and Wolf2019), RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019), and Transformer-XL (T-XL) (Dai et al. Reference Dai, Yang, Yang, Carbonell, Le and Salakhutdinov2019). To control the experiment, we concatenate the last four layers of all pre-trained transformer models to represent a word. We tried other layer settings and found that the concatenation of the last four layers generates good results empirically. We choose the base uncased version for BERT (3072d) and DistilBERT (3072d), the base version for RoBERTa (3072d), and the transfo-xl-wt103 version for T-XL (4096d).
We set the convergence criterion for the sense vectors to $\varepsilon=0.1$ and the number of iterations to 10. We used three types of generalization: GenSense-syn (only the synonyms and positive contextual neighbors were considered), GenSense-ant (only the antonyms and negative contextual neighbors were considered), and GenSense-all (everything was considered). Specifically, the objective function of the GenSense-syn was
and the objective function of the GenSense-ant was
6. Results and discussion
6.1. Semantic relatedness
Table 5 shows the Spearman correlation ( $\rho\times100$ ) of AvgSim and MaxSim between the human scores and the sense embedding scores on each benchmark dataset. For each version (except the transformer ones), the first row shows the performance of the vanilla word embedding. Note that the MaxSim and AvgSim scores are equal when there is only one sense for each word (word embedding). The second row shows the performance of the Retro model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015). The third, fourth, and fifth rows show the GenSense performance for three versions: synonym, antonym, and all, respectively. The last row shows the Procrustes method using GenSense-all. The macro-averaged (average over the four benchmark datasets) and the micro-averaged (weighted average, considering the number of word pairs in every benchmark dataset) results are in the rightmost two columns.
From Table 5, we observe that the proposed model outperforms Retro and GloVe on all datasets (macro and micro). The best overall model is Procrustes-all. All versions of the GenSense model outperform Retro in almost all the tasks. Retro performs poorly on the RW dataset; in RW, GenSense-syn’s MaxSim score exceeds Retro’s by $22.6$ (GloVe 300d), $29.3$ (FastText 300d), and $29.1$ (Word2Vec 300d). We also observe a significant increase in the Spearman correlation from GenSense-syn to GenSense-all. Surprisingly, the model with only synonyms and positive contextual information already outperforms Retro and GloVe. After utilizing the antonym knowledge from Roget, performance is further improved on all but the RW dataset. This suggests that the antonyms in Roget are quite informative and useful. Moreover, GenSense adapts information from synonyms and antonyms to boost its performance. Although the proposed model pulls sense vectors away from their reverse senses with the help of the antonyms and negative contextual information, with only negative relations this shift does not guarantee that the new sense vectors move to a better place every time. As a result, GenSense-ant does not perform as well as GenSense-syn in general. Procrustes-all performs better than GenSense-all in most tasks, but the improvement is marginal. This is due to the high ratio of retrofitted words (see Table 3); in other words, the Procrustes method is applied only to a small portion of the sense vectors. In both of the additional evaluation metrics, the GenSense model outperforms Retro by a large margin, and Procrustes-all is the best among the proposed models. These two metrics attest to the robustness of our proposed model in comparison to the Retro model. For classic word embedding models, FastText outperforms Word2Vec, and Word2Vec outperforms GloVe, but not on all tasks. There is a clear gap between the 50d and 300d versions in all the models. A similar trend can be found in other NLP tasks (Yin and Shen Reference Yin and Shen2018).
The performance of the transformer models and the GenSense models may not be directly comparable, as their training corpora and dimensionalities are very different. From the results, only DistilBERT (3072d) outperforms the vanilla GloVe 50d model (on RW and WS353), and there remains a performance gap with respect to GloVe 300d, FastText 300d, and Word2Vec 300d on all the datasets. RoBERTa is the worst among the transformer models. The poor performance of the contextualized word representation models may be due to the fact that they cannot accurately capture the semantic equivalence of contexts (Shi et al. Reference Shi, Chen, Zhou and Chang2019). Another possible reason is that the best configuration of the transformer models was not explored, for example, how many layers to select and how to combine the selected layers (concatenation, averaging, max pooling, or min pooling). Although the performance of the transformers is poor here, we will show that transformer models outperform GenSense under the same setting in the later experiment that involves context (Section 6.2).
We also conducted an experiment to evaluate the benefits yielded by the relation strength. We ran GenSense-syn over the Roget ontology with a grid of $\left(\alpha_1,\alpha_2,\alpha_3\right)$ parameters. Each parameter was tuned from $0.0$ to $1.0$ with a step size of $0.1$ . The default setting of GenSense was $\left(1.0,1.0,1.0\right)$ . Table 6 shows the MaxSim results and Table 7 shows the AvgSim results. Note that more than one $\alpha_1/\alpha_2/\alpha_3$ combination may attain the worst or the best case; in that case, we report only one $\alpha_1/\alpha_2/\alpha_3$ setting in Tables 6 and 7. From Table 6, we observe that the default setting yields relatively good results in comparison to the best case. Another point worth mentioning is that the worst performance occurs under the $0.1/1/0.1$ setting, except for the WS353 dataset. Similar results can be found with Table 7’s AvgSim metric. Since $\alpha_1$ controls the importance of the distance between the original vector and the retrofitted vector, the poor performance for small $\alpha_1$ suggests that the retrofitted vector should not deviate too far from the originally trained word vector. When observing the best cases, we found that $\alpha_1$ , $\alpha_2$ , and $\alpha_3$ are close to each other in many tasks. For a deeper analysis of how the parameters affect performance, Figure 3 shows the histogram of the performance when tuning $\alpha_1/\alpha_2/\alpha_3$ on all the datasets. From Figure 3, we find that the distributions are left-skewed for all tasks except the AvgSim of RW, suggesting the robustness of GenSense-syn.
We also ran GenSense-syn over the Roget ontology with another grid of parameters covering a wider range. Specifically, the grid parameters were tuned over the parameter set $\left\{0.1,0.5,1.0,1.5,2.0,2.5,3.0,5.0,10.0\right\}$ . The results are shown in Tables 8 and 9. From the results, we find that the improvements for the best cases are almost the same as those in Tables 6 and 7 (MEN, MTurk, and WS353); only RW’s performance increase is larger. In contrast, the worst case drops considerably for all datasets. For example, MEN’s MaxSim drops from $52.4$ to $37.0$ , a $15.4$ drop. This shows the importance of carefully selecting parameters in the learning model. Also worth mentioning is that these worst cases happen when $\alpha_2$ is large, showing the negative effect of placing too much weight on the synonym neighbors.
In addition to the parameters, it is also worth analyzing the impact of dimensionality. Figure 4 shows the $\rho\times100$ of MaxSim on the semantic relatedness benchmark datasets as a function of the vector dimension. All GloVe pre-trained models were trained on the 6-billion-token corpus with 50d, 100d, 200d, and 300d vectors. We used the GenSense-all model on the GloVe pre-trained models. In Figure 4, the proposed GenSense-all outperforms GloVe on all the datasets for all the tested dimensions. In GloVe’s original paper, GloVe’s performance (in terms of accuracy) was shown to increase with the dimension between 50d and 300d. In this experiment, we show that both GloVe’s and GenSense-all’s performance increases with the dimension between 50d and 300d in terms of the $\rho\times100$ of MaxSim. Similar results are found for the AvgSim metric.
Table 10 shows selected MEN word pairs and their corresponding GenSense-all, GloVe, and Retro scores as a case study. For GenSense-all, GloVe, and Retro, we sorted the MaxSim scores and re-scaled them to MEN’s score distribution. From Table 10, we find that GenSense-all improves on the pre-trained word embedding model (in terms of closeness to MEN’s score; smaller is better) in the following situations: (1) both words have few senses $\left(lizard, reptiles\right)$ , (2) both words have many senses $\left(stripes, train\right)$ , and (3) one word has many senses and one word has few senses $\left(rail, railway\right)$ . In other words, GenSense-all handles all possible situations well. In some cases, the Retro model also increases the closeness to MEN’s score.
6.1.1. Standardization on dimensions
Table 11 shows the results of the vanilla word embedding (GloVe), the standardized vanilla word embedding (GloVe-Z), GenSense-all, and the four standardized GenSense methods (rows 4 to 7). The results show that standardization of the vanilla word embedding (GloVe-Z) improves performance on some datasets but not all. In contrast, standardization of GenSense outperforms both GenSense-all and the GloVe-Z model. This suggests that the vanilla word embedding may not be optimized well. Although no model consistently performs the best among all the standardization models, overall Z-GenSense performs the best in terms of the Macro and Micro metrics.
6.1.2. Neighbor expansion from nearest neighbors
Table 12 shows the vanilla word embedding (GloVe), the Retro model, and the neighbor expansion model. The results verify our assumption that nearest neighbors play an important role in the GenSense model. From Table 12, GenSense-NN outperforms GloVe and Retro on all the benchmark datasets.
6.1.3. Combination of standardization and neighbor expansion
Table 13 shows the experimental results of the vanilla word embedding (GloVe), the Retro model, and four combination models (GenSense-NN-Z, GenSense-NN-Z-last, GenSense-Z-NN, and GenSense-Z-NN-first). The overall best model is GenSense-NN-Z-last (in terms of Macro and Micro). GenSense-Z-NN also performs the best in Macro’s AvgSim. Again, there is no dominant model (that outperforms all other models) among the combination models, but almost all models outperform the baseline models GloVe and Retro. Tables 12 and 13 suggest that all the benchmark datasets can be further improved.
6.2. Contextual word similarity
Table 14 shows the Spearman correlation $\rho\times100$ on the SCWS dataset. Unlike the other models, for DistilBERT we first embed the entire sentence and then extract the embedding of the target word; after extracting the embeddings of the two words in a pair, we compute their cosine similarity. With the sense-level information, both GenSense and Retro outperform the word embedding model GloVe, and the GenSense model slightly outperforms Retro. The results suggest that the negative relation information in GenSense-ant may not be helpful. We suspect that the quality of SCWS may not be well controlled. As there are 10 subjects for each question in SCWS, we further analyzed the distribution of the score ranges in SCWS and found many questions with a large range, reflecting the vagueness of the questions. Overall, $35.5$ % of the questions had a range of 10 (i.e., some subjects assigned the highest score and some assigned the lowest score), and $50.0$ % had a range of 9 or 10. Unlike the results of the semantic relatedness experiment, all the transformer models outperform the GenSense models here. This result is not surprising, as the contextualized word embedding models are pre-trained to better represent the target word given its context through masked language modeling and next sentence prediction tasks.
6.3. Semantic difference
Table 15 shows the results of the semantic difference experiment. We observe that GenSense outperforms Retro and GloVe by a small margin, and the accuracy of Retro decreases in this experiment. As this task focuses on concepts, we find that synonym and antonym information is not very useful when comparing the results with GloVe. This experiment also suggests that further information about concepts is required to improve performance. Surprisingly, the antonym relation plays an important role when computing the semantic difference, especially with the AvgSimD metric.
6.4. Synonym selection
Table 16 shows the results of the synonym selection experiment: GenSense-all outperforms the baseline models on the RD-300 and TOEFL-80 datasets. In ESL-50, the best model is Retro, showing that improvements are still possible for the default GenSense model. Nevertheless, in ESL-50 GenSense-syn and GenSense-all outperform the vanilla GloVe embedding by a large margin. We also note the relatively poor performance of GenSense-ant in comparison to GenSense-syn and GenSense-all; this shows that the antonym information is relatively unimportant in the synonym selection task.
7. Limitations and future directions
This research focuses on generalizing sense retrofitting models and evaluates the models on semantic relatedness, contextual word similarity, semantic difference, and synonym selection datasets. In the semantic relatedness experiment, we compare GenSense with BERT-family models and show that GenSense can outperform the BERT models. However, the experiment may be unfair to the BERT-family models, as they are context-sensitive language models while each word pair in the dataset is context-free. This slight mismatch between the nature of the datasets and the models makes the comparisons harder to interpret. A possible way to address the issue is to apply the generalized sense representations learned by the proposed method in downstream natural language processing applications to conduct extrinsic evaluations. If the downstream tasks involve context-sensitive features, the tasks themselves will give an advantage to the context-sensitive models. Nevertheless, it would be interesting to evaluate the embeddings extrinsically by training neural network models of the same architecture for downstream tasks (e.g., named entity recognition; Santos, Consoli, and Vieira Reference Santos, Consoli and Vieira2020) and comparing different language models. Another research direction related to this work is fair NLP. Since word embeddings are largely affected by corpus statistics, many works focus on debiasing word embeddings, especially with respect to gender, race, and society (Bolukbasi et al. Reference Bolukbasi, Chang, Zou, Saligrama and Kalai2016; Caliskan, Bryson, and Narayanan Reference Caliskan, Bryson and Narayanan2017; Brunet et al. Reference Brunet, Alkalay-Houlihan, Anderson and Zemel2019). How to incorporate these debiasing techniques into the GenSense model is worth exploring. Finally, research in the social sciences, psychology, and history depends heavily on high-quality word or sense embedding models (Hamilton, Leskovec, and Jurafsky Reference Hamilton, Leskovec and Jurafsky2016; Caliskan et al. Reference Caliskan, Bryson and Narayanan2017; Jonauskaite et al. Reference Jonauskaite, Sutton, Cristianini and Mohr2021). Our proposed GenSense embedding model can bring great value to these research fields.
8. Conclusions
In this paper, we present GenSense, a generalized framework for learning sense embeddings. As GenSense belongs to the post-processing retrofitting model family, it enjoys all the benefits of retrofitting, such as shorter training times and lower memory requirements. In the generalization, (1) we extend the synonym relation to the positive contextual neighbor relation, the antonym relation, and the negative contextual neighbor relation; (2) we consider the semantic strength of each relation; and (3) we use the relation strength between relations to balance the different components. We conduct experiments on four types of tasks: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. Experimental results show that GenSense outperforms previous approaches. In the grid search experiment evaluating the benefits yielded by the relation strength, we find an $87.7$ % performance difference between the worst and the best cases on the WS353 dataset. Based on the proposed GenSense, we also propose a standardization process on the dimensions with four settings, a neighbor expansion process from the nearest neighbors, and four different combinations of the two approaches. We also propose a Procrustes analysis approach, inspired by bilingual mapping models, for learning representations of senses outside of the ontology. The experimental results show the advantages of these modifications on the semantic relatedness task. Finally, we have released the source code and the pre-trained model as a resource for the research community.Footnote e,Footnote f Other versions of the sense-retrofitted embeddings can be found on the website.
Acknowledgements
This research was partially supported by Ministry of Science and Technology, Taiwan, under grants MOST-107-2634-F-002-019, MOST-107-2634-F-002-011, and MOST-106-2923-E-002-012-MY3.