An overview of word and sense similarity

Abstract
Over the last two decades, determining the similarity between words as well as between their meanings, that is, word senses, has proven to be of vital importance in the field of Natural Language Processing. This paper provides the reader with an introduction to the tasks of computing word and sense similarity. These tasks consist in computing the degree of semantic likeness between words and senses, respectively. First, we distinguish between two major approaches: the knowledge-based approaches and the distributional approaches. Second, we detail the representations and measures employed for computing similarity. We then illustrate the evaluation settings available in the literature and, finally, discuss suggestions for future research.


Introduction
Measuring the degree of semantic similarity between linguistic items has been a great challenge in the field of Natural Language Processing (NLP), a sub-field of Artificial Intelligence concerned with the handling of human language by computers. Over the last two decades, several different approaches have been put forward for computing similarity using a variety of methods and techniques. However, before examining such approaches, it is crucial to provide a definition of similarity: what is meant exactly by the term 'similar'? Are all semantically related items 'similar'? Resnik (1995) and Budanitsky and Hirst (2001) make a fundamental distinction between two apparently interchangeable concepts, that is, similarity and relatedness. In fact, while similarity refers to items which can be substituted in a given context (such as cute and pretty) without changing the underlying semantics, relatedness indicates items which have semantic correlations but are not substitutable. Relatedness encompasses a much larger set of semantic relations, ranging from antonymy (beautiful and ugly) to correlation (beautiful and appeal). As is apparent from Figure 1, beautiful and appeal are related but not similar, whereas pretty and cute are both related and similar. In fact, similarity is often considered to be a specific instance of relatedness (Jurafsky 2000), where the concepts evoked by the two words belong to the same ontological class. In this paper, relatedness will not be discussed and the focus will lie on similarity.
In general, semantic similarity can be classified on the basis of two fundamental aspects. The first concerns the type of resource employed, whether it be a lexical knowledge base (LKB), that is, a wide-coverage structured repository of linguistic data, or large collections of raw textual data, that is, corpora. Accordingly, we distinguish between knowledge-based semantic similarity, in the former case, and distributional semantic similarity, in the latter. Furthermore, hybrid semantic similarity combines both knowledge-based and distributional methods. The second aspect concerns the type of linguistic item to be analysed, which can be:

• words, which are the basic building blocks of language, also including their inflectional information.
• word senses, that is, the meanings that words convey in given contexts (e.g., the device meaning vs. the animal meaning of mouse).
• sentences, that is, grammatical sequences of words which typically include a main clause, made up of a predicate, a subject and, possibly, other syntactic elements.
• paragraphs and texts, which are made up of sequences of sentences.

This paper focuses on the first two items, that is, words and senses, and provides a review of the approaches used for determining the extent to which two or more words or senses are similar to each other, ranging from the earliest attempts to recent developments based on embedded representations.

Figure 1. An explicative illustration of word similarity and relatedness (words shown: pretty, beautiful, cute, ugly, appeal, flower; regions: related, similar).

Outline
The rest of this paper is structured as follows. First, we describe the tasks of word and sense similarity (Section 2). Subsequently, we detail the main approaches that can be employed for performing these tasks (Sections 3-5) and describe the main measures for comparing vector representations (Section 6). We then move on to the evaluation of word and sense similarity measures (Section 7). Finally, we draw conclusions and propose some suggestions for future research (Section 8).

Task description
Given two linguistic items $i_1$ and $i_2$, either words or senses in our case, the task consists in calculating some function $sim(i_1, i_2)$ which provides a numeric value that quantifies the estimated similarity between $i_1$ and $i_2$. More formally, the similarity function is of the kind:

$$sim : I \times I \rightarrow \mathbb{R}$$

where $I$ is the set of linguistic items of interest and the output of the function typically ranges between 0 and 1, or between −1 and 1. Note that the set of linguistic items can be cross-level, that is, it can include (and therefore enable the comparison of) items of different types, such as words and senses (Jurgens 2016).

In order to compute the degree of semantic similarity between items, two major steps have to be carried out. First, it is necessary to identify a suitable representation of the items to be analysed. The way a linguistic item is represented has a fundamental impact on the effectiveness of the computation of semantic similarity, as a consequence of the expressiveness of the representation. For example, a representation which counts the number of occurrences and co-occurrences of words can be useful when operating at the lexical level, but can lead to more difficult calculations when moving to the sense level, for example, due to the paucity of sense-tagged training data. Second, an effective similarity measure has to be selected, that is, a way to compare items on the basis of a specific representation.
Word and sense similarity can be computed following two main approaches:
• Knowledge-based similarity exploits explicit representations of meaning derived from wide-coverage lexical-semantic knowledge resources (introduced in Section 3).
• Distributional similarity draws on distributional semantics, also known as vector space semantics, and exploits the statistical distribution of words within unstructured text (introduced in Section 4).
Hybrid similarity measures, introduced in Section 5, combine knowledge-based and distributional similarity approaches, that is, knowledge from LKBs and occurrence information from texts.

Knowledge-based word and sense similarity
Knowledge-based approaches compute semantic similarity by exploiting the information stored in an LKB. With this aim in view, two main methods can be employed. The first method computes the semantic similarity between two given items $i_1$ and $i_2$ by inferring their semantic properties on the basis of structural information concerning $i_1$ and $i_2$ within a specific LKB. The second method performs the extraction and comparison of a vector representation of $i_1$ and $i_2$ obtained from the LKB. It is important to note that the first method is now deprecated, as the best performance can be achieved by using more sophisticated techniques, both knowledge-based and distributional, which we will detail in the following sections. We now introduce the most common LKBs (Section 3.1), and then overview methods and measures employed for knowledge-based word and sense similarity (Section 3.2).

Lexical knowledge resources
Here we will review the most popular lexical knowledge resources, which are widely used not only for computing semantic similarity, but also in many other NLP tasks.

WordNet. WordNet (Fellbaum 1998) is undoubtedly the most popular LKB for the English language, originally developed on the basis of psycholinguistic theories. WordNet can be viewed as a graph, whose nodes are synsets, that is, sets of synonyms, and whose edges are semantic relations between synsets. WordNet encodes the meanings of an ambiguous word through the synsets which contain that word and therefore the corresponding senses. For instance, for the word table, the term tabular array is a synonym in the data matrix sense, while mesa is a synonym in the tableland sense. WordNet makes clear the important distinction between words, senses and synsets: a word is a possibly ambiguous string which represents a single, meaningful linguistic element (e.g., table), a sense is a given meaning of a certain word (e.g., the matrix sense of table, also denoted as table#n#1, table.n.1 or $table^1_n$ to indicate it is the first nominal sense in the WordNet inventory for that word) and a synset is a set of senses all expressing the same concept. A synset has a one-to-one correspondence with a concept, which is purely semantic. A sense (e.g., [...]

Note that some WordNet relations are semantic, in that they connect synsets, whereas others, such as the derivationally related form and the pertainymy relations, hold between word senses (i.e., words occurring in synsets). However, it is a common practice, for the purposes of many NLP tasks, to take lexical relations to the semantic level, so as to connect the corresponding enclosing synsets (Navigli and Ponzetto 2012).

Roget's thesaurus. Created by the English lexicographer Peter Mark Roget in 1805, Roget's thesaurus is a historical lexicographic resource, used in NLP as an alternative to WordNet for knowledge acquisition and semantic similarity (Jarmasz and Szpakowicz 2003). Roget's thesaurus was made available for the first time in 1852 and was one of the resources employed for creating WordNet (Miller et al. 1990).
Wikipedia. Started in 2001, Wikipedia has become the largest and most reliable online encyclopaedia in the space of a few years and has gained momentum quickly in several NLP tasks, such as text classification (Navigli et al. 2011), Word Sense Disambiguation (WSD) (Navigli, Jurgens, and Vannella 2013; Moro and Navigli 2013), entity linking (Moro and Navigli 2013) and many others (Hovy, Navigli, and Ponzetto 2013). Wikipedia can be viewed as a lexical knowledge resource with a graph structure whose nodes are Wikipedia pages and whose relations are given by the hyperlinks that connect one page to another. Compared to WordNet, Wikipedia provides three key features which make it very popular in NLP: first, it covers world knowledge in terms of named entities (such as well-known people, companies and works of art) on a large scale; second, it provides coverage of multiple languages, by linking a given page to its counterparts in dozens of other languages, whenever these are available; and third, it is continuously updated.

Wiktionary. Another resource which has become popular in NLP is Wiktionary, a sister project to Wikipedia. Available in almost 200 languages, Wiktionary is a free, Web-based collaborative dictionary that is widely employed in several NLP tasks such as WSD and semantic similarity (Zesch, Müller, and Gurevych 2008).

BabelNet. Built on top of WordNet and Wikipedia, BabelNet (Navigli and Ponzetto 2012) is the most popular wide-coverage multilingual lexical knowledge resource, used in dozens of tasks among which we cite state-of-the-art multilingual disambiguation (Moro, Raganato, and Navigli 2014), semantic similarity (Camacho-Collados, Pilehvar, and Navigli 2016) and semantically enhanced machine translation (Moussallem, Wauer, and Ngonga Ngomo 2018). BabelNet is the result of the automatic interlinking and integration of different knowledge resources, such as WordNet, Wikipedia, Wiktionary, Wikidata and other resources.
The underlying structure is modelled after that of WordNet: multilingual synsets are created which contain lexicalizations that, in different languages, express the same concept. For instance, the car synset includes, among others, the following lexicalizations (the language code is subscripted): { car EN , automobile EN , macchina IT , voiture FR , coche ES , ..., Wagen DE }.
The relations interconnecting the BabelNet multilingual synsets come from the integrated resources, such as those from WordNet and Wikipedia (where hyperlinks are labelled as semantic relatedness relations). As a result, similar to WordNet and Wikipedia, BabelNet can also be viewed as a graph and its structure exploited to perform semantic similarity.

Knowledge-based representations and measures

Earlier attempts
Knowledge-based representations and measures always rely on the availability of LKBs. Earlier efforts aimed at calculating word and sense similarity by exploiting solely the taxonomic structure of an LKB, such as WordNet. The structural information usually exploited by these measures is based on the following ingredients:
• The depth of a given concept (i.e., synset) in the LKB taxonomy;
• The length of the shortest path between two concepts in the LKB;
• The Least Common Subsumer (LCS), that is, the lowest concept in the taxonomical hierarchy which is a common hypernym of two target concepts.
In knowledge-based approaches, computing the similarity between two senses $s_1$ and $s_2$ is straightforward, because it involves the calculation of a measure concerning the two corresponding nodes in the LKB graph. When two words $w_1$ and $w_2$ are involved, instead, the similarity between them can be computed as the maximum similarity across all their sense combinations:

$$sim(w_1, w_2) = \max_{s_1 \in Senses(w_1),\ s_2 \in Senses(w_2)} sim(s_1, s_2)$$

where $Senses(w_i)$ is the set of senses provided in the LKB for word $w_i$.
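The word-level maximum over sense pairs can be sketched as follows. This is a minimal illustration, not a reference implementation: the sense inventory, the sense identifiers and the sense-level scores are hypothetical values invented for the example, and any sense-level measure could be plugged in as `sense_sim`.

```python
from itertools import product

def word_similarity(w1, w2, senses, sense_sim):
    """Maximum sense-level similarity over all sense pairs of two words."""
    pairs = product(senses[w1], senses[w2])
    return max((sense_sim(s1, s2) for s1, s2 in pairs), default=0.0)

# Hypothetical toy inventory and sense-level scores (illustrative only).
SENSES = {
    "mouse": ["mouse_animal", "mouse_device"],
    "keyboard": ["keyboard_device"],
}
TOY_SCORES = {
    ("mouse_animal", "keyboard_device"): 0.1,
    ("mouse_device", "keyboard_device"): 0.8,
}

def toy_sense_sim(s1, s2):
    """Symmetric lookup in the hypothetical score table."""
    return TOY_SCORES.get((s1, s2), TOY_SCORES.get((s2, s1), 0.0))

score = word_similarity("mouse", "keyboard", SENSES, toy_sense_sim)
```

Here the device sense of mouse determines the word-level score, since it is the sense pair with the highest similarity.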

Path.
One of the earliest and simplest knowledge-based algorithms for the computation of semantic similarity is based on the assumption that the shorter the path in a specific LKB graph between two senses, the more semantically similar they are. Given two senses $s_1$ and $s_2$, the path length (Rada et al. 1989) can be computed as follows:

$$sim(s_1, s_2) = \frac{1}{length(s_1, s_2)}$$

where $length(s_1, s_2)$ is the length of the shortest path between the two senses and we adjusted the original formula to a similarity measure by calculating its reciprocal. Related to this approach, but based on the structural distance within Roget's thesaurus (see Section 3.1), a similar algorithm has been put forward by Jarmasz and Szpakowicz (2003).
The key idea behind this type of algorithm is that the farther apart the senses of the two words of interest are in the LKB, the lower the degree of similarity between the two words.
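As a rough sketch of the reciprocal path-length idea, the following code runs a breadth-first search over a hypothetical toy taxonomy (not actual WordNet data). One assumption is made explicit: path length is counted in nodes rather than edges, so that identical senses have length 1 and similarity 1.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> hypernym (not actual WordNet data).
HYPERNYM = {"animal": "entity", "artifact": "entity",
            "dog": "animal", "cat": "animal", "car": "artifact"}

def neighbours(node):
    """Nodes one hypernymy or hyponymy edge away from `node`."""
    result = {child for child, parent in HYPERNYM.items() if parent == node}
    if node in HYPERNYM:
        result.add(HYPERNYM[node])
    return result

def path_length(s1, s2):
    """Number of nodes on the shortest path between s1 and s2 (assumed convention)."""
    queue, seen = deque([(s1, 1)]), {s1}
    while queue:
        node, length = queue.popleft()
        if node == s2:
            return length
        for nxt in neighbours(node) - seen:
            seen.add(nxt)
            queue.append((nxt, length + 1))
    raise ValueError("senses are not connected")

def path_similarity(s1, s2):
    """Reciprocal of the shortest path length."""
    return 1.0 / path_length(s1, s2)
```

On this toy graph, dog and cat (siblings under animal) come out as more similar than dog and car, whose shortest path passes through the root.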

Leacock and Chodorow.
A variant of the path measure was proposed by Leacock and Chodorow (1998), which computes semantic similarity as:

$$sim(s_1, s_2) = -\log \frac{length(s_1, s_2)}{2D}$$

where $length(s_1, s_2)$ refers to the shortest path between the two senses and $D$ is the maximum depth of the (nominal) taxonomy of a given LKB (historically, WordNet).

Wu and Palmer.
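In order to take into account the taxonomical information shared by two senses, Wu and Palmer (1994) put forward the following measure:

$$sim(s_1, s_2) = \frac{2 \cdot depth(LCS(s_1, s_2))}{depth(s_1) + depth(s_2)}$$

where the higher the LCS lies in the taxonomy (i.e., the closer it is to the root and the smaller its depth), the lower the similarity between $s_1$ and $s_2$.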
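A minimal sketch of a depth-and-LCS computation of this kind is given below, on a hypothetical toy taxonomy (not actual WordNet data). It assumes the usual Wu and Palmer form, twice the depth of the LCS divided by the sum of the depths of the two senses, and it assumes that depth is counted in nodes, so that the root has depth 1.

```python
# Hypothetical toy taxonomy: child -> hypernym (not actual WordNet data).
HYPERNYM = {"animal": "entity", "artifact": "entity",
            "dog": "animal", "cat": "animal", "car": "artifact"}

def ancestors(sense):
    """Chain from `sense` up to the root, `sense` included."""
    chain = [sense]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def depth(sense):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(ancestors(sense))

def lcs(s1, s2):
    """Least Common Subsumer: the deepest ancestor shared by s1 and s2."""
    shared = set(ancestors(s2))
    return next(a for a in ancestors(s1) if a in shared)

def wu_palmer(s1, s2):
    """Twice the depth of the LCS over the summed depths of the two senses."""
    return 2 * depth(lcs(s1, s2)) / (depth(s1) + depth(s2))
```

With this toy data the LCS of dog and cat is animal, while the LCS of dog and car is the root, which yields a lower score.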

Resnik.
A more sophisticated approach was proposed by Resnik (1995), who developed a notion of information content which determines the amount of information covered by a certain WordNet synset in terms of all its descendants (i.e., hyponyms). Formally, this similarity measure is computed as follows:

$$sim(s_1, s_2) = IC(LCS(s_1, s_2))$$

where IC, that is, the information content, is defined as:

$$IC(S) = -\log P(S)$$

where $P(S)$ is the probability that a word, randomly selected within a large corpus, is an instance of a given synset $S$. Such probability is calculated as:

$$P(S) = \frac{\sum_{w \in words(S)} count(w)}{N}$$

where $words(S)$ is the set of words contained in synset $S$ and all its hyponyms, $count(w)$ is the number of occurrences of $w$ in the reference corpus and $N$ is the total number of word tokens in the corpus.
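The information-content computation can be illustrated with the sketch below. The taxonomy and the corpus counts are hypothetical, and, as a simplification, counts are attached directly to synsets rather than to the individual words they contain.

```python
import math

# Hypothetical toy taxonomy (child -> hypernym) and corpus counts per synset.
HYPERNYM = {"animal": "entity", "artifact": "entity",
            "dog": "animal", "cat": "animal", "car": "artifact"}
COUNT = {"entity": 0, "animal": 10, "artifact": 10,
         "dog": 30, "cat": 30, "car": 20}
N = sum(COUNT.values())  # total number of tokens in the toy corpus

def subtree_count(synset):
    """Occurrences of the synset plus those of all its hyponyms."""
    below = sum(subtree_count(c) for c, p in HYPERNYM.items() if p == synset)
    return COUNT[synset] + below

def information_content(synset):
    """Negative log of the probability of encountering the synset."""
    return -math.log(subtree_count(synset) / N)

def ancestors(synset):
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def lcs(s1, s2):
    """Deepest ancestor shared by s1 and s2."""
    shared = set(ancestors(s2))
    return next(a for a in ancestors(s1) if a in shared)

def resnik(s1, s2):
    """Information content of the Least Common Subsumer."""
    return information_content(lcs(s1, s2))
```

Because the root subsumes every token, its information content is zero, so any pair whose only common subsumer is the root receives a similarity of zero.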

Lin.
A refined version of Resnik's measure was put forward by Lin (1998), which exploits the information content not only of the commonalities, but also of the two senses individually. Formally:

$$sim(s_1, s_2) = \frac{2 \cdot IC(LCS(s_1, s_2))}{IC(s_1) + IC(s_2)}$$

Jiang and Conrath.
A variant of Lin's measure that has been widely used in the literature is the following (Jiang and Conrath 1997):

$$sim(s_1, s_2) = \frac{1}{IC(s_1) + IC(s_2) - 2 \cdot IC(LCS(s_1, s_2))}$$

Extended gloss overlaps or Extended Lesk.
All of the above approaches hinge on taxonomic information, which however is only a portion of the information that is provided in LKBs such as WordNet. Other kinds of relations can indeed be used, such as meronymy and pertainymy (cf. Section 3.1). To do this, Banerjee and Pedersen (2003) proposed an improvement of the Lesk algorithm (Lesk 1986), which has been used historically in WSD for determining the overlap between the textual definitions of two senses under comparison. The measure designed by Banerjee and Pedersen (2003) extends this idea by considering the overlap between definitions not only of the target senses, but also of their neighbouring synsets in the WordNet graph:

$$sim(s_1, s_2) = \sum_{r, q \in R} overlap(gloss(r(s_1)), gloss(q(s_2)))$$

where $R$ is the set of lexical-semantic relations in WordNet, $gloss$ is a function that provides the textual definition for a given synset (or sense), $overlap$ determines the number of common words between two definitions and $r(s)$ provides the set of the other endpoints of the relation edges of type $r$ connecting $s$.
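The extended gloss overlap can be sketched as a sum of word overlaps over all pairs of relations. The glosses and relations below are invented for illustration, the `self` relation is an assumption introduced here so that the glosses of the target synsets themselves are included, and overlap is simplified to the number of distinct shared words (no stopword filtering or phrase weighting).

```python
from itertools import product

# Hypothetical glosses and relations, invented for illustration only.
GLOSS = {
    "car": "a motor vehicle with four wheels",
    "truck": "a motor vehicle for transporting loads",
    "vehicle": "a conveyance that transports people or objects",
    "wheel": "a circular frame that turns on an axle",
}
RELATED = {  # RELATED[relation][synset] -> set of neighbouring synsets
    "self": {s: {s} for s in GLOSS},
    "hypernym": {"car": {"vehicle"}, "truck": {"vehicle"}},
    "meronym": {"car": {"wheel"}, "truck": {"wheel"}},
}

def gloss(synsets):
    """Distinct words in the glosses of all given synsets."""
    return {word for s in synsets for word in GLOSS[s].split()}

def overlap(words1, words2):
    """Number of distinct words shared by two glosses."""
    return len(words1 & words2)

def extended_gloss_overlap(s1, s2):
    """Sum of gloss overlaps over all pairs of relations (r, q)."""
    total = 0
    for r, q in product(RELATED, repeat=2):
        g1 = gloss(RELATED[r].get(s1, set()))
        g2 = gloss(RELATED[q].get(s2, set()))
        total += overlap(g1, g2)
    return total
```

The extended score is never lower than the plain overlap between the two target glosses, since that overlap is one of the summed terms.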

Wikipedia-based semantic relatedness.
One of the advantages of using Wikipedia as opposed to WordNet is the former's network of interlinked articles. A key intuition is that two articles are deemed similar if they are linked by a similar set of pages (Milne and Witten 2008). Such similarity can be computed with the following formula:

$$sim(a, b) = 1 - \frac{\log(\max(|in(a)|, |in(b)|)) - \log(|in(a) \cap in(b)|)}{\log(|W|) - \log(\min(|in(a)|, |in(b)|))}$$

where $a$ and $b$ are two Wikipedia articles, $in(a)$ is the set of articles linking to $a$, and $W$ is the full set of Wikipedia articles. This measure aims at determining the degree of relatedness between two articles. Nonetheless, when the two articles are close enough (i.e., the value gets close to 1), the computed value can be considered a degree of similarity.
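A link-based score of this kind can be sketched as follows. The exact form used here, one minus a normalized distance between the two sets of incoming links, is an assumption chosen so that values close to 1 indicate closeness. The article identifiers and the collection size are hypothetical, and the collection is assumed to be much larger than either link set.

```python
import math

def link_similarity(in_a, in_b, total_articles):
    """One minus a normalized distance between two sets of incoming links."""
    common = in_a & in_b
    if not common:
        return 0.0  # no shared incoming links: treated as unrelated
    larger = max(len(in_a), len(in_b))
    smaller = min(len(in_a), len(in_b))
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_articles) - math.log(smaller))
    return max(0.0, 1.0 - distance)

# Hypothetical incoming-link sets (article ids) in a collection of 1000 articles.
links_a = {1, 2, 3, 4}
links_b = {3, 4, 5, 6}
score = link_similarity(links_a, links_b, 1000)
```

Two articles with identical incoming-link sets obtain a score of 1, and the score decreases as the shared portion of the link sets shrinks.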

Recent developments
More recent knowledge-based approaches extract vector-based representations of meaning, which are then used to determine semantic similarity. Unlike previous techniques where the main form of linguistic knowledge representation was the LKB itself, in this case a second form of linguistic knowledge representation is involved, namely, a vector encoding. Accordingly, word and sense similarity is computed in two steps:
• a vector-based word and sense representation is obtained by exploiting the structural information of an LKB.
• the obtained vector representations are compared by applying a similarity measure.
In what follows we overview approaches to the first step, while deferring an introduction to similarity measures to Section 6.

Personalized PageRank-based representations.
A key idea introduced in the scientific literature is the exploitation of Markov chains and random walks to determine the importance of nodes in the graph, and this was popularized with the PageRank algorithm (Page et al. 1998). In order to obtain probability distributions specific to a node, that is, a concept of interest, topic-sensitive or Personalized PageRank (PPR) (Haveliwala 2002) is employed for the calculation of a semantic signature for each WordNet synset (Pilehvar, Jurgens and Navigli 2013).
Given the WordNet adjacency matrix $M$ (possibly enriched with further edges, for example, from disambiguated WordNet glosses), the following formula is computed:

$$v_{t+1} = (1 - \alpha)\, v_0 + \alpha\, M\, v_t$$

where $v_t$ is the probability distribution over nodes at step $t$, $v_0$ denotes the probability distribution for restart of the random walker in the network and $\alpha$ is the so-called damping factor (typically set to 0.85). The result of the computation of the above PageRank formula in the topic-sensitive setting (i.e., when $v_0$ is highly skewed) provides a distribution with most of the probability mass concentrated on the nodes which are within easy reach of the nodes initialized for restart in $v_0$. Depending on how $v_0$ is initialized, an explicit semantic representation for a target word or sense can be obtained. For the target word $w$, it is sufficient to initialize the components of $v_0$ corresponding to the senses of $w$ to $1/|Senses(w)|$ (i.e., uniformly distributed across the synsets of $w$ in WordNet), and 0 for all other synsets. For computing a representation of a target sense $s$ of a word $w$, $v_0$ is, instead, initialized to 1 on the corresponding synset, and 0 otherwise.

An alternative approach has been proposed (Hughes and Ramage 2007) which interconnects not only synsets, but also words and POS-tagged words. Some variants also link synsets to the words in their definitions, and use sense-occurrence frequencies to weight edges. However, this approach is surpassed in performance by purely synset-based semantic signatures when using a suitable similarity measure (Pilehvar et al. 2013).
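The Personalized PageRank iteration can be sketched with a simple power iteration. The four-node graph below is a hypothetical stand-in for an LKB graph, the transition matrix is assumed to spread each node's mass uniformly over its neighbours, and a fixed number of iterations is used instead of a convergence test.

```python
def personalized_pagerank(graph, restart, alpha=0.85, iterations=50):
    """Power iteration with restart.

    graph:   node -> list of neighbours (every node needs at least one)
    restart: node -> restart probability (missing nodes default to 0)
    """
    nodes = list(graph)
    v0 = {n: restart.get(n, 0.0) for n in nodes}
    v = dict(v0)
    for _ in range(iterations):
        # Restart component of the update.
        nxt = {n: (1 - alpha) * v0[n] for n in nodes}
        # Random-walk component: each node shares its mass with its neighbours.
        for n in nodes:
            share = alpha * v[n] / len(graph[n])
            for m in graph[n]:
                nxt[m] += share
        v = nxt
    return v

# Hypothetical toy graph: a chain a - b - c - d of synsets.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

# Sense-level signature: all restart mass on the target synset "a".
signature = personalized_pagerank(GRAPH, {"a": 1.0})
```

For a word-level signature, the restart dictionary would instead assign a uniform probability to each synset of the word, for example `{"a": 0.5, "d": 0.5}` for a word with two senses.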

Distributional word and sense similarity
Knowledge-based approaches can only be implemented if a lexical-semantic resource such as WordNet or BabelNet is available. A radically different approach which does not rely on structured knowledge bases exploits the statistical distribution of words occurring in corpora. The fundamental assumption behind distributional approaches is that the semantic properties of a given word w can be inferred from the contexts in which w appears. That is, the semantics of w is determined by all the other words which co-occur with it (Harris 1954; Firth 1957).

Corpora
Distributional approaches rely heavily on corpora, that is, large collections of raw textual data which can be leveraged effectively for computing semantic similarity. In fact, large-scale corpora reflect the behaviour of words in context, that is, they reveal a wide range of relationships between words, making them a particularly suitable resource from which to learn word distributions. These are then used to infer semantic properties and determine the extent of semantic similarity between two words. The most widely employed corpora for word and sense similarity are:
• Wikipedia, one of the largest multilingual corpora employed in several NLP tasks.
• UMBC (Han and Finin 2013), a Web corpus including more than three billion English words derived from the Stanford WebBase project.
• ukWaC (Ferraresi et al. 2008), a 2-billion word corpus constructed using the .uk domain and medium-frequency words from the British National Corpus.
• GigaWord (Graff et al. 2003), a large corpus of newswire text that has been acquired over several years by the Linguistic Data Consortium (LDC).

Distributional representations and measures
In the distributional approach, a vector representation typically encodes the behavioural use of specific words and/or senses. Two types of distributional representation can be distinguished:
• Explicit representation, which refers to a form of representation in which every dimension can be interpreted directly (e.g., when words or senses are used as the meanings of the vector's dimensions).
• Implicit or latent representation, which encodes the linguistic information in a form which cannot be interpreted directly.

In the case of an explicit representation vector, given a word w and a vocabulary of size N, a feature vector specifies whether each vocabulary entry, that is, each word w′, occurs in the neighbourhood of w. The size of a feature vector can range from the entire vocabulary size, that is, N, to two dimensions referring to the words preceding and succeeding the target word w. In many cases, the most frequent words, such as articles, are not included in feature vectors as they do not contain useful semantic information regarding a particular word. Given the feature vector of the target word w, its dimensions can be:
• binary values, that is, 0 or 1 depending on whether a specific word co-occurs with the target word or not.
• association and probabilistic measures which provide the score or probability that a specific word co-occurs with the target word.

A typical example of a binary-valued explicit vector representation is the so-called one-hot encoding of a word w, which is a unit vector (0, 0, . . . , 0, 1, 0, . . . , 0), where only the dimension corresponding to word w is valued with 1. Latent representations, such as embeddings, instead, encode features which are not human-readable and are not directly associated with linguistic items.
In the rest of this section we introduce the two types of representations.

Explicit representations
Early distributional approaches aimed at capturing semantic properties directly between words depending on their distributions. To this end, different measures were proposed in the literature.

Sørensen-Dice index. Also known as Dice's coefficient, the Sørensen-Dice index is used to measure the similarity of two words. Formally:

$$Dice(w_1, w_2) = \frac{2\, w_{1,2}}{w_1 + w_2}$$

where $w_i$ is the number of occurrences of the corresponding word and $w_{1,2}$ is the number of co-occurrences of $w_1$ and $w_2$ in the same context (e.g., sentence).

Jaccard Index. Also known as Jaccard similarity coefficient, the Jaccard Index (JI) is defined as follows:

$$JI(w_1, w_2) = \frac{w_{1,2}}{w_1 + w_2 - w_{1,2}}$$

which has clear similarities to the Sørensen-Dice index defined above. This measure was used for detecting word and sense similarity by Grefenstette (1994).

Pointwise Mutual Information. Given two words, the Pointwise Mutual Information (PMI) quantifies the discrepancy between their joint distribution and their individual distributions, assuming independence:

$$PMI(w_1, w_2) = \log \frac{c(w_1, w_2) \cdot D}{c(w_1) \cdot c(w_2)}$$

where $w_1$ and $w_2$ are two words, $c(w_i)$ is the count of $w_i$, $c(w_1, w_2)$ is the number of times the two words co-occur in a context and $D$ is the number of contexts considered. This measure was introduced into NLP by Church and Hanks (1990).

Positive PMI. Because many word pairs are never observed in a corpus, and therefore have their PMI equal to log 0 = −∞, a frequently used version of PMI is one in which all negative values are flattened to zero:

$$PPMI(w_1, w_2) = \max(PMI(w_1, w_2), 0)$$

PPMI is among the most popular distributional similarity measures in the NLP literature.

The above association measures can be used to populate an explicit representation vector in which each component quantifies the association strength between the word represented by the vector and the word identified by the component.
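The association measures above can be sketched from simple co-occurrence counts. The mini-corpus is hypothetical, each sentence is treated as one context, and, as an assumption of this sketch, a word is counted at most once per context.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; each sentence is treated as one context.
CONTEXTS = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate cheese",
    "the dog barked",
]

word_count = Counter()  # number of contexts containing each word
pair_count = Counter()  # number of contexts containing both words of a pair
for sentence in CONTEXTS:
    words = sorted(set(sentence.split()))
    word_count.update(words)
    pair_count.update(combinations(words, 2))

D = len(CONTEXTS)  # number of contexts considered

def cooc(w1, w2):
    """Co-occurrence count, independent of argument order."""
    return pair_count[tuple(sorted((w1, w2)))]

def dice(w1, w2):
    return 2 * cooc(w1, w2) / (word_count[w1] + word_count[w2])

def jaccard(w1, w2):
    return cooc(w1, w2) / (word_count[w1] + word_count[w2] - cooc(w1, w2))

def pmi(w1, w2):
    joint = cooc(w1, w2)
    if joint == 0:
        return float("-inf")  # unseen pair: log 0
    return math.log(joint * D / (word_count[w1] * word_count[w2]))

def ppmi(w1, w2):
    """PMI with negative values (including -inf) flattened to zero."""
    return max(pmi(w1, w2), 0.0)
```

In this toy corpus, cat and chased always appear together, so both their Dice and Jaccard scores are maximal, whereas dog and cheese never co-occur and their PPMI is zero.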
We now overview an approach that uses concepts as a word's vector components.

Explicit Semantic Analysis. An effective method, called Explicit Semantic Analysis (ESA), encodes semantic information in the word vector's components starting from Wikipedia (Gabrilovich and Markovitch 2007). The dimensionality of the vector space is given by the set of Wikipedia pages and the vector $v_w$ for a given word $w$ is computed by setting its i-th component $v_{w,i}$ to the TF-IDF of $w$ in the i-th Wikipedia page $p_i$. Formally:

$$v_{w,i} = tf(w, p_i) \cdot \log \frac{N}{N_w}$$

where $p_i$ is the i-th page in Wikipedia, $tf(w, p_i)$ is the frequency of $w$ in page $p_i$, $N$ is the total number of Wikipedia pages and $N_w$ is the number of Wikipedia pages in which $w$ occurs. It has been shown that using Wiktionary instead of Wikipedia leads to better results in semantic similarity and relatedness (Zesch et al. 2008).
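A TF-IDF concept vector of this kind can be sketched as follows. The three "pages" are hypothetical stand-ins for Wikipedia articles, the term frequency is taken to be the raw count of the word in the page, and the inverse document frequency is taken to be the logarithm of the number of pages divided by the number of pages containing the word (both are assumptions of this sketch).

```python
import math

# Hypothetical stand-ins for Wikipedia pages: title -> tokenized-ready text.
PAGES = {
    "Cat": "cat cat feline pet mouse",
    "Dog": "dog canine pet",
    "Computer mouse": "mouse device computer",
}
TITLES = sorted(PAGES)  # fixes the order of the vector's dimensions
TOKENS = {title: PAGES[title].split() for title in TITLES}

def esa_vector(word):
    """One TF-IDF component per page, in the order given by TITLES."""
    pages_with_word = sum(1 for t in TITLES if word in TOKENS[t])
    if pages_with_word == 0:
        return [0.0] * len(TITLES)
    idf = math.log(len(TITLES) / pages_with_word)
    return [TOKENS[t].count(word) * idf for t in TITLES]
```

Each dimension is directly interpretable as a concept (a page), which is what makes the representation explicit: here the vector for mouse has non-zero weight on both the Cat and the Computer mouse pages.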

Implicit or latent representations
Latent Semantic Analysis. Latent Semantic Analysis (LSA) (Deerwester et al. 1990) is a technique used for inferring semantic properties of linguistic items starting from a corpus. A term-passage matrix is created whose rows correspond to words and whose columns correspond to passages in the corpus where words occur. At the core of LSA lies the singular value decomposition of the term-passage matrix, which decomposes it into the product of three matrices. The dimensionality of the decomposed matrices is then reduced. As a result, latent representations of terms and passages are produced and comparisons between terms can be performed by just considering the rows of the lower-rank term-latent dimension matrix.

Word embeddings.
In the last few years, LSA has been superseded by neural approaches aimed at learning latent representations of words, called word embeddings. Different embedding techniques have been developed and refined. The earliest approaches representing words by means of continuous vectors date back to the late 1980s (Rumelhart, Hinton, and Williams 1988). A well-known technique which aims at acquiring distributed real-valued vectors as a result of learning a language model was put forward by Bengio et al. (2003). Collobert et al. (2011) proposed a unified neural network architecture for various NLP tasks. More recently, Mikolov et al. (2013) proposed a simple technique which speeds up the learning and has proven to be very effective.
Word2vec. Undoubtedly, the most popular yet simple approach to learning word embeddings is called word2vec (Mikolov et al. 2013). g As in all distributional methods, the assumption behind word2vec is that the meaning of a word can be inferred effectively using the context in which that word occurs. Word2vec is based on a two-layer neural network which takes a corpus as input and learns vector-based representations for each word. Two variants of word2vec have been put forward: (1) continuous bag of words (CBOW), which exploits the context to predict a target word; (2) skip-gram, which, instead, uses a word to predict a target context word.
Focusing on the skip-gram approach, for each given target word $w_t$, the objective function of the neural network is set to maximize the conditional probabilities of the words surrounding $w_t$ in a window of $m$ words. Formally, the following log likelihood is calculated:

$$J(\theta) = \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t; \theta)$$

where $T$ is the number of words in the training corpus and $\theta$ are the embedding parameters. Word2vec can be viewed as a close approximation of traditional window-based distributional approaches.
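The skip-gram objective can be illustrated with the sketch below, which enumerates (target, context) pairs in a window of m words and sums their log probabilities. The conditional probability is assumed here to be a full softmax over dot products of input and output vectors; practical word2vec training relies on approximations of it, which are not shown. The token list and the vectors are hypothetical.

```python
import math

def skipgram_pairs(tokens, m):
    """Yield (target, context) pairs within a window of m words on each side."""
    for t, target in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield target, tokens[t + j]

def softmax_prob(context, target, in_vecs, out_vecs):
    """p(context | target) as a softmax over dot products (assumed form)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = {w: math.exp(dot(out_vecs[w], in_vecs[target])) for w in out_vecs}
    return scores[context] / sum(scores.values())

def log_likelihood(tokens, m, in_vecs, out_vecs):
    """Sum of log p(context | target) over all training pairs."""
    return sum(math.log(softmax_prob(c, w, in_vecs, out_vecs))
               for w, c in skipgram_pairs(tokens, m))

# Hypothetical three-word corpus with all-zero two-dimensional vectors.
TOKENS = ["the", "cat", "sat"]
ZERO = {w: [0.0, 0.0] for w in TOKENS}
baseline = log_likelihood(TOKENS, 1, ZERO, ZERO)
```

With all-zero vectors every context word is equally likely, so the value obtained is the uniform baseline that training would then try to improve upon by adjusting the vectors.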
A crucial feature of word2vec consists in preserving relationships between vectors, such as analogy. For instance, London − UK + Italy should be very close to Rome. Standard word2vec embeddings are available in English, obtained from the Google News dataset. Embeddings for dozens of languages can also be derived from Wikipedia or other corpora.
g https://code.google.com/archive/p/word2vec/
fastText. Recently, an extension of word2vec's skip-gram model called fastText has been proposed which integrates subword information (Joulin et al. 2017). A key difference between fastText and word2vec is that the former is capable of building vectors for misspellings or out-of-vocabulary words. This is achieved by encoding words as a bag of character n-grams (i.e., substrings of length n) together with the word itself. For instance, to encode the word table, the following 3-grams are considered: { <ta, tab, abl, ble, le> } ∪ { <table> }. Thanks to this, compared to the standard skip-gram model, the input vocabulary includes the word and all the n-grams that can be calculated from it. As a result, the meanings of prefixes and suffixes are also considered in the final representation, therefore reducing data sparsity. fastText provides ready sets of embeddings for around 300 languages, which makes it appealing for multilingual processing of text.

GloVe. A key difference between LSA and word2vec is that the latter produces latent representations with the useful property of preserving analogies, therefore indicating appealing linear substructures of the word vector space, whereas the former takes better advantage of the overall statistical information present in the input documents. However, the advantage of either approach is the drawback of the other. GloVe (Global Vectors) addresses this issue by performing unsupervised learning of latent word vector representations starting from global word-word co-occurrence information (Pennington, Socher, and Manning 2014). At the core of the approach is a counting method which calculates a co-occurrence vector $count_i$ for a word $w_i$ with its j-th component counting the times $w_j$ co-occurs with $w_i$ within a context window of a certain size, where each individual count is weighted by $1/d$, and $d$ is the distance between the two words in the context under consideration.
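The distance-weighted counting step can be sketched as follows. The token sequence is hypothetical, and the sketch assumes a symmetric window, so that each co-occurrence contributes 1/d to the counts of both words involved.

```python
from collections import defaultdict

def weighted_cooccurrences(tokens, window):
    """count[w_i][w_j]: sum of 1/d over co-occurrences at distance d <= window."""
    count = defaultdict(lambda: defaultdict(float))
    for i, wi in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                wj = tokens[i + d]
                # Symmetric window (assumption): update both directions.
                count[wi][wj] += 1.0 / d
                count[wj][wi] += 1.0 / d
    return count

# Hypothetical token sequence.
counts = weighted_cooccurrences("the cat sat on the mat".split(), window=2)
```

Adjacent words contribute a full count, while words two positions apart contribute one half, so nearby context words weigh more in the resulting co-occurrence vectors.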
The key constraint put forward by GloVe is that, for any two words w i and w j : where v i and v j are the embeddings to learn and b i and b j are scalar biases for the two words. A least square regression model is then calculated which aims at learning latent word vector representations such that the loss function is driven by the above soft constraint for any pair of words in the vocabulary: where f (count i, j ) is a weighting function that reduces the importance of overly frequent word pairs and V is the size of the vocabulary. While both word2vec and GloVe are popular approaches, a key difference between the two is that the former is a predictive model, whereas the latter is a count-based model. While it has been found that standard count-based models such as LSA fare worse than word2vec (Mikolov et al. 2013;Baroni, Dinu, and Kruszewski 2014), Levy and Goldberg (2014) showed that a predictive model such as word2vec's skip-gram is essentially a factorization of a variant of the PMI co-occurrence matrix of the vocabulary, which is countbased. Experimentally, there is contrasting evidence as to the superiority of word2vec over GloVe, with varying (and, in several cases, not very different) results. SensEmbed. Word embeddings conflate different meanings of a word into a single vector-based representation and are therefore unable to capture polysemy. In order to address this issue, Iacobacci et al. (2015) proposed an approach for obtaining embedded representations of word senses called sense embeddings. To this end, first, a text corpus is disambiguated with a stateof-the-art WSD system, that is, Babelfy (Moro et al. 2014 is processed with word2vec, in particular with the CBOW architecture, in order to produce embeddings for each sense of interest. AutoExtend. 
An alternative approach put forward by Rothe and Schütze (2015) takes into account the interactions and constraints between words, senses and synsets, as made available in the WordNet LKB, and, starting from arbitrary word embeddings, acquires the latent, embedded representations of senses and synsets by means of an autoencoder neural architecture, where word embeddings are at the input and output layers and the hidden layer provides the synset embeddings.
Contextual Word Embeddings. Recent approaches exploit the distribution of words to learn latent encodings which represent word occurrences in a given context. Two prominent examples of such approaches are ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018). These approaches can be employed for tasks such as question answering, textual entailment and semantic role labelling, but also in tasks such as Word in Context (WiC) similarity (Pilehvar and Camacho-Collados 2019). While such approaches could potentially be used to produce word representations based on the vectors output at their first layer, their main goal is to work on contextualized linguistic items.

Hybrid word and sense similarity
Recently, some approaches have been put forward which bring together knowledge-based and distributional similarity by combining the knowledge provided in LKBs and the occurrence information obtained from texts. The key advantage of such approaches is their ability to embed words and meanings in the same vector space model.

Salient Semantic Analysis.
A development along the lines of Explicit Semantic Analysis (cf. Section 4.2.1) has been put forward which exploits the hyperlink information in Wikipedia pages to determine the salience of concepts (Hassan and Mihalcea 2011). Specifically, given a Wikipedia page and a hyperlink, all the occurrences of its surface form (i.e., the hyperlinked text) are searched across the page and the sense annotations are propagated to those occurrences. Additional non-linked phrases in the page are tagged with a Wikification system. A semantic profile for each page is then created by building a PMI vector of the co-occurrences of each term with the concept of interest in the entire Wikipedia corpus. Since Salient Semantic Analysis relies not only on the distribution of words occurring in Wikipedia pages, but also on the usage of the Wikipedia sense inventory and the manual linking of salient concepts to Wikipedia pages, this technique can be considered both distributional and knowledge-based.
Novel Approach to a Semantically-Aware Representation of Items. As described previously, a knowledge-based approach such as PPR obtains a vector representation for each synset in a WordNet-like graph, which contains semantic information mostly of a prescriptive nature, that is, the kind of information that can be found in a dictionary. Instead, to create vectors that account for descriptive information about the concept of interest, a Novel Approach to a Semantically-Aware Representation of Items (NASARI) has been put forward by Camacho-Collados et al. (2016). This approach exploits the distributional semantics of texts which describe the concept. For this purpose, given a target concept $c = (p, S)$ identified by the pair of Wikipedia page $p$ and WordNet synset $S$ which are linked in BabelNet (cf. Section 3.1), two steps are carried out.
First, the contextual information for $c$ is collected according to the following equation:
$$T_c = L_p \cup B(R_S)$$
where $T_c$ is the contextual information for a specific concept $c = (p, S)$, $L_p$ is the set of Wikipedia pages containing the page $p$ itself and all the pages pointing to $p$, $B$ is a function which maps each WordNet synset $S$ to the corresponding Wikipedia page $p$, and $R_S$ is a set of synsets which contains $S$ and all its related, that is, connected, synsets. Second, the contextual information $T_c$ is encoded in a lexical vector representation. After tokenization, lemmatization and stopword removal, a bag of words is created from the Wikipedia pages in $T_c$. Lexical specificity is used for the identification of the most characteristic words in the bag of words extracted (Lafon 1980). As a result, a lexical vector is created which represents $c$.
In order to overcome the sparseness which can occur in such lexical vector representation, two additional versions of NASARI vectors are provided:
• semantic vectors: a synset-based representation is created whose dimensions are not potentially ambiguous words, but concepts (represented by synsets): for each word $w$ in the bag of words of concept $c$, the set of all hypernyms in common between pairs of words in the vector is considered (e.g., table, lamp and seat are grouped under the hypernym { furniture, piece of furniture, article of furniture }) and encoded in a single dimension represented by that common hypernym. Finally, lexical specificity is calculated in order to determine the most relevant hypernyms that have to be encoded in the vector representation. The weight associated with each semantic dimension is determined by computing lexical specificity on the words grouped by each hypernym as if they were a single word in the underlying texts.
• embedded vectors: an alternative representation, which is latent and compact, rather than explicit like the semantic vectors, is also provided in the NASARI framework. Starting from the lexical vector $v_{lex}(S)$ of a synset $S$, the embedded vector $e(S)$ of $S$ is computed as the weighted average of the embeddings of the words in the lexical vector. Formally:
$$e(S) = \frac{\sum_{w \in v_{lex}(S)} \mathrm{rank}(w, v_{lex}(S))^{-1} \, e(w)}{\sum_{w \in v_{lex}(S)} \mathrm{rank}(w, v_{lex}(S))^{-1}}$$
where $e(w)$ is the word embedding (e.g., a word2vec embedding) of $w$ and $\mathrm{rank}(w, v_{lex}(S))$ is the ranking of word $w$ in the sorted vector $v_{lex}(S)$.
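The computation of an embedded vector as a rank-weighted average can be sketched as follows. This is only an illustration under stated assumptions: the weight of each word is taken to be the inverse of its rank in the lexical vector, words without an embedding are skipped, and the function name `embedded_vector` and the two-dimensional toy embeddings are hypothetical.

```python
def embedded_vector(lexical_vector, word_embeddings):
    """Average the embeddings of the words in a lexical vector,
    weighting each word by the inverse of its rank (assumed weighting).

    lexical_vector:  dict mapping word -> lexical specificity weight
    word_embeddings: dict mapping word -> list of floats
    """
    # Sort words by decreasing weight so that the top word has rank 1.
    ranked = sorted(lexical_vector, key=lexical_vector.get, reverse=True)
    total, norm = None, 0.0
    for rank, word in enumerate(ranked, start=1):
        emb = word_embeddings.get(word)
        if emb is None:
            continue  # assumption: words with no embedding are ignored
        weight = 1.0 / rank
        if total is None:
            total = [0.0] * len(emb)
        total = [t + weight * x for t, x in zip(total, emb)]
        norm += weight
    if total is None:
        raise ValueError("no word of the lexical vector has an embedding")
    return [t / norm for t in total]


# Toy data, invented for illustration.
lexical = {"table": 3.0, "chair": 2.0}
embeddings = {"table": [1.0, 0.0], "chair": [0.0, 1.0]}
vector = embedded_vector(lexical, embeddings)
print(vector)
```

Because the result is a convex combination of word embeddings, it lives in the same space as the words themselves, which is what allows words and concepts to be compared directly.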
DeConf. The main idea behind de-conflated semantic representations (DeConf) (Pilehvar and Collier 2016) is to develop a method for obtaining a semantic representation which embeds word senses into the same semantic space as words, analogously to the NASARI embedded vectors. At the core of this approach lies the computation of a list of sense-biasing words for each word sense of interest, which is used to 'bend' the sense representations in the right direction. Specifically, DeConf is computed in two steps:
• identification of sense-biasing words, that is, a list of words is extracted from WordNet which most effectively represent the semantics of a given synset $S$. Such a list is obtained by, first, applying the PPR algorithm to the WordNet graph with restart on $S$ and, second, progressively adding new words from the WordNet synsets sorted in descending order by their PPR probability.
• learning a sense representation, calculated for a target sense $s$ of a word $w$ as follows:
$$v_s = \frac{\alpha \, v_w + \sum_{b \in B_s} \delta_s(b) \, v_b}{\alpha + \sum_{b \in B_s} \delta_s(b)}$$
whose numerator combines the word embedding $v_w$, weighted with a factor $\alpha$, and the embeddings $v_b$ of the words in the list $B_s$ of sense-biasing words calculated as a result of the first step, each weighted with a function $\delta_s(b)$ of its ranking in the list.
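The second DeConf step can be sketched as a weighted average of the word embedding and the embeddings of the sense-biasing words. The sketch below makes several assumptions that are not stated in the text: the sum is normalized by the total weight, the default ranking function is an exponential decay chosen purely for illustration, and the name `deconf_sense_vector` and the value of `alpha` are hypothetical.

```python
import math


def deconf_sense_vector(word_vec, biasing_vecs, alpha=1.0, delta=None):
    """Combine a word embedding with the embeddings of its sense-biasing
    words, given in ranked order (most representative word first).

    delta maps a 1-based rank to a weight; the default exponential decay
    is an illustrative assumption, not necessarily the original function.
    """
    n = len(biasing_vecs)
    if delta is None:
        delta = lambda rank: math.exp(-rank / n)
    # Start from the word embedding weighted by alpha.
    numerator = [alpha * x for x in word_vec]
    denominator = alpha
    for rank, vec in enumerate(biasing_vecs, start=1):
        weight = delta(rank)
        numerator = [a + weight * x for a, x in zip(numerator, vec)]
        denominator += weight
    # Assumption: normalize by the total weight to obtain an average.
    return [a / denominator for a in numerator]


# Toy vectors, invented for illustration.
sense = deconf_sense_vector([1.0, 0.0], [[0.0, 1.0]], alpha=1.0,
                            delta=lambda rank: 1.0)
print(sense)
```

With a larger `alpha` the sense vector stays closer to the original word embedding, whereas a smaller value lets the sense-biasing words pull it further away.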

Measures for comparing vector-based representations
In this section, we focus on the main measures which are employed widely whenever two or more vector-based representations have to be compared for determining the degree of semantic similarity. The following measures are used in knowledge-based and distributional approaches, whenever they resort to vector-based representations.

Cosine similarity.
Widely used for determining the similarity between two vectors, the cosine similarity is formalized as follows:
$$\cos(w_1, w_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}$$
where $w_1$ and $w_2$ are two vectors to be compared. The above formula determines the closeness of the two vectors by calculating the dot product between them divided by the product of their norms.
Weighted overlap. This measure (Camacho-Collados et al. 2016) compares the similarity between vectors in two steps. First, a ranking of the dimensions of each vector is calculated. Such ranking considers only dimensions with values different from 0 for both vectors, assigning higher scores to more relevant dimensions. Second, it sums the inverse of the combined rankings of the two vectors, normalized by a factor which computes the best rank pairing. The weighted overlap is formalized as follows:
$$WO(v_1, v_2) = \frac{\sum_{q \in O} \left( r_q^1 + r_q^2 \right)^{-1}}{\sum_{i=1}^{|O|} (2i)^{-1}}$$
where $O$ indicates the set of overlapping dimensions and $r_q^i$ refers to the ranking of the q-th dimension of vector $v_i$ ($i \in \{1, 2\}$). While cosine similarity is applicable to both latent and explicit vectors, Weighted Overlap is suitable only for explicit and potentially sparse vectors which have human-interpretable components, like in the knowledge-based vector representations produced with PPR or NASARI (see Sections 3.2.2 and 5).
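Both measures can be sketched in a few lines of code. The sketch rests on assumptions: dense vectors are plain lists, explicit vectors are dictionaries from dimension names to weights, rankings are computed over the overlapping dimensions only with ties broken arbitrarily, and the weighted overlap is normalized by the sum of 1/(2i) over the overlap, which is our reading of the 'best rank pairing'. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)


def weighted_overlap(v1, v2):
    """Rank-based overlap between two sparse vectors (dict: dimension -> weight)."""
    # Keep only the dimensions that are non-zero in both vectors.
    overlap = [q for q in v1 if v1[q] != 0 and v2.get(q, 0) != 0]
    if not overlap:
        return 0.0

    def ranks(vector):
        # Rank 1 goes to the dimension with the highest weight.
        ordered = sorted(overlap, key=lambda q: vector[q], reverse=True)
        return {q: r for r, q in enumerate(ordered, start=1)}

    r1, r2 = ranks(v1), ranks(v2)
    numerator = sum(1.0 / (r1[q] + r2[q]) for q in overlap)
    # Best case: every dimension has the same rank i in both vectors.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return numerator / denominator


same_order = weighted_overlap({"a": 3, "b": 2, "c": 1}, {"a": 5, "b": 4, "d": 1})
swapped = weighted_overlap({"a": 3, "b": 2}, {"a": 1, "b": 4})
print(cosine([1.0, 0.0], [0.0, 1.0]), same_order, swapped)
```

Note that the weighted overlap depends only on the order of the dimensions, not on the magnitude of their weights, which is why it suits explicit vectors whose weights are not directly comparable.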

Evaluation
We now describe how to evaluate and compare measures for word and sense similarity. We distinguish between:
• in vitro or intrinsic evaluation, that is, by means of measures that assess the quality of the similarity compared to human judgments (Section 7.1), and
• in vivo or extrinsic evaluation, that is, where the quality of a similarity approach is evaluated by measuring the impact on the performance of an application when integrating such approach therein (Section 7.2).
In vitro evaluation may suffer from several issues, such as the subjectivity of the annotation and the representativeness of the dataset. In contrast, in vivo evaluation is ideal, in that it shows a clear effectiveness on a separate application.

In vitro evaluation
We introduce the key measures used in the literature in Section 7.1.1 and overview several manually annotated datasets to which the measures are applied in Section 7.1.2.

Measures
The quality of a semantic similarity measure, be it knowledge-based or distributional, is estimated by computing a correlation coefficient between the similarity results obtained with the measure and those indicated by human annotators. Different measures can be employed for determining the correlation between variables. The two most common correlation measures are Pearson's coefficient and Spearman's rho coefficient.
Pearson correlation coefficient. The Pearson correlation coefficient, also called Pearson product-moment correlation coefficient, is a popular measure employed for computing the degree of correlation between two variables $X$ and $Y$:
$$r_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where $\mathrm{cov}(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ is the standard deviation of $X$ (analogously for $Y$). For a dataset of $n$ word pairs for which the similarity has to be calculated, the following formula is computed:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where the dataset is made up of $n$ word pairs $\{(w_i, w'_i)\}_{i=1}^{n}$, $x_i$ is the similarity between $w_i$ and $w'_i$ computed with the similarity measure under evaluation, $y_i$ is the similarity provided by the human annotators for the same word pair and $\bar{x}$ is the mean of all values $x_i$ (analogously for $\bar{y}$).
Spearman's rank correlation. In determining the effectiveness of word similarity, the Pearson correlation coefficient has sometimes been criticized because it determines how well the similarity measure fits the values provided in the gold-standard, humanly produced datasets. However, it is suggested that for similarity it might be more important to determine how well the ranks of the similarity values correlate, which makes the measure non-parametric, that is, independent of the underlying distribution of the data. This can be calculated with Spearman's rank correlation, which is Pearson's coefficient applied to the ranked variables $\mathrm{rank}_X$ and $\mathrm{rank}_Y$ of $X$ and $Y$.
Given a dataset of $n$ word pairs as above, the following formula can be computed, which coincides with Pearson's coefficient on the ranks when there are no ties:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} \left( \mathrm{rank}(x_i) - \mathrm{rank}(y_i) \right)^2}{n (n^2 - 1)}$$
where $\mathrm{rank}(x_i)$ is the rank value of the i-th item according to the similarity measure under evaluation and $\mathrm{rank}(y_i)$ is the rank value of the same item according to the overall ranking of similarity scores provided by human annotators in the evaluation dataset.
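The two coefficients can be computed with the following minimal sketch, which follows the definition of Spearman's correlation as Pearson's coefficient applied to ranks. The function names, the treatment of ties (tied values share their mean rank) and the toy scores are assumptions made for illustration; in practice a statistics library would normally be used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson's coefficient applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy scores for four word pairs, invented for illustration.
system_scores = [0.9, 0.1, 0.5, 0.7]
human_scores = [3.8, 0.4, 1.0, 3.0]
print(pearson(system_scores, human_scores))
print(spearman(system_scores, human_scores))
```

In this toy example the system orders the four pairs exactly as the annotators do, so the rank correlation is perfect even though the raw scores are not linearly related, which illustrates the difference between the two coefficients.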

Datasets
Several datasets have been created as evaluation benchmarks for semantic similarity. Here we overview the most popular of these datasets.

Rubenstein & Goodenough (RG-65) and its translations.
A dataset made up of 65 pairs of nouns selected to cover several types of semantic similarities was created by Rubenstein and Goodenough (1965). Annotators were asked to assign each pair a value between 0.0 and 4.0, where the higher the score, the higher the similarity. Due to the paucity of datasets in languages other than English, this dataset has been entirely or partially translated into various languages obtaining similar scores, including German (Gurevych 2005), French (Joubarne and Inkpen 2011), Spanish (Camacho-Collados, Pilehvar, and, Portuguese (Granada, Trojahn, and Vieira 2014) and many other languages (Bengio et al. 2018).

Miller & Charles (MC-30).
From these 65 word pairs, Miller and Charles (1991) selected a subset of 30 pairs.
SimLex-999. One of the largest resources providing word similarity scores was produced by Hill (2015). This dataset distinguishes clearly between similarity and relatedness by assigning lower scores to items which are related but not similar. Furthermore, it contains a large and differentiated set of adjectives, nouns and verbs, thus enabling a fine-grained evaluation of the performance.
Cross-lingual datasets. Camacho-Collados et al. (2015) addressed the issue of comparing words across languages by providing fifteen cross-lingual datasets which contain items for any pair of the English, French, German, Spanish, Portuguese and Farsi languages. More data aimed at multilingual and cross-lingual similarity were made available at SemEval-2017 (Camacho-Collados, Pilehvar, and Navigli 2017).

In vivo evaluation
An alternative way of evaluating and comparing semantic similarity measures is by integrating them into an end-to-end application and then measuring the performance change (hopefully, the improvement) of the latter compared to the baseline performance. Word and sense similarity are, indeed, intermediate tasks that lend themselves to integration into an application. Among the most popular applications we cite:
(1) Information retrieval: word similarity has been applied historically to Information Retrieval (IR) since the development of the SMART system (Salton and Lesk 1968). More recent work performs IR using ESA (Gabrilovich and Markovitch 2007), or employs similarity in geographic IR (Janowicz, Raubal, and Kuhn 2011), in semantically enhanced IR (Hliaoutakis et al. 2006) and domain-oriented IR (Ye et al. 2016).
(2) Text classification: word similarity has also been used for classification since the early days (Rocchio 1971). More recently, word embeddings have been used to compute the similarity between words in the text classification task (Liu et al. 2018). Topical word, that is, context-based, representations (Liu et al. 2015) and bag-of-embeddings representations (Jin et al. 2016) have also been proposed which achieve performance improvement in text classification. NASARI embeddings have been used to create rich representations of documents and perform an improved classification of text (Sinoara et al. 2019).
(3) Word sense disambiguation: in order to choose the right sense of a given word, the similarity between sense vector representations, such as those available in NASARI, and the other words in the context has been computed (Camacho-Collados et al. 2016). Word similarity has been employed also in the context of word sense induction, that is, the task of automatically determining the senses of words (Schütze 1998), with the creation of the so-called multi-prototype embeddings (Tian et al. 2014; Liu et al. 2015; Pelevina et al. 2016).
(4) Text summarization: word similarity has been used to determine the correlation between summary pairs (Lin and Hovy 2003).
(5) Machine translation: word similarity has been proposed as a tool to address the well-known issues with the n-gram overlap-based evaluation performed with the BLEU score (Banerjee and Lavie 2005; Castillo and Estrella 2012).
(6) Synonym identification: word and sense similarity have been used in the literature (Pilehvar et al. 2013) to determine the best synonym in the TOEFL synonymy recognition task (Landauer and Dumais 1997).
(7) Coreference resolution: word similarity has been used for anaphora resolution (Gasperin and Vieira 2004); in a more complex coreference resolution system, measures of semantic relatedness have been used as features for classifying referring expressions (Strube and Ponzetto 2006).
(8) Malapropism detection: semantic similarity has been employed to detect and correct malapropisms (Budanitsky and Hirst 2006), that is, real-word spelling errors.
(9) Dictionary linking: a key effort in electronic lexicography and related fields concerns linking dictionary entries. This can be done by determining the similarity between senses in different entries of two dictionaries (Pilehvar, Jurgens, and Navigli 2014). In like manner, sense clustering can be performed to reduce the dictionary granularity based on semantic similarity between concepts (Navigli 2006).
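As a small illustration of how a similarity measure plugs into one of the applications listed above, synonym identification can be sketched as choosing the candidate whose vector is most similar to that of the target word. The function names and the three-dimensional toy vectors are invented for illustration; a real system would use trained word or sense representations and any of the measures discussed in Section 6.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_synonym(target, candidates, embeddings):
    """Return the candidate with the highest similarity to the target."""
    return max(candidates,
               key=lambda c: cosine(embeddings[target], embeddings[c]))


# Toy vectors, invented for illustration only.
toy_embeddings = {
    "enormous": [0.9, 0.1, 0.0],
    "big": [0.8, 0.2, 0.1],
    "tiny": [-0.7, 0.1, 0.2],
    "green": [0.0, 0.9, 0.3],
}
answer = best_synonym("enormous", ["tiny", "green", "big"], toy_embeddings)
print(answer)
```

In an in vivo evaluation, the quantity of interest would be the accuracy of such choices over a whole question set, compared across the similarity measures under study.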

Conclusions and future directions
Similarity is at the core of NLP, in that all kinds of linguistic items need to be compared to perform tasks at all levels. In this article we have provided an overview of semantic similarity at the word and sense level. We have introduced the two main approaches to similarity, that is, the knowledge-based approach and the distributional approach, including recent developments based on neural networks, and we have described the various evaluation settings, including the most popular datasets in the field. The reader might be wondering when and why sense representations should be preferred over word representations. There are several points to consider when facing this choice, which are the following:
(1) sense representations are preferable in any case if they improve the performance in word similarity tasks (cf. Section 7).
(2) sense representations provide linkage to existing lexical knowledge resources such as WordNet and BabelNet, in some cases performing disambiguation implicitly (Pilehvar et al. 2013; Camacho-Collados et al. 2016).
(3) meaning representations often enable multilingual or cross-lingual similarity without the need either to retrain the word embeddings or to use bilingual/multilingual lexical embeddings, which work well with predominant senses, but less so with infrequent meanings.
There are several promising directions in which word and sense similarity can develop. The first, and probably the most important, is multilinguality: being able to compare linguistic items in multiple languages is still understudied, even though recent developments in bilingual and multilingual embeddings are very promising (Ammar et al. 2016; Smith et al. 2017; Conneau et al. 2018). More interestingly, being able to contrast words and senses across languages, that is, by pairing items in different languages, might boost current developments in fields such as machine translation (Artetxe, Labaka, and Agirre 2018).
A second important direction is the creation of solid benchmarks for evaluating linguistic items at different levels. Currently, there are several datasets of different types. Unfortunately, not all datasets contain the same kind of information. For instance, some are geared more towards measuring the degree of similarity between pairs, others more towards relatedness, and still others mix the two kinds of correlation in the same data. It is not obvious whether a solid in vitro evaluation benchmark would be better than a single effective end-to-end application, mainly because similarity measures tend to behave differently depending on the task and the nature of the data under study.
Importantly, we have seen that similarity is in many cases achieved with two ingredients: an effective representation of the linguistic item and a measure that exploits that representation to determine similarity. As the two main approaches to similarity tend to use different kinds of information, we believe strongly that a key further research direction should be to achieve a tighter integration of knowledge with distributional semantics, that is, to enhance neural architectures with explicit knowledge. This can be achieved, on the one hand, by leveraging multilingual lexical-semantic knowledge resources such as BabelNet, and, on the other hand, by learning relationships between distributional information in multiple languages, such as multilingual embeddings which share the same semantic vector space.
A final important direction, related to the above point, concerns the adaptability of the similarity approaches to different domains, tasks and applications. This direction is underexplored and calls for more investigation in order to establish the extent to which representations and measures have to be adapted to each new need (Bollegala, Maehara, and Kawarabayashi 2015; Prathusha, Liang, and Sethares 2018; Yang, Lu, and Zheng 2019).

Further reading
For additional information on the topic, we refer the reader to other surveys in word similarity (Mihalcea, Corley, and Strapparava 2006) and relatedness (Zhang, Gentile, and Ciravegna 2013), word and sense embeddings (Camacho-Collados and Pilehvar 2018) and semantic vector space models (Turney and Pantel 2010).