
Text classification with semantically enriched word embeddings

Published online by Cambridge University Press:  06 April 2020

N. Pittaras*
Affiliation:
Institute of Informatics & Telecommunications, NCSR “Demokritos”, Patr. Gregoriou E & 27 Neapoleos Str, 15341 Agia Paraskevi, Greece Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, Ilisia, 15784 Athens, Greece
G. Giannakopoulos
Affiliation:
Institute of Informatics & Telecommunications, NCSR “Demokritos”, Patr. Gregoriou E & 27 Neapoleos Str, 15341 Agia Paraskevi, Greece
G. Papadakis
Affiliation:
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis, Ilisia, 15784 Athens, Greece
V. Karkaletsis
Affiliation:
Institute of Informatics & Telecommunications, NCSR “Demokritos”, Patr. Gregoriou E & 27 Neapoleos Str, 15341 Agia Paraskevi, Greece
*Corresponding author. E-mail: pittarasnikif@iit.demokritos.gr

Abstract

The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural networks (DNNs) with semantic information. We extract semantics for the words in the preprocessed text from the WordNet semantic graph, in the form of weighted concept terms that form a semantic frequency vector. Concepts are selected via a variety of semantic disambiguation techniques, including a basic, a part-of-speech-based, and a semantic embedding projection method. Additionally, we consider a weight propagation mechanism that exploits semantic relationships in the concept graph and conveys a spreading activation component. We enrich word2vec embeddings with the resulting semantic vector through concatenation or replacement and apply the semantically augmented word embeddings on the classification task via a DNN. Experimental results over established datasets demonstrate that our approach of semantic augmentation in the input space boosts classification performance significantly, with concatenation offering the best performance. Our approach also yields additional interesting findings regarding the behavior of term frequency-inverse document frequency (TF-IDF) normalization on semantic vectors, as well as the potential for radical dimensionality reduction with negligible performance loss.
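As a toy illustration of the TF-IDF normalization of semantic vectors mentioned above, the sketch below (not the authors' code) applies scikit-learn's TfidfTransformer to a hypothetical documents-by-concepts count matrix; the matrix values are made up for demonstration.

```python
# Illustrative sketch only: TF-IDF weighting of concept-frequency ("semantic")
# vectors with scikit-learn. The 2x3 count matrix is a made-up toy example.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

concept_counts = np.array([[3, 0, 1],    # document 1: raw counts of concepts c1..c3
                           [0, 2, 2]])   # document 2
semantic_vectors = TfidfTransformer().fit_transform(concept_counts).toarray()
print(semantic_vectors)                  # L2-normalized TF-IDF concept weights
```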

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s) 2020. Published by Cambridge University Press

Figure 1. Overview of our approach to semantically augmenting the classifier input vector.


Figure 2. Example of the basic disambiguation strategy. Given the list of candidate synsets from the NLTK WordNet API, the first item is selected.
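A minimal sketch of the basic strategy in Figure 2, using the NLTK WordNet API as the caption describes; the helper function name is our own.

```python
# Basic disambiguation: keep the first candidate synset returned by WordNet.
from nltk.corpus import wordnet as wn

def basic_disambiguation(word):
    candidates = wn.synsets(word)          # all candidate synsets for the word
    return candidates[0] if candidates else None

print(basic_disambiguation("dog"))         # Synset('dog.n.01')
```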


Figure 3. Example of the POS disambiguation strategy. Given the candidate synsets retrieved from the NLTK WordNet API, the ones that are annotated with a POS tag that does not match the respective tag of the query word are discarded. After this filtering process, the basic selection strategy is applied.
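A sketch of the POS strategy in Figure 3; the Penn-Treebank-to-WordNet tag mapping below is an illustrative assumption rather than the paper's exact implementation.

```python
# POS disambiguation: restrict candidates to the query word's POS tag, then
# apply the basic first-candidate selection.
from nltk.corpus import wordnet as wn

PTB_TO_WN = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}  # assumed mapping

def pos_disambiguation(word, ptb_tag):
    wn_pos = PTB_TO_WN.get(ptb_tag[:1])                    # e.g. "VBD" -> wn.VERB
    candidates = wn.synsets(word, pos=wn_pos) if wn_pos else wn.synsets(word)
    return candidates[0] if candidates else None

print(pos_disambiguation("play", "VB"))                    # a verb sense of "play"
```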


Figure 4. Example of the synset vector generation for context-embedding disambiguation strategy. The context of each synset is tokenized into words, with each word mapped to a vector representation via the learned embedding matrix. The synset vector is the centroid produced by averaging all context word embeddings.
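The following sketch mirrors Figure 4, assuming that the synset's WordNet definition and usage examples serve as its textual context and that `embeddings` is a word-to-vector lookup (e.g. a loaded word2vec model).

```python
# Synset vector: centroid of the embeddings of the synset's context words.
import numpy as np

def synset_vector(synset, embeddings, dim=50):
    context = synset.definition() + " " + " ".join(synset.examples())
    vectors = [embeddings[t] for t in context.lower().split() if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```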


Figure 5. Example of the disambiguation phase of the context-embedding disambiguation strategy. A candidate word is mapped to its embedding representation and compared to the list of available synset vectors. The synset with the vector representation closest to the word embedding is selected.
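A sketch of the matching step in Figure 5, assuming cosine similarity as the closeness criterion and a precomputed `synset_vectors` map (synset name to centroid) built as in the previous sketch.

```python
# Context-embedding disambiguation: pick the synset whose vector is closest
# (by cosine similarity) to the candidate word's embedding.
import numpy as np

def closest_synset(word, embeddings, synset_vectors):
    if word not in embeddings or not synset_vectors:
        return None
    w = embeddings[word]
    cosine = lambda a, b: float(np.dot(a, b) /
                                (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(synset_vectors, key=lambda name: cosine(w, synset_vectors[name]))
```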


Figure 6. Example of the spreading activation process for the input word “dog”, executed for three levels with a decay factor of 0.6. The solid line (a) denotes the semantic disambiguation phase with one of the covered strategies. Dashed lines ((b) through (d)) represent the extraction of synsets linked with a hypernymy (is-a) relation to any synset in the source list. The numeric values represent the weight associated with synsets of each level, with the final semantic vector for the input word being listed to the right.
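A sketch of the spreading-activation traversal in Figure 6: starting from the disambiguated synset, hypernyms are collected for a fixed number of levels, each level weighted by successive powers of the decay factor (0.6 in the example). Assigning the source synset a weight of 1.0 is our assumption based on the figure.

```python
# Spreading activation over WordNet hypernym (is-a) links with a decay factor.
from nltk.corpus import wordnet as wn

def spreading_activation(synset, levels=3, decay=0.6):
    weights = {synset.name(): 1.0}                 # source synset (assumed weight 1.0)
    frontier, weight = [synset], 1.0
    for _ in range(levels):
        weight *= decay                            # 0.6, 0.36, 0.216 for three levels
        frontier = [h for s in frontier for h in s.hypernyms()]
        for h in frontier:
            weights[h.name()] = max(weights.get(h.name(), 0.0), weight)
    return weights

print(spreading_activation(wn.synset("dog.n.01")))
```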


Figure 7. An example of the semantic augmentation process leading up to classification with a DNN classifier. The image depicts the case of concat fusion, that is, the concatenation of the word embedding with the semantic vector. The dashed box is repeated for each word in the document. Green/red colors denote semantic/textual information, respectively.
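A sketch of the concat fusion depicted in Figure 7: per word, the word2vec embedding and the semantic vector are concatenated into a single classifier input. The vector dimensions below are illustrative.

```python
# Concat fusion: word embedding + semantic (concept) vector -> classifier input.
import numpy as np

word_embedding = np.random.rand(50)        # e.g. 50-dimensional word2vec vector
semantic_vector = np.random.rand(100)      # weighted concept-frequency vector

fused = np.concatenate([word_embedding, semantic_vector])
print(fused.shape)                          # (150,) fed to the DNN classifier
```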


Table 1. Technical characteristics of the 20-Newsgroups and Reuters datasets. "Class samples" refers to the range of the number of instances per class, while the last three rows report mean values, with the corresponding standard deviation in parentheses. The values in the last two rows (POS and WordNet) are expressed as ratios with respect to the number of words per document


Figure 8. (a) The diagonal-omitted confusion matrix, and (b) the label-wise performance chart for our best-performing configuration over the 20-Newsgroups dataset.


Table 2. 20-Newsgroups main experimental results. Underlined values outperform the "embedding-only" baseline method, while bold values indicate the best dataset-wise performance. Values in italics represent a performance boost achieved by the spreading activation in comparison to the identical configuration without it. "N/A" stands for not applicable


Table 3. Misclassification cases for the best-performing configuration over the 20-Newsgroups dataset, where (a) the predicted label is semantically similar to the ground truth, (b) the test instance can be reasonably considered semantically ambiguous, given the label set, (c) the error is related to the existence of critical named entities, or (d) the error is linked to context misidentification. True/predicted labels refer to the instance ground truth and the erroneous prediction of our model for that test instance, respectively. The listed slash-separated text segments from each instance are indicative samples believed to have contributed to the misclassification


Table 4. 20-Newsgroups main experimental pairwise t-test results, with respect to the "embedding-only" baseline. Single- and double-starred values represent statistical significance at the 5% and 1% significance levels, respectively


Table 5. Experiments over the 20-Newsgroups dataset for a concept-wise frequency threshold of 20. Underlined values outperform the “embedding-only” baseline


Table 6. Experiments over the 20-Newsgroups dataset for a dataset-wise frequency threshold of 50. No configuration outperforms the “embedding-only” baseline


Table 7. Reuters main experimental results. Underlined values outperform the “embedding-only” baseline, while bold values indicate the best dataset-wise performance. Values in italics denote a performance boost by the spreading activation, with respect to the identical configuration without it


Figure 9. (a) The diagonal-omitted confusion matrix and (b) the label-wise performance chart for our best-performing configuration over the Reuters dataset. For better visualization, only the 26 classes with at least 20 samples are illustrated.


Table 8. Misclassification cases for the best-performing configuration over the Reuters dataset, where (a) the predicted label is semantically similar to the ground truth or (b) the test instance can be reasonably considered semantically ambiguous, given the label set. True/predicted labels refer to the instance ground truth and the erroneous prediction of our model for that test instance, respectively. The listed slash-separated text segments from each instance are indicative samples believed to have contributed to the misclassification


Table 9. Reuters main experimental pairwise t-test results, with respect to the "embedding-only" baseline. Single- and double-starred values represent statistical significance at the 5% and 1% significance levels, respectively


Table 10. Experiments over the Reuters dataset for a concept-wise frequency threshold of 20. Underlined values outperform the “embedding-only” baseline


Table 11. Experiments over the Reuters dataset for a dataset-wise frequency threshold of 50. No configuration outperforms the "embedding-only" baseline


Table 12. Dataset-wise comparison with the state of the art in terms of accuracy and macro F1 score. Underlined values outperform the “embedding-only” (50-dimensional fitted word2vec) baseline, while bold values denote column-wise maxima


Table 13. Evaluation of representative configurations on additional datasets. Underlined values outperform the “embedding-only” baseline, while bold values denote column-wise maxima