
Learning semantic sentence representations from visually grounded language without lexical knowledge

Published online by Cambridge University Press:  31 July 2019

Danny Merkx*
Affiliation:
Centre for Language Studies, Radboud University, Nijmegen, The Netherlands
Stefan L. Frank*
Affiliation:
Centre for Language Studies, Radboud University, Nijmegen, The Netherlands
*Corresponding author. Emails: d.merkx@let.ru.nl; s.frank@let.ru.nl

Abstract

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence-level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep neural networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which demonstrates that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence-level semantics. Importantly, this result shows that prior knowledge of lexical-level semantics is not needed in order to model sentence-level semantics. These findings demonstrate the importance of visual information in semantics.
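
To make the training setup concrete, the sketch below shows a margin-based (hinge) ranking loss over cosine similarities of the kind commonly used for image-caption retrieval. It is a minimal illustration of the general approach, not necessarily the exact loss used in the paper; the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_hinge_loss(emb_img, emb_cap, margin=0.2):
    """Margin-based ranking loss over cosine similarities.

    emb_img, emb_cap: (batch, dim) tensors of image and caption embeddings;
    row i of each tensor corresponds to a matching image-caption pair.
    """
    # L2-normalise so the dot product equals cosine similarity
    emb_img = F.normalize(emb_img, dim=1)
    emb_cap = F.normalize(emb_cap, dim=1)
    sims = emb_img @ emb_cap.t()          # (batch, batch) similarity matrix
    pos = sims.diag().view(-1, 1)         # similarities of the matching pairs

    # Hinge cost for mismatched captions (per image) and mismatched images (per caption)
    cost_cap = (margin + sims - pos).clamp(min=0)       # retrieve caption given image
    cost_img = (margin + sims - pos.t()).clamp(min=0)   # retrieve image given caption

    # Do not penalise the matching pairs on the diagonal
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_cap = cost_cap.masked_fill(eye, 0)
    cost_img = cost_img.masked_fill(eye, 0)
    return cost_cap.sum() + cost_img.sum()
```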

Information

Type
Article
Creative Commons
Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s) 2019

Figure 1. Model architecture: The model consists of two branches, with the image encoder on the left and the caption encoder on the right. The character embeddings are denoted by et and the RNN hidden states by ht. The forward and backward RNN hidden states, each with n features, are concatenated into 2n-dimensional hidden states. An attention layer then weights the hidden states and sums over them, resulting in the caption embedding. At the top, we calculate the cosine similarity between the image and caption embeddings (emb_img and emb_cap).
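
As an illustration of the caption branch described in this caption, the following sketch implements character embeddings, a bidirectional RNN and an attention layer in PyTorch. The choice of a GRU, the single-head attention and all layer sizes are assumptions made for illustration, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionEncoder(nn.Module):
    """Sketch of the caption branch: character embeddings, a bidirectional
    RNN and an attention layer that produces a single caption embedding."""

    def __init__(self, n_chars, emb_dim=20, hidden=1024, att_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)              # e_t
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)                    # h_t (2n features)
        self.att_hidden = nn.Linear(2 * hidden, att_hidden)
        self.att_out = nn.Linear(att_hidden, 1)

    def forward(self, chars):
        # chars: (batch, seq_len) tensor of character indices
        h, _ = self.rnn(self.embed(chars))                       # (batch, seq_len, 2n)
        # Attention weights over time steps, then a weighted sum of hidden states
        alpha = F.softmax(self.att_out(torch.tanh(self.att_hidden(h))), dim=1)
        emb_cap = (alpha * h).sum(dim=1)                          # (batch, 2n)
        return F.normalize(emb_cap, dim=1)

# The image branch produces emb_img from image features; the training signal
# is the cosine similarity between emb_img and emb_cap:
# sim = F.cosine_similarity(emb_img, emb_cap)
```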


Table 1. Description of the various STS tasks and their subtasks. Some subtasks appear in multiple STS tasks, but consist of different sentence pairs drawn from the same source. The image description datasets are drawn from the PASCAL VOC-2008 dataset (Everingham, Van Gool, Williams et al. 2008) and so do not overlap with Flickr8k or MSCOCO


Figure 2. Model performance on the semantic (SICK-R, STS-B and STS12-16) and training task (image-caption retrieval) measures including the 95% confidence interval. Training task performance is measured in recall@10. The semantic performance measure is Pearson’s r. The horizontal axis shows the embedding size with ‘max’ indicating the max pooling model.


Table 2. Image-caption retrieval results on the Flickr8k and MSCOCO test sets. R@N is the percentage of items for which the correct image or caption was retrieved in the top N (higher is better). Med r is the median rank of the correct image or caption (lower is better). We also report the 95% confidence interval for the R@N scores. For MSCOCO we report the results on the full test set (5,000 items) and the average results on five folds of 1,000 image-caption pairs
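
For clarity, the sketch below shows one way to compute R@N and the median rank from image and caption embeddings. It assumes a single caption per image (the benchmark datasets pair each image with five captions, which requires a small extension) and uses illustrative function names.

```python
import numpy as np

def retrieval_scores(emb_img, emb_cap, n_values=(1, 5, 10)):
    """Compute R@N and median rank for caption retrieval given images.

    emb_img, emb_cap: (items, dim) arrays; row i of each is a matching pair.
    """
    # Cosine similarity matrix (rows: images, columns: captions)
    emb_img = emb_img / np.linalg.norm(emb_img, axis=1, keepdims=True)
    emb_cap = emb_cap / np.linalg.norm(emb_cap, axis=1, keepdims=True)
    sims = emb_img @ emb_cap.T

    # Rank of the correct caption for each image (1 = retrieved first)
    order = np.argsort(-sims, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(order))])

    recall_at_n = {n: 100 * np.mean(ranks <= n) for n in n_values}  # percentages
    return recall_at_n, np.median(ranks)
```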


Figure 3. Semantic evaluation task results: Pearson correlation coefficients with their 95% confidence intervals for the various subtasks (see Table 1). BOW is a bag-of-words approach using GloVe embeddings, and InferSent is the model reported by Conneau et al. (2017). A supplement with a table of the results shown here is included in the GitHub repository.
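
The semantic evaluation amounts to correlating model similarities with human ratings. The sketch below shows one way to do this, assuming cosine similarity between sentence embeddings and Pearson's r computed with scipy.stats.pearsonr; the function name and arguments are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def sts_correlation(emb_a, emb_b, human_scores):
    """Correlate model similarities with human similarity judgements.

    emb_a, emb_b: (pairs, dim) arrays of sentence embeddings for the two
    sentences of each STS pair; human_scores: gold similarity ratings.
    """
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    model_sims = np.sum(emb_a * emb_b, axis=1)   # cosine similarity per pair
    r, _ = pearsonr(model_sims, human_scores)
    return r
```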


Table 3. Example sentence pairs with their human-annotated similarity score taken from STS tasks


Figure 4. The training task performance (R@10) and the semantic task performance (Pearson’s r × 100) as they develop over training, with the number of epochs on a logarithmic scale. For MSCOCO (right) we show the training task performance on the 5,000 item test set.