
Learning semantic sentence representations from visually grounded language without lexical knowledge

  • Danny Merkx and Stefan L. Frank


Current approaches to learning semantic representations of sentences often rely on prior word-level knowledge. This study instead leverages visual information to capture sentence-level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep neural networks are trained to map the two modalities to a common embedding space such that, given an image, the corresponding caption can be retrieved, and vice versa. Our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings on the Semantic Textual Similarity (STS) benchmark and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, showing that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence-level semantics. Importantly, this result shows that prior knowledge of lexical-level semantics is not needed in order to model sentence-level semantics. These findings demonstrate the importance of visual information in semantics.
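The cross-modal retrieval objective described in the abstract — mapping images and captions to a shared space so that matched pairs score higher than mismatched ones — is commonly implemented as a max-margin ranking loss over cosine similarities. The sketch below illustrates that general technique in plain numpy; the function names and the margin value are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_retrieval_loss(img_emb, cap_emb, margin=0.2):
    """Max-margin ranking loss over a batch of matched image/caption pairs.

    img_emb, cap_emb: (batch, dim) arrays; row i of each is a matched pair.
    Every other row in the batch serves as a negative example.
    """
    img = l2_normalize(img_emb)
    cap = l2_normalize(cap_emb)
    sims = img @ cap.T              # (batch, batch) cosine similarity matrix
    pos = np.diag(sims)             # similarity of the matched pairs

    # Hinge terms: a mismatched pair should score at least `margin`
    # below its matched pair, in both retrieval directions.
    cost_cap = np.maximum(0.0, margin + sims - pos[:, None])  # caption given image
    cost_img = np.maximum(0.0, margin + sims - pos[None, :])  # image given caption

    mask = np.eye(sims.shape[0], dtype=bool)  # matched pairs incur no cost
    cost_cap[mask] = 0.0
    cost_img[mask] = 0.0
    return cost_cap.sum() + cost_img.sum()
```

With correctly aligned, well-separated embeddings the loss is zero; shuffling the captions relative to the images makes it positive, which is the signal that drives the two encoders toward a common space.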




