
The role of image representations in vision to language tasks

Published online by Cambridge University Press:  21 March 2018

PRANAVA MADHYASTHA
Affiliation: Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK. e-mail: p.madhyastha@sheffield.ac.uk

JOSIAH WANG
Affiliation: Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK. e-mail: j.k.wang@sheffield.ac.uk

LUCIA SPECIA
Affiliation: Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK. e-mail: l.specia@sheffield.ac.uk

Abstract

Tasks that require modeling both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches rely on image representations obtained from a deep neural network, which end-to-end neural models then use in a variety of ways to generate language. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision-to-language problems: image captioning and multimodal machine translation. Our analysis provides insights into the properties of these representations and suggests that end-to-end approaches implicitly learn a visual-semantic subspace, which they exploit to generate captions.
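As a rough illustration of the end-to-end setup the abstract refers to (not the authors' model), the sketch below conditions an LSTM caption decoder on a fixed image representation, such as pooled CNN activations from a pre-trained network. The framework, layer sizes, and names are illustrative assumptions only.

```python
# Minimal sketch (PyTorch assumed): a pooled CNN image feature vector
# initializes the hidden state of an LSTM decoder that generates a caption.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image features -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image features -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (batch, feat_dim), e.g. pooled activations from a frozen CNN
        # captions:    (batch, seq_len) token ids of the target description
        h0 = self.init_h(image_feats).unsqueeze(0)     # (1, batch, hidden_dim)
        c0 = self.init_c(image_feats).unsqueeze(0)
        emb = self.embed(captions)                     # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                        # per-step vocabulary logits

# Usage with dummy tensors (teacher forcing during training):
decoder = CaptionDecoder()
feats = torch.randn(4, 2048)
tokens = torch.randint(0, 10000, (4, 12))
logits = decoder(feats, tokens)                        # shape: (4, 12, 10000)
```

Variants of this scheme differ mainly in how the image representation enters the decoder (initial state, per-step input, or attention over spatial features); the same idea extends to multimodal machine translation by conditioning the translation decoder on the image as well.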

Type
Articles
Copyright
Copyright © Cambridge University Press 2018 

