The role of image representations in vision to language tasks

Pranava Madhyastha, Josiah Wang and Lucia Specia
Abstract

Tasks that require modeling both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are then used to generate language in a variety of ways with end-to-end neural-network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision-to-language problems: image captioning and multimodal machine translation. Our analysis provides interesting insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit this subspace to generate captions.
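
To make the setup concrete, below is a minimal sketch of the kind of end-to-end captioning model the abstract describes: a fixed image representation extracted by a CNN is projected into the hidden space of a recurrent language model, which then generates the caption token by token. This is written in PyTorch; all names, dimensions, and the initial-hidden-state conditioning strategy are our own illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class ImageConditionedDecoder(nn.Module):
    """Conditions a GRU language model on a pre-extracted image feature vector."""

    def __init__(self, feat_dim=2048, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        # Project the image representation (e.g., a pooled CNN feature) into
        # the decoder's hidden space; this learned projection is one place a
        # visual-semantic subspace can emerge implicitly during training.
        self.img_proj = nn.Linear(feat_dim, hid_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (batch, feat_dim); captions: (batch, seq_len) token ids.
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)  # (1, batch, hid_dim)
        emb = self.embed(captions)                              # (batch, seq_len, emb_dim)
        states, _ = self.gru(emb, h0)
        return self.out(states)                                 # per-step vocabulary logits

# Example: a batch of 4 images with 2048-d features and 12-token captions.
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits = ImageConditionedDecoder()(feats, caps)  # shape: (4, 12, 10000)

Initializing the decoder's hidden state is only one of several conditioning strategies studied in this line of work; the image vector can instead be concatenated with the word embeddings at every step, or attended over as in attention-based models. The projection layer is what makes probing possible, since it maps visual features into a space the language model can exploit.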
