Where to put the image in an image caption generator



When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network in one of two ways: by incorporating it directly in the RNN (conditioning the language model by ‘injecting’ image features) or by introducing it in a layer following the RNN (conditioning the language model by ‘merging’ image features). While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper, we show empirically that the choice between the two architectures has no especially detrimental effect on performance. The merge architecture does have practical advantages, however, as conditioning by merging allows the RNN’s hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN, as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.
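
To make the distinction concrete, the following is a minimal sketch of the two conditioning schemes, not the models evaluated in the paper. It assumes a plain tanh RNN, untrained random weights, one particular way of injecting (concatenating the image vector with every word vector) and merging by concatenation before the output layer. All function names, vector sizes, and the hidden sizes of 8 and 2 are hypothetical choices made only for illustration.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows (illustrative, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_step(w_in, w_rec, x, h):
    """One step of a plain tanh RNN: h' = tanh(W_in x + W_rec h)."""
    return [math.tanh(p + q) for p, q in zip(matvec(w_in, x), matvec(w_rec, h))]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def inject_next_word(image, word_vecs, hidden, vocab, rng):
    """Inject: the image vector enters the RNN together with each word."""
    w_in = make_matrix(hidden, len(word_vecs[0]) + len(image), rng)
    w_rec = make_matrix(hidden, hidden, rng)
    w_out = make_matrix(vocab, hidden, rng)
    h = [0.0] * hidden
    for w in word_vecs:
        h = rnn_step(w_in, w_rec, w + image, h)  # RNN state mixes both modalities
    return softmax(matvec(w_out, h))

def merge_next_word(image, word_vecs, hidden, vocab, rng):
    """Merge: the RNN sees only words; the image joins in the following layer."""
    w_in = make_matrix(hidden, len(word_vecs[0]), rng)
    w_rec = make_matrix(hidden, hidden, rng)
    w_out = make_matrix(vocab, hidden + len(image), rng)
    h = [0.0] * hidden
    for w in word_vecs:
        h = rnn_step(w_in, w_rec, w, h)  # purely linguistic state
    return softmax(matvec(w_out, h + image))  # multimodal integration happens here

rng = random.Random(0)
image = [0.5, -0.2, 0.1]            # stand-in for image features
prefix = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for two embedded caption words

p_inject = inject_next_word(image, prefix, hidden=8, vocab=5, rng=rng)
p_merge = merge_next_word(image, prefix, hidden=2, vocab=5, rng=rng)
print(len(p_inject), len(p_merge))
```

Both functions return a probability distribution over a toy five-word vocabulary. In the merge variant the recurrent state encodes only the caption prefix, which is the property the abstract links to a smaller hidden state vector; the specific sizes used here carry no empirical meaning.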



