Understanding visual scenes

  • Carina Silberer, Jasper Uijlings and Mirella Lapata
Abstract

A growing body of recent work focuses on the challenging problem of scene understanding, using a variety of cross-modal methods that fuse techniques from image and text processing. In this paper, we develop representations for the semantics of scenes by explicitly encoding the objects detected in them and their spatial relations. We represent image content via two well-known types of tree representation, namely constituents and dependencies. Our representations are created deterministically, can be applied to any image dataset irrespective of the task at hand, and are amenable to standard NLP tools developed for tree-based structures. We show that syntax-based statistical machine translation (SMT) and tree kernel methods can be applied to build models for image description generation and image-based retrieval. Experimental results on real-world images demonstrate the effectiveness of the framework.
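As a rough illustration of the idea in the abstract, detected objects and their spatial relations can be encoded as a tree, and a simple tree kernel (counting shared node productions) can score the similarity of two scenes for retrieval. The sketch below is a minimal toy version under invented object and relation labels, not the paper's actual representation or kernel.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    label: str                 # an object class or a spatial-relation label
    children: tuple = ()

def productions(node, counts):
    # A "production" is a node label together with its ordered child labels,
    # analogous to a grammar rule in a constituent tree.
    counts[(node.label, tuple(c.label for c in node.children))] += 1
    for child in node.children:
        productions(child, counts)

def tree_kernel(t1, t2):
    # Count matching production pairs across the two trees -- a crude
    # stand-in for a convolution tree kernel.
    p1, p2 = Counter(), Counter()
    productions(t1, p1)
    productions(t2, p2)
    return sum(p1[k] * p2[k] for k in p1.keys() & p2.keys())

# Toy scene trees; the labels and relations here are purely illustrative.
query = Node("scene", (Node("beside", (Node("person"), Node("bicycle"))),))
image = Node("scene", (Node("beside", (Node("person"), Node("bicycle"))),
                       Node("on", (Node("dog"), Node("grass")))))

print(tree_kernel(query, image))   # shared person-beside-bicycle subtree scores 3
```

Because the kernel only compares label structure, it is deterministic in the same sense as the representations described above: any two images with object detections and spatial relations can be compared without task-specific training.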

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110