
Learning quantification from images: A structured neural architecture

  • I. Sorodoc, S. Pezzelle, A. Herbelot, M. Dimiccoli and R. Bernardi

Abstract

Major advances have recently been made in merging language and vision representations. Most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw multimodal data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is their ability to learn quantifiers, i.e. expressions like few, some and all. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in most fish are red, most encodes the proportion of fish which are red. In this paper, we study how well current neural network strategies model such relations. We propose a task where, given an image and a query expressed by an object–property pair, the system must return a quantifier expressing what proportion of the queried objects has the queried property. Our contributions are twofold. First, we show that the best performance on this task is obtained by coupling state-of-the-art attention mechanisms with a network architecture mirroring the logical structure assigned to quantifiers by classic linguistic formalisation. Second, we introduce a new balanced dataset of image scenarios associated with quantification queries, which we hope will foster further research in this area.
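To make the proportional reading concrete, the sketch below shows the kind of ground-truth function the task asks a model to approximate: given the set of queried objects in a scene and the subset that has the queried property, compute the proportion and map it to a quantifier label. The label inventory and threshold values are illustrative assumptions (the abstract only names few, some, most and all), not the bins actually used in the paper's dataset.

```python
from typing import Set

def quantify(queried: Set[str], with_property: Set[str]) -> str:
    """Map the proportion |queried ∩ with_property| / |queried| to a label."""
    if not queried:
        raise ValueError("the query denotes an empty set")
    ratio = len(queried & with_property) / len(queried)
    if ratio == 0.0:
        return "no"
    if ratio <= 0.25:   # assumed threshold, not taken from the paper
        return "few"
    if ratio < 0.75:    # assumed threshold, not taken from the paper
        return "some"
    if ratio < 1.0:
        return "most"
    return "all"

# 'most fish are red': four of the five fish in the scene are red.
fish = {"f1", "f2", "f3", "f4", "f5"}
red_things = {"f1", "f2", "f3", "f4", "r9"}
print(quantify(fish, red_things))  # -> most
```

Note that the quantifier depends only on the ratio between the two set sizes, not on the absolute counts; this is the simplification of generalized quantifiers as proportions that the abstract refers to.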
