Learning quantification from images: A structured neural architecture

Published online by Cambridge University Press: 02 April 2018

I. SORODOC
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy e-mails: ionutsorodoc@gmail.com, sandro.pezzelle@unitn.it, aurelie.herbelot@unitn.it, bernardi@disi.unitn.it
S. PEZZELLE
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy e-mails: ionutsorodoc@gmail.com, sandro.pezzelle@unitn.it, aurelie.herbelot@unitn.it, bernardi@disi.unitn.it
A. HERBELOT
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy e-mails: ionutsorodoc@gmail.com, sandro.pezzelle@unitn.it, aurelie.herbelot@unitn.it, bernardi@disi.unitn.it
M. DIMICCOLI
Affiliation:
University of Barcelona, Gran via de les Corts Catalanes 585, 08007 Barcelona, Spain; Computer Vision Center, Edificio O, Campus UAB, 08193 Bellaterra (Cerdanyola), Barcelona, Spain e-mail: mdimiccoli@cvc.uab.es
R. BERNARDI
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy; Department of Information Engineering and Computer Science (DISI), University of Trento, Via Sommarive, 9 I-38123 Povo (TN), Italy e-mails: ionutsorodoc@gmail.com, sandro.pezzelle@unitn.it, aurelie.herbelot@unitn.it, bernardi@disi.unitn.it

Abstract

Major advances have recently been made in merging language and vision representations. Most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw multimodal data to perform certain types of higher level reasoning, expressed in natural language by function words. A case in point is given by their ability to learn quantifiers, i.e. expressions like 'few', 'some' and 'all'. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in 'most fish are red', 'most' encodes the proportion of fish which are red fish. In this paper, we study how well current neural network strategies model such relations. We propose a task where, given an image and a query expressed by an object–property pair, the system must return a quantifier expressing which proportion of the queried objects have the queried property. Our contributions are twofold. First, we show that the best performance on this task involves coupling state-of-the-art attention mechanisms with a network architecture mirroring the logical structure assigned to quantifiers by classic linguistic formalisation. Second, we introduce a new balanced dataset of image scenarios associated with quantification queries, which we hope will foster further research in this area.

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2018 
