
Deep problems with neural network models of human vision

Published online by Cambridge University Press:  01 December 2022

Jeffrey S. Bowers
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. j.bowers@bristol.ac.uk; https://jeffbowers.blogs.bristol.ac.uk/
Gaurav Malhotra
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. gaurav.malhotra@bristol.ac.uk
Marin Dujmović
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. marin.dujmovic@bristol.ac.uk
Milton Llera Montero
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. m.lleramontero@bristol.ac.uk
Christian Tsvetkov
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. christian.tsvetkov@bristol.ac.uk
Valerio Biscione
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. valerio.biscione@gmail.com
Guillermo Puebla
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK. guillermo.puebla@bristol.ac.uk
Federico Adolfi
Affiliation:
School of Psychological Science, University of Bristol, Bristol, UK; Ernst Strüngmann Institute (ESI) for Neuroscience in Cooperation with Max Planck Society, Frankfurt am Main, Germany. fedeadolfi@gmail.com
John E. Hummel
Affiliation:
Department of Psychology, University of Illinois Urbana–Champaign, Champaign, IL, USA. jehummel@illinois.edu
Rachel F. Heaton
Affiliation:
Department of Psychology, University of Illinois Urbana–Champaign, Champaign, IL, USA. rmflood2@illinois.edu
Benjamin D. Evans
Affiliation:
Department of Informatics, School of Engineering and Informatics, University of Sussex, Brighton, UK. b.d.evans@sussex.ac.uk
Jeffrey Mitchell
Affiliation:
Department of Informatics, School of Engineering and Informatics, University of Sussex, Brighton, UK. j.mitchell@napier.ac.uk
Ryan Blything
Affiliation:
School of Psychology, Aston University, Birmingham, UK. r.blything@aston.ac.uk

Abstract

Deep neural networks (DNNs) have had extraordinary successes in classifying photographic images of objects and are often described as the best models of biological vision. This conclusion is largely based on three sets of findings: (1) DNNs are more accurate than any other model in classifying images taken from various datasets, (2) DNNs do the best job in predicting the pattern of human errors in classifying objects taken from various behavioral datasets, and (3) DNNs do the best job in predicting brain signals in response to images taken from various brain datasets (e.g., single-cell responses or fMRI data). However, these behavioral and brain datasets do not test hypotheses about which features contribute to the good predictions, and we show that the predictions may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition. We argue that theorists interested in developing biologically plausible models of human vision need to direct their attention to explaining psychological findings. More generally, theorists need to build models that explain the results of experiments that manipulate independent variables designed to test hypotheses, rather than compete on making the best predictions. We conclude by briefly summarizing various promising modeling approaches that focus on psychological data.

Information

Type
Target Article
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Figure 1. Example images from six categories, taken from Kiani et al. (2007).


Figure 2. Example images of cars, fruits, and animals at various poses with random backgrounds from Majaj et al. (2015).


Figure 3. RSA calculation. A series of stimuli from a set of categories (or conditions) are used as inputs to two different systems (e.g., a brain and a DNN). The corresponding neural or unit activity for each stimulus is recorded and pairwise distances in the activations within each system are calculated to get the representational geometry of each system. This representational geometry is expressed as a representational dissimilarity matrix (RDM) for each system. Finally, an RSA score is determined by computing the correlation between the two RDMs (image taken from Dujmović et al., 2022).
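The RSA pipeline described in this caption can be sketched in a few lines. This is a minimal illustration, not code from the paper; the choice of correlation distance for the RDMs and of Spearman correlation for comparing them are common conventions in the RSA literature, assumed here.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations):
    """Representational dissimilarity matrix (condensed form).

    activations: (n_stimuli, n_units) array, one row of unit (or
    neural) responses per stimulus. Returns the pairwise correlation
    distances (1 - Pearson r) between stimulus representations.
    """
    return pdist(activations, metric="correlation")

def rsa_score(acts_a, acts_b):
    """RSA score: Spearman correlation between the RDMs of two
    systems, computed over the same stimuli in the same order."""
    rho, _ = spearmanr(rdm(acts_a), rdm(acts_b))
    return rho

# Toy example: 8 stimuli presented to two "systems" of 20 units each.
rng = np.random.default_rng(0)
brain = rng.normal(size=(8, 20))
model_matched = 2 * brain + 1   # same geometry: correlation distance
                                # is invariant to this affine rescaling
model_unrelated = rng.normal(size=(8, 20))  # independent system
```

Here `rsa_score(brain, model_matched)` is (near) 1 despite the two systems having different raw activations, while `rsa_score(brain, model_unrelated)` will typically be near 0 — the score compares representational geometries, not activations directly.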


Figure 4. RSA (left) and Brain-Score (right) for networks trained on predictive pixels. The location of the pixel patches varied across conditions, such that the location was positively correlated, negatively correlated, or uncorrelated with the representational distances between classes in macaque IT. When the pixel distances were positively correlated in the training set, RSA scores approached those achieved by networks pretrained on ImageNet and fine-tuned on unperturbed images. When the training images did not contain the pixel confounds, the location of the pixels at test did not impact RSA scores. The dataset dependence of RSA scores extends to neural predictivity as measured by Brain-Score, as the same pixel networks explain significantly more macaque IT activity when the confounding feature is present in the stimuli (RSA scores taken from Dujmović et al., 2022; Brain-Score results are part of ongoing, unpublished research).


Figure 5. Different models fall in different parts of the theory landscape. Critically, it is possible to do well on prediction-based experiments despite poor correspondences to human vision, and there is no reason to expect that modifying a model to perform better on these experiments will necessarily result in a better model of human vision. Similarly, poor performance does not preclude a model from sharing important similarities with human vision. The noise ceiling refers to how well humans predict one another on prediction-based experiments, and it is the best performance one can expect from a model.


Figure 6. Example of adversarial images for three different stimuli generated in different ways. In all cases the model is over 99% confident in its classification. Images taken from Nguyen, Yosinski, and Clune (2015).


Figure 7. Illustration of a style-transfer image in which (a) the texture of an elephant and (b) the shape of a cat combine to form (c) an image with the shape of a cat and the texture of an elephant. The top three classifications of a DNN for each image are listed below it; the model classifies the style-transfer image as an elephant with 63.9% confidence (the cat is not among the DNN's top three choices, which together account for 99.9% of its confidence). Images taken from Geirhos et al. (2019).


Figure 8. Examples of novel stimuli defined by shape as well as one other nonshape feature. In (a) global shape and location of one of the patches define a category, and for illustration, the predictive patch is circled. Stimuli in the same category (top row) have a patch with the same color and the same location, while none of the stimuli in any other category (bottom row) have a patch at this location. In (b) global shape and color of one of the segments predicts stimulus category. Only stimuli in the same category (top row) but not in any other category (bottom row) have a segment of this color (red). The right-most stimulus in the top row shows an example of an image containing a nonshape feature (red segment) but no shape feature. Images taken from Malhotra et al. (2022).


Figure 9. Illustration of (a) a silhouette image of a camel, (b) an image of a camel in which local shape features were removed by jittering the contours, and (c) an image of a camel in which global shape was disrupted. The DNNs had more difficulty under condition (b) than condition (c). Images taken from Baker et al. (2018b).


Figure 10. Example of (a) a basis object, (b) a relational variant object that was identical to the basis object except that one line was moved so that its "above/below" relation to the line to which it was connected changed (from above to below or vice versa), as highlighted by the circle, and (c) a pixel variant object that was identical to the basis object except that two lines were moved in a way that preserved the categorical spatial relations between all the lines composing the object but changed the coordinates of two lines, as highlighted by the oval. Images taken from Malhotra et al. (2021).


Figure 11. The phenomenon of filling-in suggests that edges and textures are initially processed separately and then combined to produce percepts. In this classic example from Krauskopf (1963), an inner green disk (depicted in white) is surrounded by a red annulus (depicted in dark gray). Under normal viewing conditions the stimulus at the top left leads to the percept at the top right. However, when the red-green boundary was stabilized on the retina, as depicted in the lower left, subjects reported that the central disk disappeared and the whole target – disk and annulus – appeared red, as in the lower right. That is, not only does the stabilized image (the green-red boundary) disappear (due to photoreceptor fatigue), but the texture from the outer annulus fills in the entire surface, as there is no longer a boundary to block the filling-in process. For more details see Pessoa, Thompson, and Noë (1998).


Figure 12. (a) Under the standard vernier discrimination conditions two vertical lines are offset, and the task of the participant is to judge whether the top line is to the left or right of the bottom line. (b) Under the crowding condition the vernier stimulus is surrounded by a square and discriminations are much worse. (c) Under the uncrowding condition a series of additional squares are presented. Performance is much better here, although not as good as in (a).


Figure 13. Example stimuli taken from nine different stimulus sets, with "same" trials depicted on the top row and "different" trials on the bottom. The level of similarity between stimulus sets varied, with the greatest overlap between the irregular and regular sets, and little overlap between the irregular set on the one hand and the lines or arrow datasets on the other. Image taken from Puebla and Bowers (2022).


Figure 14. For humans the perceptual distance between the top pair of figures (marked d1) is larger than the perceptual distance between the two pairs of objects on the bottom (marked d2). For DNNs, the perceptual distance is the same for all pairs. Images taken from Jacob et al. (2021).


Figure 15. Pomerantz and Portillo (2011) measured Gestalts by constructing a base pair of images (two dots in different locations) and then adding the same context stimulus to each base, such that the new image pairs could be distinguished not only by the location of the dots in the base, but also by the orientation, linearity, or proximity of the dots. They reported that human participants are faster to distinguish the pair of stimuli under the latter conditions than under the base condition. By contrast, the various DNNs, including DNNs that perform well on Brain-Score, treat the pairs under the orientation, linearity, and proximity conditions as more similar than the base pairs. Images taken from Biscione and Bowers (2023).