
Building machines that learn and think like people

Published online by Cambridge University Press:  24 November 2016

Brenden M. Lake
Affiliation:
Department of Psychology and Center for Data Science, New York University, New York, NY 10011 brenden@nyu.edu http://cims.nyu.edu/~brenden/
Tomer D. Ullman
Affiliation:
Department of Brain and Cognitive Sciences and The Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139 tomeru@mit.edu http://www.mit.edu/~tomeru/
Joshua B. Tenenbaum
Affiliation:
Department of Brain and Cognitive Sciences and The Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139 jbt@mit.edu http://web.mit.edu/cocosci/josh.html
Samuel J. Gershman
Affiliation:
Department of Psychology and Center for Brain Science, Harvard University, Cambridge, MA 02138, and The Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139 gershman@fas.harvard.edu http://gershmanlab.webfactional.com/index.html

Abstract

Recent progress in artificial intelligence has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats that of humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn and how they learn it. Specifically, we argue that these machines should (1) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (2) ground learning in intuitive theories of physics and psychology to support and enrich the knowledge that is learned; and (3) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes toward these goals that can combine the strengths of recent neural network advances with more structured cognitive models.

Information

Type
Target Article
Copyright
Copyright © Cambridge University Press 2017 

Table 1. Glossary

Figure 1. The Characters Challenge: Human-level learning of novel handwritten characters (A), with the same abilities also illustrated for a novel two-wheeled vehicle (B). A single example of a new visual concept (red box) can be enough information to support the (i) classification of new examples, (ii) generation of new examples, (iii) parsing of an object into parts and relations, and (iv) generation of new concepts from related concepts. Adapted from Lake et al. (2015a).

Figure 2. Screenshots of Frostbite, a 1983 video game designed for the Atari 2600 console. (A) The start of a level in Frostbite. The agent must construct an igloo by hopping between ice floes and avoiding obstacles such as birds. The floes are in constant motion (either left or right), making multi-step planning essential to success. (B) The agent receives pieces of the igloo (top right) by jumping on the active ice floes (white), which then deactivates them (blue). (C) At the end of a level, the agent must safely reach the completed igloo. (D) Later levels include additional rewards (fish) and deadly obstacles (crabs, clams, and bears).

Figure 3. Comparing learning speed for people versus Deep Q-Networks (DQNs). Performance on the Atari 2600 game Frostbite is plotted as a function of game experience (in hours at a frame rate of 60 fps), which does not include additional experience replay. Learning curves and scores are shown from different networks: DQN (Mnih et al. 2015), DQN+ (Schaul et al. 2016), and DQN++ (Wang et al. 2016). Random play achieves a score of 65.2.
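
The x-axis conversion implied by this caption is straightforward: game experience in hours is raw frames divided by the frame rate, divided by seconds per hour. A minimal sketch (the 200-million-frame budget below is a hypothetical training amount for illustration, not a number taken from the figure):

```python
FPS = 60  # Atari 2600 frame rate assumed in Figure 3's x-axis

def frames_to_hours(frames, fps=FPS):
    """Convert raw game frames to hours of game experience."""
    return frames / fps / 3600  # frames -> seconds -> hours

frames_to_hours(216_000)       # exactly 1 hour of play
frames_to_hours(200_000_000)   # a hypothetical budget: ~926 hours
```

Note that, as the caption stresses, this counts only environment frames; experience replay re-uses stored frames without adding to this total.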

Figure 4. The intuitive physics-engine approach to scene understanding, illustrated through tower stability. (A) The engine takes in inputs through perception, language, memory, and other faculties. It then constructs a physical scene with objects, physical properties, and forces; simulates the scene's development over time; and hands the output to other reasoning systems. (B) Many possible “tweaks” to the input can result in very different scenes, requiring the potential discovery, training, and evaluation of new features for each tweak. Adapted from Battaglia et al. (2013).
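
The core idea (simulate a scene forward under noisy perception, then read off a judgment) can be sketched in a few lines. Everything here is an illustrative assumption, not the engine of Battaglia et al. (2013): blocks are 1-D, equal-mass, and stability is a simple geometric center-of-mass rule rather than a full rigid-body simulation.

```python
import random

def is_stable(xs, widths):
    """Toy stability rule: for each block above the base, the equal-mass
    centre of mass of everything it supports must lie over the extent of
    the block below. xs = block centres (bottom to top), widths = extents."""
    n = len(xs)
    for i in range(1, n):
        com = sum(xs[i:]) / (n - i)          # centre of mass of blocks i..top
        half = widths[i - 1] / 2.0
        if not (xs[i - 1] - half <= com <= xs[i - 1] + half):
            return False
    return True

def p_falls(xs, widths, noise=0.1, n_sims=1000, seed=0):
    """Probabilistic simulation: perceive block positions with Gaussian
    noise, simulate each sample, and estimate P(tower falls)."""
    rng = random.Random(seed)
    falls = 0
    for _ in range(n_sims):
        noisy = [x + rng.gauss(0.0, noise) for x in xs]
        if not is_stable(noisy, widths):
            falls += 1
    return falls / n_sims
```

For a neatly stacked tower (`xs=[0, 0, 0]`) the estimated fall probability is near 0; for a strongly offset tower (`xs=[0, 0.6, 1.2]`) it is near 1. The point of the sketch is the architecture in panel (A): one general-purpose simulator handles arbitrary "tweaks" to the scene, where a feature-based approach would need new features per tweak.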

Figure 5. A causal, compositional model of handwritten characters. (A) New types are generated compositionally by choosing primitive actions (color coded) from a library (i), combining these sub-parts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). These programs can create different tokens of a concept (v) that are rendered as binary images (vi). (B) Probabilistic inference allows the model to generate new examples from just one example of a new concept; shown here in a visual Turing test. An example image of a new concept is shown above each pair of grids. One grid was generated by nine people and the other is nine samples from the BPL model. Which grid in each pair (A or B) was generated by the machine? Answers by row: 1,2;1,1. Adapted from Lake et al. (2015a).
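
The generative hierarchy in panel (A) — primitives combined into sub-parts, parts, and relations to define a type-level program, which is then re-run with motor noise to produce tokens — can be sketched abstractly. The stroke library, size distributions, and jitter below are invented placeholders; the actual BPL model works with pen trajectories and learned primitive statistics.

```python
import random

PRIMITIVES = ["arc", "line", "hook", "loop"]          # assumed stroke library
RELATIONS = ["start", "end", "along"]                 # assumed attachment types

def sample_type(rng):
    """Sample a new character *type*: parts built from primitive sub-parts,
    plus a relation describing where each later part attaches."""
    n_parts = rng.randint(1, 3)
    parts = [[rng.choice(PRIMITIVES) for _ in range(rng.randint(1, 2))]
             for _ in range(n_parts)]
    relations = [rng.choice(RELATIONS) for _ in range(n_parts - 1)]
    return {"parts": parts, "relations": relations}

def sample_token(char_type, rng, jitter=0.05):
    """Sample a *token*: re-run the same type-level program with motor
    noise, so every token shares structure but differs in detail."""
    return {
        "parts": char_type["parts"],
        "relations": char_type["relations"],
        "offsets": [rng.gauss(0.0, jitter) for _ in char_type["parts"]],
    }
```

The causal direction matters: because tokens are produced by re-executing a shared program, inverting the model from a single example (probabilistic inference over parts and relations) is what supports the one-shot generation shown in panel (B).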

Figure 6. Perceiving scenes without intuitive physics, intuitive psychology, compositionality, and causality. Image captions are generated by a deep neural network (Karpathy & Fei-Fei 2017) using code from github.com/karpathy/neuraltalk2. Image credits: Gabriel Villena Fernández (left), TVBS Taiwan/Agence France-Presse (middle), and AP Photo/Dave Martin (right). Similar examples using images from Reuters news can be found at twitter.com/interesting_jpg.

Figure 7. An AI system for playing Go, combining a deep convolutional network (ConvNet) and model-based search through Monte-Carlo Tree Search (MCTS). (A) The ConvNet on its own can be used to predict the next k moves given the current board. (B) A search tree with the current board state as its root and the current “win/total” statistics at each node. A new MCTS rollout selects moves along the tree according to the MCTS policy (red arrows) until it reaches a new leaf (red circle), where the next move is chosen by the ConvNet. From there, play proceeds until the game's end according to a pre-defined default policy based on the Pachi program (Baudiš & Gailly 2012), itself based on MCTS. (C) The end-game result of the new leaf is used to update the search tree. Adapted from Tian and Zhu (2016) with permission.
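
The select/expand/simulate/backpropagate loop the caption walks through can be sketched generically. This is an illustrative toy, not the Tian and Zhu system: a one-pile Nim game (take 1-3 stones; whoever takes the last stone wins) stands in for Go, UCB1 plays the role of the MCTS policy (red arrows), and uniform-random play stands in for both the ConvNet's move proposal and the Pachi-based default policy.

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player   # player = whose turn it is
        self.parent, self.move = parent, move
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.wins = 0.0      # "win/total" stats for the player who moved into this node
        self.visits = 0

def ucb(child, parent_visits, c=1.4):
    """UCB1 score balancing win rate against exploration."""
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def default_policy(stones, player):
    """Rollout to the end with uniform-random play; return the winner."""
    while True:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return player            # this player took the last stone
        player = 1 - player

def mcts(root_stones, player, n_rollouts=2000, rng_seed=0):
    random.seed(rng_seed)
    root = Node(root_stones, player)
    for _ in range(n_rollouts):
        node = root
        # 1. Selection: follow the tree policy to a frontier node.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one new leaf (in the real system the ConvNet
        #    proposes this move; here it is an arbitrary untried move).
        if node.untried:
            m = node.untried.pop()
            node.children.append(Node(node.stones - m, 1 - node.player, node, m))
            node = node.children[-1]
        # 3. Simulation: play out from the leaf with the default policy.
        if node.stones == 0:
            winner = 1 - node.player          # previous mover took the last stone
        else:
            winner = default_policy(node.stones, node.player)
        # 4. Backpropagation: update "win/total" statistics up the tree.
        while node is not None:
            node.visits += 1
            if winner == 1 - node.player:     # win for the player who moved into node
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```

Even with random rollouts, the statistics concentrate on good moves: from a pile of 5 the search settles on taking 1 stone (leaving the losing position 4), mirroring how the full system combines model-free move proposals with model-based look-ahead.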