
Toward human-centric deep video understanding

Published online by Cambridge University Press:  13 January 2020

Wenjun Zeng*
Affiliation:
Microsoft Research Asia, No. 5 Danling Street, Haidian District, Beijing, China
*
Corresponding author: W. Zeng Email: wezeng@microsoft.com

Abstract

People are at the very heart of our daily work and life. As we strive to leverage artificial intelligence to empower every person on the planet to achieve more, we need to understand people far better than we can today. Human–computer interaction plays a significant role in human–machine hybrid intelligence, and human understanding is a critical step in addressing the tremendous challenges of video understanding. In this paper, we share our views on why and how to use a human-centric approach to address challenging video understanding problems. We discuss human-centric vision tasks and their status, highlighting the challenges and how our understanding of human brain functions can be leveraged to address some of them effectively. We show that semantic models, view-invariant models, and spatial-temporal visual attention mechanisms are important building blocks. We also discuss future perspectives on video understanding.

Information

Type
Industrial Technology Advances
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2020

Fig. 1. Performance of the winners of the ImageNet classification competitions over the years.


Fig. 2. Accuracy-speed trade-off of top-performing trackers on the OTB-100 benchmark. The speed axis is logarithmic. Reproduced from Fig. 8 of [7]. Please refer to [7] for the notations of different trackers. “Ours” refers to the SPM-Tracker [7].


Fig. 3. Spatial-temporal attention network. Both attention networks use a one-layer LSTM network.
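The paper does not reproduce the network's equations here, so the following is only a minimal NumPy sketch of the general idea behind such a spatial-temporal attention model: a one-layer LSTM scans the regions of each frame and scores them (spatial attention), a second one-layer LSTM scans the resulting frame vectors and scores the frames (temporal attention), and softmax-weighted pooling yields a clip-level descriptor. All function names, dimensions, and the random initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_step(x, h, c, W, U, b):
    # One step of a single-layer LSTM cell; gates stacked as [i, f, o, g].
    z = x @ W + h @ U + b
    i, f, o, g = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

def init_lstm(D, H, rng):
    # Hypothetical random initialization, for illustration only.
    return (rng.standard_normal((D, 4 * H)) * 0.1,
            rng.standard_normal((H, 4 * H)) * 0.1,
            np.zeros(4 * H))

def spatial_temporal_attention(feats, H, rng):
    """feats: (T, R, D) = frames x regions x feature dimension."""
    T, R, D = feats.shape
    Ws, Us, bs = init_lstm(D, H, rng)      # spatial attention LSTM
    Wt, Ut, bt = init_lstm(D, H, rng)      # temporal attention LSTM
    ws = rng.standard_normal(H) * 0.1      # spatial scoring vector
    wt = rng.standard_normal(H) * 0.1      # temporal scoring vector

    # Spatial attention: per frame, the LSTM hidden state scores each
    # region; the frame vector is the attention-weighted region pool.
    frame_vecs = []
    for t in range(T):
        h, c = np.zeros(H), np.zeros(H)
        scores = []
        for r in range(R):
            h, c = lstm_step(feats[t, r], h, c, Ws, Us, bs)
            scores.append(h @ ws)
        alpha = softmax(np.array(scores))   # (R,), sums to 1
        frame_vecs.append(alpha @ feats[t]) # (D,)
    frame_vecs = np.stack(frame_vecs)       # (T, D)

    # Temporal attention: a second LSTM scans the frame vectors and
    # scores each frame; the clip descriptor is the weighted pool.
    h, c = np.zeros(H), np.zeros(H)
    scores = []
    for t in range(T):
        h, c = lstm_step(frame_vecs[t], h, c, Wt, Ut, bt)
        scores.append(h @ wt)
    beta = softmax(np.array(scores))        # (T,), sums to 1
    return beta @ frame_vecs                # (D,) clip-level descriptor

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 7, 16))     # 5 frames, 7 regions, 16-d
clip = spatial_temporal_attention(feats, H=8, rng=rng)
print(clip.shape)  # (16,)
```

In a trained model these parameters would of course be learned end to end with the downstream recognition loss rather than sampled at random; the sketch only shows how the two attention stages compose.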


Fig. 4. Illustration of a retail intelligence scenario in which multiple cameras are deployed, the 3D space is reconstructed, people are detected and tracked, and a heatmap (in purple) is generated.