
Advances in deep learning approaches for image tagging

Published online by Cambridge University Press:  04 October 2017

Jianlong Fu*
Affiliation:
Microsoft Research, No. 5, Dan Ling Street, Haidian District, Beijing, P. R. China
Yong Rui
Affiliation:
Microsoft Research, No. 5, Dan Ling Street, Haidian District, Beijing, P. R. China
*
Corresponding author: J. Fu Email: jianf@microsoft.com

Abstract

The advent of mobile devices and media cloud services has led to unprecedented growth of personal photo collections. One of the fundamental problems in managing the increasing number of photos is automatic image tagging: the task of assigning human-friendly tags to an image so that the semantic tags better reflect the content of the image and thus help users access it. The quality of image tagging depends on the quality of concept modeling, which builds a mapping from concepts to visual images. While significant progress was made on image tagging in the past decade, previous approaches achieved only limited success because of the limited concept representation ability of hand-crafted features (e.g., Scale-Invariant Feature Transform, GIST, and Histogram of Oriented Gradients). Further progress has been made since efficient and effective deep learning algorithms were developed. The purpose of this paper is to categorize and evaluate different image tagging approaches based on deep learning techniques. We also discuss problems and applications relevant to image tagging, including data collection, evaluation metrics, and existing commercial systems. We summarize the advantages of the different image tagging paradigms and propose several promising research directions for future work.

Information

Type
Industrial Technology Advances
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2017

Fig. 1. Examples of image tagging and tag refinement results. A red tag with a “−” superscript indicates an imprecise tag that should be removed from the initial image tagging results, and a green tag with a “+” superscript indicates a tag enriched by image tag refinement approaches. All tags are ranked by their relevance scores to the image.


Table 1. Comparison of different CNN architectures in terms of model size, classification error rate, and model depth.


Fig. 2. An example of a model-free image tag refinement framework. For an input image on the left, we first find its semantically related images in the training set by searching with its initial image tagging results. Second, we build a star graph from the semantically related images based on visual similarity, shown on the right. The nearest-neighbor images that are both semantically related and visually similar to the input image are marked by yellow rectangles in the left image list and the right star graph. The final tagging list for the input image is generated by the voting of those nearest-neighbor images marked by yellow rectangles. [Best viewed in color.]
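The voting step described in the caption of Fig. 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the nearest-neighbor images (the yellow-marked ones) have already been retrieved, and ranks candidate tags simply by how many neighbors carry each tag.

```python
from collections import Counter

def refine_tags(neighbor_tag_lists, top_k=5):
    """Hypothetical sketch of model-free tag refinement by neighbor voting.

    neighbor_tag_lists: one tag list per nearest-neighbor image, i.e. the
        images that are both semantically related and visually similar to
        the input image (the yellow-marked images in Fig. 2).
    Returns the top_k tags ranked by how many neighbors vote for them.
    """
    votes = Counter()
    for tags in neighbor_tag_lists:
        votes.update(tags)  # each neighbor casts one vote per tag it carries
    return [tag for tag, _ in votes.most_common(top_k)]
```

A real system would typically weight each neighbor's vote by its visual similarity to the input image rather than counting all neighbors equally.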


Fig. 3. An example of a model-based image tagging framework. (a) The training set contains the labeled source images and the unlabeled target images. (b) The network of the transfer deep learning with ontology priors. It is first trained on both ImageNet (the source domain) and personal photos (the target domain) by pre-training and fine-tuning to discover middle-level feature abstractions shared across domains. Once the shared feature abstractions are learned, the top layer with ontology priors is further trained. In the testing stage, the resultant parameters W and B can be transferred to the target domain to obtain the middle-level feature representations (a bottom-up transfer) and the high-level confidence scores (a top-down transfer). (c) An illustration of the ontology collecting scheme. (d) In the testing stage, the input is highly flexible: it can be either a single photo or a photo collection. (e) The tagging result.
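The two transfers in Fig. 3(b) can be sketched at inference time as below. This is an illustrative toy sketch under strong simplifying assumptions (a single ReLU layer for W, a linear top layer for B, and the ontology prior reduced to a binary mask over allowed concepts); none of these names or shapes come from the original implementation.

```python
import numpy as np

def transfer_predict(x, W, B, ontology_mask):
    """Hypothetical sketch of the bottom-up / top-down transfer in Fig. 3.

    x: raw feature vector of a target-domain photo
    W: transferred lower-layer weights (bottom-up transfer to
       middle-level feature representations)
    B: transferred top-layer weights trained with ontology priors
       (top-down transfer to high-level confidence scores)
    ontology_mask: 0/1 vector keeping only concepts the ontology allows
    Returns a confidence distribution over the tag vocabulary.
    """
    h = np.maximum(W @ x, 0.0)          # middle-level abstraction (ReLU)
    logits = B @ h                      # top-layer scores
    logits = np.where(ontology_mask > 0, logits, -np.inf)  # ontology prior
    e = np.exp(logits - logits.max())   # stable softmax over allowed tags
    return e / e.sum()
```

In the actual framework, W stands for several stacked layers learned by pre-training and fine-tuning, and the ontology prior shapes training of the top layer rather than merely masking scores at test time.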


Table 2. Statistics on the number of tags and images in different image tagging datasets.


Table 3. Image tagging performance, measured by mean average precision (MAP), on MIRFlickr-25K and NUS-WIDE-270K for the different comparison approaches.


Fig. 4. Image tagging results from a typical model-based approach [21]. Note that the underlined tags are missing from the ImageNet categories, which illustrates the vocabulary difference between image classification and image tagging.


Table 4. Tagging performance, in terms of both mean average precision (MAP) and normalized discounted cumulative gain (NDCG), on the 1000 testing photos from the NUS-WIDE-270K dataset.
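The two metrics reported in Tables 3 and 4 can be computed per image as sketched below. This is a standard textbook formulation (binary relevance, log2 discount for NDCG), offered as a reference sketch; the exact truncation depth and gain definition used in the reported experiments may differ.

```python
import math

def average_precision(ranked_tags, relevant):
    """AP for one image: ranked tag list vs. the set of ground-truth tags."""
    hits, score = 0, 0.0
    for i, tag in enumerate(ranked_tags, start=1):
        if tag in relevant:
            hits += 1
            score += hits / i       # precision at each correct position
    return score / max(len(relevant), 1)

def ndcg(ranked_tags, relevance, k=None):
    """NDCG for one image; relevance maps tag -> gain (e.g., 1 if correct)."""
    ranked = ranked_tags[:k] if k else ranked_tags
    dcg = sum(relevance.get(t, 0) / math.log2(i + 1)
              for i, t in enumerate(ranked, start=1))
    ideal = sorted(relevance.values(), reverse=True)[:len(ranked)]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

MAP is then the mean of `average_precision` over all test images; NDCG rewards placing correct tags near the top of the ranked list, which matches the relevance-ordered tag lists shown in Fig. 1.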