A tutorial survey of architectures, algorithms, and applications for deep learning

Li Deng
Abstract

In this invited paper, my overview material on the same topic as presented in the plenary overview session of APSIPA-2011 and the tutorial material presented in the same conference [1] are expanded and updated to include more recent developments in deep learning. The previous and the updated materials cover both theory and applications, and analyze the field's future directions. The goal of this tutorial survey is to introduce the emerging area of deep learning, or hierarchical learning, to the APSIPA community. Deep learning refers to a class of machine learning techniques, developed largely since 2006, in which many stages of non-linear information processing in hierarchical architectures are exploited for pattern classification and feature learning. In the more recent literature, it is also connected to representation learning, which involves a hierarchy of features or concepts where higher-level concepts are defined from lower-level ones and where the same lower-level concepts help to define higher-level ones. This tutorial survey first discusses a brief history of deep learning research. A classificatory scheme is then developed to analyze and summarize major work reported in the recent deep learning literature. Using this scheme, I provide a taxonomy-oriented survey of the existing deep architectures and algorithms, categorizing them into three classes: generative, discriminative, and hybrid. Three representative deep architectures, one from each class, are presented in more detail: deep autoencoders, deep stacking networks together with their generalization to the temporal domain (recurrent networks), and deep neural networks pretrained with deep belief networks. Next, selected applications of deep learning are reviewed in broad areas of signal and information processing, including audio/speech, image/vision, multimodality, language modeling, natural language processing, and information retrieval. Finally, future directions of deep learning are discussed and analyzed.
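The abstract's central idea, hierarchical feature learning in which each layer's representation feeds the next, can be made concrete with a small sketch of the first of the three representative architectures it names: a deep autoencoder built by greedy layer-wise pretraining. The code below is an illustrative sketch of mine, not the paper's implementation; the layer sizes, learning rate, and synthetic data are all placeholder assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(X, n_hidden, lr=0.5, epochs=200):
    # One shallow autoencoder: sigmoid encoder, linear decoder,
    # trained by plain gradient descent on mean squared error.
    n_vis = X.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_vis, n_hidden))  # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_vis))  # decoder weights
    b2 = np.zeros(n_vis)
    n = len(X)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # encode
        R = H @ W2 + b2                    # decode
        dR = (R - X) / n                   # gradient of 0.5 * MSE w.r.t. R
        dH = (dR @ W2.T) * H * (1.0 - H)   # backprop through the sigmoid
        W2 -= lr * (H.T @ dR); b2 -= lr * dR.sum(axis=0)
        W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
    return W1, b1, float(np.mean((R - X) ** 2))

# Synthetic data standing in for, e.g., spectral feature frames.
X = rng.random((256, 64))

# Greedy layer-wise pretraining: each new layer models the features
# produced by the layers below it, yielding a hierarchy in which
# higher-level representations are built from lower-level ones.
layers, feats = [], X
for n_hidden in (32, 16, 8):
    W, b, err = pretrain_layer(feats, n_hidden)
    layers.append((W, b))
    feats = sigmoid(feats @ W + b)         # features feed the next layer
    print(f"layer {len(layers)}: {feats.shape[1]} units, recon MSE = {err:.4f}")

In the settings the survey covers, such pretrained weights would typically initialize a deep network that is then fine-tuned end-to-end by backpropagation; the DBN-pretrained deep neural networks the abstract mentions follow the same layer-by-layer recipe, with restricted Boltzmann machines in place of the shallow autoencoders sketched here.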

Copyright
The online version of this article is published within an Open Access environment subject to the conditions of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/).
Corresponding author
L. Deng, email: deng@microsoft.com
Linked references

This list contains references from the content that can be linked to their source. For a full set of references and notes please see the PDF or HTML where available.

[2] L. Deng: Expanding the scope of signal processing. IEEE Signal Process. Mag., 25 (3) (2008), 2–4.

[4] Y. Bengio: Learning deep architectures for AI. Found. Trends Mach. Learn., 2 (1) (2009), 1–127.

[5] Y. Bengio; A. Courville; P. Vincent: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35 (2013), 1798–1828.

[6] G. Hinton: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag., 29 (6) (2012), 82–97.

[7] D. Yu; L. Deng: Deep learning and its applications to signal and information processing. IEEE Signal Process. Mag., 28 (2011), 145–154.

[8] I. Arel; C. Rose; T. Karnowski: Deep machine learning – a new frontier in artificial intelligence. IEEE Comput. Intell. Mag., 5 (2010), 13–18.

[13] J. Baker: Research developments and directions in speech recognition and understanding. IEEE Signal Process. Mag., 26 (3) (2009), 75–80.

[14] J. Baker: Updated MINDS report on speech recognition and understanding. IEEE Signal Process. Mag., 26 (4) (2009), 78–85.

[15] L. Deng: Computational models for speech production, in Computational Models of Speech Pattern Processing, Springer-Verlag, Berlin, Heidelberg, 1999, 199–213.

[21] G. Hinton; R. Salakhutdinov: Reducing the dimensionality of data with neural networks. Science, 313 (5786) (2006), 504–507.

[24] A. Mohamed; G. Dahl; G. Hinton: Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process., 20 (1) (2012), 14–22.

[25] G. Dahl; D. Yu; L. Deng; A. Acero: Context-dependent DBN-HMMs in large vocabulary continuous speech recognition. IEEE Trans. Audio Speech Lang. Process., 20 (1) (2012), 30–42.

[33] N. Morgan: Deep and wide: multiple layers in automatic speech recognition. IEEE Trans. Audio Speech Lang. Process., 20 (1) (2012), 7–13.

[34] L. Deng; X. Li: Machine learning paradigms in speech recognition: an overview. IEEE Trans. Audio Speech Lang. Process., 21 (2013), 1060–1089.

[54] L. Deng: A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal. Signal Process., 27 (1) (1992), 65–78.

[55] L. Deng: A stochastic model of speech incorporating hierarchical nonstationarity. IEEE Trans. Speech Audio Process., 1 (4) (1993), 471–475.

[56] L. Deng; M. Aksmanovic; D. Sun; J. Wu: Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states. IEEE Trans. Speech Audio Process., 2 (4) (1994), 507–520.

[57] M. Ostendorf; V. Digalakis; O. Kimball: From HMM's to segment models: a unified view of stochastic modeling for speech recognition. IEEE Trans. Speech Audio Process., 4 (5) (1996), 360–378.

[58] L. Deng; H. Sameti: Transitional speech units and their representation by regressive Markov states: applications to speech recognition. IEEE Trans. Speech Audio Process., 4 (4) (1996), 301–306.

[59] L. Deng; M. Aksmanovic: Speaker-independent phonetic classification using hidden Markov models with state-conditioned mixtures of trend functions. IEEE Trans. Speech Audio Process., 5 (1997), 319–324.

[62] H. Zen; Y. Nankaku; K. Tokuda: Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans. Audio Speech Lang. Process., 19 (2) (2011), 417–430.

[63] H. Zen; M. J. F. Gales; Y. Nankaku; K. Tokuda: Product of experts for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 20 (3) (2012), 794–805.

[64] Z. Ling; K. Richmond; J. Yamagishi: Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Trans. Audio Speech Lang. Process., 21 (2013), 207–219.

[66] M. Shannon; H. Zen; W. Byrne: Autoregressive models for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 21 (3) (2013), 587–597.

[71] L. Deng; X.D. Huang: Challenges in adopting speech recognition. Commun. ACM, 47 (1) (2004), 11–13.

[72] J. Ma; L. Deng: Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model. IEEE Trans. Speech Audio Process., 11 (6) (2003), 590–602.

[73] J. Ma; L. Deng: Target-directed mixture dynamic models for spontaneous speech recognition. IEEE Trans. Speech Audio Process., 12 (1) (2004), 47–58.

[74] L. Deng; D. Yu; A. Acero: Structured speech modeling. IEEE Trans. Audio Speech Lang. Process., 14 (5) (2006), 1492–1504.

[75] L. Deng; D. Yu; A. Acero: A bidirectional target filtering model of speech coarticulation: two-stage implementation for phonetic recognition. IEEE Trans. Audio Speech Lang. Process., 14 (1) (2006), 256–265.

[77] J. Bilmes; C. Bartels: Graphical model architectures for speech recognition. IEEE Signal Process. Mag., 22 (2005), 89–100.

[80] M. Wohlmayr; M. Stark; F. Pernkopf: A probabilistic interaction model for multipitch tracking with factorial hidden Markov model. IEEE Trans. Audio Speech Lang. Process., 19 (4) (2011).

[83] S. Fine; Y. Singer; N. Tishby: The hierarchical hidden Markov model: analysis and applications. Mach. Learn., 32 (1998), 41–62.

[84] N. Oliver; A. Garg; E. Horvitz: Layered representations for learning and inferring office activity from multiple sensory channels. Comput. Vis. Image Understand., 96 (2004), 163–180.

[87] B.-H. Juang; W. Chou; C.-H. Lee: Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process., 5 (1997), 257–265.

[88] R. Chengalvarayan; L. Deng: Speech trajectory discrimination using the minimum classification error learning. IEEE Trans. Speech Audio Process., 6 (6) (1998), 505–515.

[91] H. Jiang; X. Li: Parameter estimation of statistical models using convex optimization: an advanced method of discriminative training for speech and language processing. IEEE Signal Process. Mag., 27 (3) (2010), 115–127.

[94] M. Gibson; T. Hain: Error approximation and minimum phone error acoustic model estimation. IEEE Trans. Audio Speech Lang. Process., 18 (6) (2010), 1269–1279.

[96] D. Yu; S. Wang; L. Deng: Sequential labeling using deep-structured conditional random fields. IEEE J. Sel. Top. Signal Process., 4 (2010), 965–973.

[97] Y. Hifny; S. Renals: Speech recognition using augmented conditional random fields. IEEE Trans. Audio Speech Lang. Process., 17 (2) (2009), 354–365.

[98] I. Heintz; E. Fosler-Lussier; C. Brew: Discriminative input stream combination for conditional random field phone recognition. IEEE Trans. Audio Speech Lang. Process., 17 (8) (2009), 1533–1546.

[101] G. Heigold; H. Ney; P. Lehnen; T. Gass; R. Schluter: Equivalence of generative and log-linear models. IEEE Trans. Audio Speech Lang. Process., 19 (5) (2011), 1138–1148.

[104] J. Pinto; S. Garimella; M. Magimai-Doss; H. Hermansky; H. Bourlard: Analysis of MLP-based hierarchical phone posterior probability estimators. IEEE Trans. Audio Speech Lang. Process., 19 (2) (2011), 225–241.

[105] H. Ketabdar; H. Bourlard: Enhanced phone posteriors for improving speech recognition systems. IEEE Trans. Audio Speech Lang. Process., 18 (6) (2010), 1094–1106.

[106] N. Morgan: Pushing the envelope – aside [speech recognition]. IEEE Signal Process. Mag., 22 (5) (2005), 81–88.

[112] B. Hutchinson; L. Deng; D. Yu: Tensor deep stacking networks. IEEE Trans. Pattern Anal. Mach. Intell., 35 (2013), 1944–1957.

[113] L. Deng; K. Hassanein; M. Elmasry: Analysis of correlation structure for a neural predictive model with application to speech recognition. Neural Netw., 7 (2) (1994), 331–339.

[114] A. Robinson: An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Netw., 5 (1994), 298–305.

[115] A. Graves; S. Fernandez; F. Gomez; J. Schmidhuber: Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks, in Proc. ICML, 2006.

[118] Y. LeCun; L. Bottou; Y. Bengio; P. Haffner: Gradient-based learning applied to document recognition. Proc. IEEE, 86 (1998), 2278–2324.

[127] K. Lang; A. Waibel; G. Hinton: A time-delay neural network architecture for isolated word recognition. Neural Netw., 3 (1) (1990), 23–43.

[129] A. Waibel; T. Hanazawa; G. Hinton; K. Shikano; K. Lang: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process., 37 (3) (1989), 328–339.

[133] M. Siniscalchi; D. Yu; L. Deng; C.-H. Lee: Exploiting deep neural networks for detection-based speech recognition. Neurocomputing, 106 (2013), 148–157.

[134] M. Siniscalchi; T. Svendsen; C.-H. Lee: A bottom-up modular search approach to large vocabulary continuous speech recognition. IEEE Trans. Audio Speech Lang. Process., 21 (2013), 786–797.

[137] J. Sun; L. Deng: An overlapping-feature based phonological model incorporating linguistic constraints: applications to speech recognition. J. Acoust. Soc. Am., 111 (2) (2002), 1086–1101.

[142] H. Larochelle; Y. Bengio: Classification using discriminative restricted Boltzmann machines, in Proc. ICML, 2008.

[143] H. Lee; R. Grosse; R. Ranganath; A. Ng: Unsupervised learning of hierarchical representations with convolutional deep belief networks. Commun. ACM, 54 (10) (2011), 95–103.

[144] H. Lee; R. Grosse; R. Ranganath; A. Ng: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in Proc. ICML, 2009.

[148] X. He; L. Deng: Speech recognition, machine translation, and speech translation – a unifying discriminative framework. IEEE Signal Process. Mag., 28 (2011), 126–133.

[149] S. Yamin; L. Deng; Y. Wang; A. Acero: An integrative and discriminative technique for spoken utterance classification. IEEE Trans. Audio Speech Lang. Process., 16 (2008), 1207–1214.

[151] X. He; L. Deng: Speech-centric information processing: an optimization-oriented approach. Proc. IEEE, 2013.

[159] D. Wolpert: Stacked generalization. Neural Netw., 5 (2) (1992), 241–259.

[163] L. Deng; J. Ma: Spontaneous speech recognition using a statistical coarticulatory model for the vocal tract resonance dynamics. J. Acoust. Soc. Am., 108 (2000), 3036–3048.

[164] R. Togneri; L. Deng: Joint state and parameter estimation for a target-directed nonlinear dynamic system model. IEEE Trans. Signal Process., 51 (12) (2003), 3061–3070.

[166] G. Sivaram; H. Hermansky: Sparse multilayer perceptron for phoneme recognition. IEEE Trans. Audio Speech Lang. Process., 20 (1) (2012), 23–29.

[170] B. Juang; S. Levinson; M. Sondhi: Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Trans. Inf. Theory, 32 (1986), 307–309.

[171] L. Deng; M. Lennig; F. Seitz; P. Mermelstein: Large vocabulary word recognition using context-dependent allophonic hidden Markov models. Comput. Speech Lang., 4 (4) (1990), 345–357.

[172] L. Deng; P. Kenny; M. Lennig; V. Gupta; F. Seitz; P. Mermelstein: Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition. IEEE Trans. Signal Process., 39 (7) (1991), 1677–1681.

[173] H. Sheikhzadeh; L. Deng: Waveform-based speech recognition using hidden filter models: parameter selection and sensitivity to power normalization. IEEE Trans. Speech Audio Process., 2 (1994), 80–91.

[179] D. Yu; J.-Y. Li; L. Deng: Calibration of confidence measures in speech recognition. IEEE Trans. Audio Speech Lang. Process., 19 (2010), 2461–2473.

[181] Z. Ling; L. Deng; D. Yu: Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 21 (10) (2013), 2129–2139.

[194] G. Papandreou; A. Katsamanis; V. Pitsikalis; P. Maragos: Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process., 17 (3) (2009), 423–435.

[195] L. Deng; J. Wu; J. Droppo; A. Acero: Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Trans. Speech Audio Process., 13 (3) (2005), 412–421.

[198] A. Mnih; G. Hinton: Three new graphical models for statistical language modeling, in Proc. ICML, 2007, 641–648.

[204] S. Huang; S. Renals: Hierarchical Bayesian language models for conversational speech recognition. IEEE Trans. Audio Speech Lang. Process., 18 (8) (2010), 1941–1954.

[205] R. Collobert; J. Weston: A unified architecture for natural language processing: deep neural networks with multitask learning, in Proc. ICML, 2008.

[215] D. Yu; L. Deng; F. Seide: The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process., 21 (2013), 388–396.

APSIPA Transactions on Signal and Information Processing (ISSN: 2048-7703)