
Theoretical analysis of skip connections and batch normalization from generalization and optimization perspectives

Published online by Cambridge University Press:  27 February 2020

Yasutaka Furusho
Affiliation:
Nara Institute of Science and Technology, Ikoma, Nara, 8916-5, Japan
Kazushi Ikeda*
Affiliation:
Nara Institute of Science and Technology, Ikoma, Nara, 8916-5, Japan
*
Corresponding author: Kazushi Ikeda Email: kazushi@is.naist.jp

Abstract

Deep neural networks (DNNs) share the structure of the neocognitron proposed in 1979 but achieve much better performance, because they incorporate many heuristic techniques such as pre-training, dropout, skip connections, batch normalization (BN), and stochastic depth. However, why these techniques improve performance is not fully understood. Recently, two tools for theoretical analysis have been proposed. One evaluates the generalization gap, defined as the difference between the expected loss and the empirical loss, by calculating the algorithmic stability; the other evaluates the convergence rate by calculating the eigenvalues of the Fisher information matrix (FIM) of DNNs. This overview paper briefly introduces these tools and demonstrates their usefulness by explaining why skip connections and BN improve performance.
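The generalization-gap tool mentioned above can be illustrated numerically. The sketch below is not from the paper: it estimates uniform algorithmic stability for a plain linear model trained by gradient descent (the paper's analysis concerns DNNs), by retraining with one example removed and measuring the resulting change in loss at that example. All names (`train`, `loss`, the toy data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, lr=0.1, steps=200):
    """Fit a linear model w by plain gradient descent on squared loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

# Toy data: N examples, d features, a noisy linear target.
N, d = 50, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

# Stability: maximum, over held-out points z(n), of the change in the
# loss at z(n) when z(n) is removed from the training set S, i.e. the
# difference between the models A(S) and A(S^n).
w_full = train(X, y)
eps_stab = 0.0
for n in range(N):
    mask = np.arange(N) != n
    w_loo = train(X[mask], y[mask])  # model A(S^n)
    gap = abs(loss(w_full, X[n], y[n]) - loss(w_loo, X[n], y[n]))
    eps_stab = max(eps_stab, gap)

print(f"empirical stability estimate: {eps_stab:.4f}")
```

A stable algorithm keeps this quantity small as N grows, which is what bounds the generalization gap.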

Information

Type
Original Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2020

Fig. 1. Algorithmic stability. It measures how much the removal of one example $z(n)$ from the training set S affects the trained model $A(S^n)$.


Fig. 2. Excess risk is smaller than a quadratic function of parameters θ. The constant μ for the PL condition controls its flatness.
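For reference, the following is our summary of the standard Polyak-Łojasiewicz (PL) condition, not a formula quoted from the paper. With constant $\mu > 0$ it requires

$$\frac{1}{2}\,\|\nabla L(\theta)\|^2 \;\ge\; \mu\,\bigl(L(\theta) - L^*\bigr),$$

which immediately bounds the excess risk as $L(\theta) - L^* \le \|\nabla L(\theta)\|^2 / (2\mu)$: a larger $\mu$ yields a tighter (less flat) quadratic-type bound.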


Table 1. PL condition of the L-layer linear DNNs.


Fig. 3. Constants μ for the PL-condition of the 10-layer linear DNNs.


Table 2. Convergence of the L-layer linear DNNs.


Fig. 4. Training loss (solid lines) and test loss (dotted lines) of the 10-layer DNNs with the ReLU activation.


Fig. 5. Approximation of the stability $\epsilon _{stab}$ of the 10-layer DNNs with the ReLU activation.


Fig. 6. Activation rate of the hidden units in each layer.


Fig. 7. Mean eigenvalues and maximum eigenvalues of the expected FIM.
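The FIM eigenvalues referenced in this caption can be computed directly for simple models. The sketch below is an illustrative assumption, not the paper's setup: for a linear model with squared loss and Gaussian noise of unit variance, the FIM reduces to the input second-moment matrix $E[xx^\top]$, whose largest eigenvalue roughly caps the usable learning rate for gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

# For f(x) = w @ x with squared loss and unit-variance Gaussian noise,
# the Fisher information matrix is F = E[x x^T]; estimate it empirically.
N, d = 1000, 5
X = rng.normal(size=(N, d))
F = X.T @ X / N                     # empirical FIM estimate
eigvals = np.linalg.eigvalsh(F)     # F is symmetric, so use eigvalsh

print("mean eigenvalue:", eigvals.mean())
print("max  eigenvalue:", eigvals.max())
```

Comparing the mean and maximum eigenvalues, as Fig. 7 does for DNNs, indicates how anisotropic the loss landscape is and hence how small the learning rate must be.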


Fig. 8. Training loss under various settings. Red lines: theoretical lower-bounds of the maximum learning rates. White color: divergence (the loss ${>}1000$). (a) ResNet. (b) ResNet with BN.


Fig. 9. Loss of the 4-layer DNNs (solid line: average, shaded area: within one s.d.).