Poster
Global Convergence of Gradient Descent for Deep Linear Residual Networks
Lei Wu · Qingcan Wang · Chao Ma
East Exhibition Hall B, C #201
Keywords: [ Optimization for Deep Networks ] [ Deep Learning ] [ Optimization -> Non-Convex Optimization ] [ Theory -> Computational Complexity ] [ Theory ] [ Learning Theory ]
Abstract:
We analyze the global convergence of gradient descent for deep linear residual
networks by proposing a new initialization: the zero-asymmetric (ZAS)
initialization, which is motivated by avoiding the stable manifolds of saddle points.
We prove that under the ZAS initialization, for an arbitrary target matrix,
gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3
\log(1/\varepsilon) \right)$ iterations, which scales polynomially with the
network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the
standard initialization (Xavier or near-identity)
\cite{shamir2018exponential} together demonstrate the importance of the
residual structure and the initialization in the optimization of deep linear
neural networks, especially when $L$ is large.
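
To make the setting concrete, below is a minimal numpy sketch of gradient descent on a deep linear residual network $f(x) = (I + W_L)\cdots(I + W_1)x$ fitted to an arbitrary target matrix $\Phi$ under the squared Frobenius loss. The parameterization, the zero-residual initialization, and the step size are illustrative assumptions, not the authors' code; the exact ZAS construction and step-size condition behind the $O(L^3 \log(1/\varepsilon))$ guarantee are those given in the paper.

```python
import numpy as np

# Minimal sketch (not the authors' code): full-batch gradient descent on a deep
# linear residual network f(x) = (I + W_L) ... (I + W_1) x, fitted to an
# arbitrary target matrix Phi under the loss
#     loss(W) = 0.5 * || (I + W_L) ... (I + W_1) - Phi ||_F^2.
# The zero-residual initialization and step size below are placeholders only;
# the paper's zero-asymmetric (ZAS) scheme and its step-size condition are what
# the O(L^3 log(1/eps)) guarantee actually requires.

def product(Ws):
    """Return P = (I + W_L) @ ... @ (I + W_1)."""
    d = Ws[0].shape[0]
    P = np.eye(d)
    for W in Ws:                                  # W_1 first, W_L last
        P = (np.eye(d) + W) @ P
    return P

def loss(Ws, Phi):
    return 0.5 * np.linalg.norm(product(Ws) - Phi) ** 2

def gradients(Ws, Phi):
    """Gradient of the loss with respect to each residual block W_l."""
    d, L = Ws[0].shape[0], len(Ws)
    prefix = [np.eye(d)]                          # prefix[l] = (I+W_l)...(I+W_1)
    for W in Ws:
        prefix.append((np.eye(d) + W) @ prefix[-1])
    suffix = [np.eye(d)]                          # built from the top layer down
    for W in reversed(Ws):
        suffix.append(suffix[-1] @ (np.eye(d) + W))
    suffix = suffix[::-1]                         # suffix[l] = (I+W_L)...(I+W_{l+1})
    R = prefix[-1] - Phi                          # residual P - Phi
    return [suffix[l + 1].T @ R @ prefix[l].T for l in range(L)]

def train(Phi, L, lr=1e-2, n_steps=2000):
    d = Phi.shape[0]
    Ws = [np.zeros((d, d)) for _ in range(L)]     # placeholder init, not ZAS
    for _ in range(n_steps):
        Ws = [W - lr * G for W, G in zip(Ws, gradients(Ws, Phi))]
    return Ws

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((4, 4))             # arbitrary target matrix
    Ws = train(Phi, L=16)
    # Whether (and how fast) the loss approaches zero depends on the
    # initialization, which is exactly the point the paper makes.
    print("final loss:", loss(Ws, Phi))
```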