Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitrios Papailiopoulos
Spotlight presentation: Orals & Spotlights Track 08: Deep Learning
on 2020-12-08T08:20:00-08:00 - 2020-12-08T08:30:00-08:00
Poster Session 2
on 2020-12-08T09:00:00-08:00 - 2020-12-08T11:00:00-08:00
GatherTown: Deep learning ( Town A4 - Spot A2 )
Abstract: The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. [MYSS20] establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$ by pruning a random one that is a factor $O(d^4 l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are only a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on connecting pruning random ReLU networks to random instances of the Subset Sum problem. We then show that this logarithmic over-parameterization is essentially optimal for constant-depth networks. Finally, we verify several of our theoretical insights with experiments.
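To make the Subset Sum connection concrete, below is a minimal, self-contained Python sketch (not the authors' code; the function name `closest_subset_sum` and the sample count are illustrative assumptions). It demonstrates the random Subset Sum phenomenon the analysis builds on, due to Lueker (1998): with only $n = O(\log(1/\epsilon))$ i.i.d. Uniform$[-1,1]$ samples, every target in $[-1,1]$ admits a subset whose sum is $\epsilon$-close to it, with high probability. In the paper's construction, the samples play the role of products of random weights along paths through the extra layer, and pruning corresponds to selecting the subset.

```python
import itertools
import random

def closest_subset_sum(samples, target):
    """Brute-force the subset of `samples` whose sum is closest to `target`."""
    best = 0.0  # the empty subset sums to 0
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(samples, r):
            s = sum(subset)
            if abs(s - target) < abs(best - target):
                best = s
    return best

random.seed(0)
n = 16  # roughly C * log(1/eps) samples suffice w.h.p.; 2^16 subsets is fine for a demo
samples = [random.uniform(-1.0, 1.0) for _ in range(n)]
for target in (0.3, -0.75, 0.99):
    approx = closest_subset_sum(samples, target)
    print(f"target={target:+.2f}  approx={approx:+.6f}  error={abs(approx - target):.2e}")
```

Running this, each target is typically matched to within roughly $2^{-n}$, illustrating why only a logarithmic factor of extra random weights per target weight is needed.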