Git Re-Basin @ DLCT
This is a talk delivered at the (usually not recorded) weekly journal club "Deep Learning: Classics and Trends" (mlcollective.org/dlct/).
Speaker: Samuel Ainsworth
Title: Git Re-Basin: Merging Models modulo Permutation Symmetries
Abstract: The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. (2021). We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
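The merging procedure described in the abstract, permuting the hidden units of one model to align with a reference model, then interpolating in weight space, can be sketched for a toy two-layer MLP. This is a simplified illustration under stated assumptions, not the paper's actual algorithms: it uses greedy one-layer weight matching on an exact permuted copy, whereas the paper solves a proper assignment problem across layers, and all names below are hypothetical.

```python
import numpy as np

def forward(x, W1, b1, W2):
    # Two-layer MLP: y = W2 . relu(W1 x + b1)
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

def apply_perm(W1, b1, W2, perm):
    # Reorder the hidden units; the network's function is unchanged,
    # which is exactly the permutation symmetry the paper exploits.
    return W1[perm], b1[perm], W2[:, perm]

rng = np.random.default_rng(0)
d_in, d_h, d_out = 32, 8, 3
W1_a = rng.normal(size=(d_h, d_in))
b1_a = rng.normal(size=d_h)
W2_a = rng.normal(size=(d_out, d_h))

# "Model B": functionally identical to A, but with shuffled hidden units.
shuffle = rng.permutation(d_h)
W1_b, b1_b, W2_b = W1_a[shuffle], b1_a[shuffle], W2_a[:, shuffle]

# Weight matching: for each unit of A, pick the most similar unit of B.
# (Greedy argmax suffices for this exact-copy toy case; the paper solves
# a linear assignment problem for trained, non-identical models.)
perm = np.argmax(W1_a @ W1_b.T, axis=1)
W1_p, b1_p, W2_p = apply_perm(W1_b, b1_b, W2_b, perm)

# Merge in weight space at the interpolation midpoint (lambda = 0.5).
lam = 0.5
W1_m = (1 - lam) * W1_a + lam * W1_p
b1_m = (1 - lam) * b1_a + lam * b1_p
W2_m = (1 - lam) * W2_a + lam * W2_p

x = rng.normal(size=d_in)
# After alignment, the merged weights here coincide with A's, so the
# merged model reproduces A's outputs exactly in this toy setting.
```

With independently trained real models the aligned weights only lie in an approximately convex basin, so the interpolated model is evaluated along the full path lambda in [0, 1] to measure the loss barrier.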
Speaker bio: Samuel Ainsworth is a Senior Research Scientist at Cruise AI Research, where he studies imitation learning, robustness, and efficiency. He completed his undergraduate degree in Computer Science and Applied Mathematics at Brown University and received his PhD from the School of Computer Science and Engineering at the University of Washington. His research interests span reinforcement learning, deep learning, programming languages, and drug discovery. He has previously worked on recommender systems, Bayesian optimization, and variational inference at organizations such as The New York Times and Google.
Paper link: arxiv.org/abs/2209.04836
Comments: 2
This was a great talk! I missed the live talk. Thanks for recording this one.
Great talk! One point: the argument for why lambda seemingly lands at 0.5 doesn't seem right. Because these cases are chosen with random seeds, all you can expect is that the distribution of lambda is peaked at 0.5 (over lots and lots of seeds); it doesn't follow by symmetry that it would be exactly 0.5 in any single run. That seems to warrant an explanation.