Efficient Pre-training @ DLCT

Science & Technology

This is a talk delivered at the (usually not recorded) weekly journal club "Deep Learning: Classics and Trends" (mlcollective.org/dlct).
Speaker: Sunny Sanyal
Title: Pre-training with a little less data and compute
Abstract: Pre-training LLMs is all about extremely large data and compute. For instance, the newly released Llama 3 models were trained on 15 trillion tokens with 16K GPUs. In this talk, we will discuss two efficient pre-training techniques: Latest Weight Averaging (LAWA) and Inheritune. We begin with LAWA, demonstrating that checkpoint averaging during pre-training accelerates convergence and improves test generalization. The benefits of checkpoint averaging are more pronounced with higher learning rates and when averaging more distant checkpoints. We then discuss Inheritune, a straightforward yet effective method for pre-training smaller base language models (LMs). Using this technique, we trained a 1.5-billion-parameter base LM with just 1 billion tokens by leveraging a larger pre-trained reference model, using a single GPU for less than half a day. Our small model performs comparably to similarly sized models that were pre-trained on 50 to 1,000 times more data. We present findings on billion-parameter Pythia and OpenLLaMA models as well as smaller-scale GPT-2 models.
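
To make the two ideas concrete, here are two minimal sketches (illustrative assumptions only; file names, model choices, and layer counts are mine, not details from the talk or the papers).

A LAWA-style update simply averages the weights of several recent pre-training checkpoints before evaluation (assuming each file stores a plain PyTorch state_dict):

import torch

def average_checkpoints(paths):
    # Running sum of parameters across checkpoints, then divide by the count.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. model.load_state_dict(average_checkpoints(["ckpt_10k.pt", "ckpt_12k.pt", "ckpt_14k.pt"]))

An Inheritune-style initialization copies the embeddings and the first few transformer blocks of a larger pre-trained reference model into a smaller model, which is then pre-trained further on a small token budget (sketched here with Hugging Face GPT-2 classes; "gpt2-large" and n_layer=6 are arbitrary choices):

from transformers import GPT2Config, GPT2LMHeadModel

reference = GPT2LMHeadModel.from_pretrained("gpt2-large")  # larger reference model
small = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2-large", n_layer=6))  # shallower target

# Inherit the embeddings and the first n transformer blocks from the reference.
small.transformer.wte.load_state_dict(reference.transformer.wte.state_dict())
small.transformer.wpe.load_state_dict(reference.transformer.wpe.state_dict())
for i in range(len(small.transformer.h)):
    small.transformer.h[i].load_state_dict(reference.transformer.h[i].state_dict())
# ... then continue pre-training `small` on a small (roughly 1B-token) dataset.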
Speaker bio: Sunny Sanyal is a PhD student at UT Austin, advised by Prof. Sujay Sanghavi in the Department of Electrical and Computer Engineering. He is currently working on efficient training strategies for large models. He was an intern at Amazon Alexa in the Summer of 2022, where he worked with their multimodal team. He has received the Ram's Horn Best Class Project Award and the Excellent Master's Thesis Award. Some of his recent research has been featured in Ahead of AI magazine, Marktechpost's newsletter, and the Interconnects newsletter.
Paper links:
arxiv.org/abs/2306.03241
arxiv.org/abs/2404.08634
