Aligning LLMs with Direct Preference Optimization


In this workshop, Lewis Tunstall and Edward Beeching from Hugging Face will discuss a powerful alignment technique called Direct Preference Optimization (DPO), which was used to train Zephyr (arxiv.org/abs/2310.16944) and is rapidly becoming the de facto method for boosting the performance of open chat models.
By the end of this workshop, attendees will:
Understand the steps involved in fine-tuning LLMs for chat applications.
Learn the theory behind Direct Preference Optimization and how to apply it in practice with the Hugging Face TRL library (a minimal code sketch follows after this list).
Know what metrics to consider when evaluating chat models.
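
As a taste of the hands-on portion, here is a minimal sketch of preference tuning with TRL's DPOTrainer. It is illustrative only: keyword arguments differ across TRL releases (newer versions use DPOConfig), and the tiny model and toy preference pairs below are stand-ins rather than the workshop's Zephyr recipe, which lives in the notebooks linked further down.

# Minimal, self-contained DPO sketch with Hugging Face TRL (older-release API).
# The tiny model and toy preference data are illustrative assumptions, not the
# workshop's actual configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # tiny stand-in; Zephyr starts from mistralai/Mistral-7B-v0.1
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

# DPO learns from preference pairs: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") response, e.g. drawn from a dataset like UltraFeedback.
preference_data = Dataset.from_dict({
    "prompt":   ["What is DPO?", "Name the capital of France."],
    "chosen":   ["DPO aligns a language model directly on preference pairs.", "Paris."],
    "rejected": ["No idea.", "London."],
})

training_args = TrainingArguments(
    output_dir="dpo-sketch",
    per_device_train_batch_size=2,
    learning_rate=5e-7,            # DPO typically uses a much smaller learning rate than SFT
    num_train_epochs=1,
    logging_steps=1,
    remove_unused_columns=False,   # keep the raw text columns for DPOTrainer
    report_to="none",
)

trainer = DPOTrainer(
    model,
    ref_model=None,   # with None, TRL keeps a frozen copy of the model as the reference
    args=training_args,
    beta=0.1,         # how strongly the policy is kept close to the reference model
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
trainer.train()

The beta term trades off widening the gap between chosen and rejected responses against staying close to the frozen reference model; in the Zephyr recipe, DPO is applied to a model that has already been supervised fine-tuned on chat data, which is the first stage the workshop walks through.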
Take a moment to register for our community forum:
bit.ly/48UIIve
Take a moment to register for our short courses here:
bit.ly/420iXHx
Workshop Notebooks:
Notebook #1:
colab.research.google.com/dri...
Notebook #2:
colab.research.google.com/dri...
Slides:
docs.google.com/presentation/...
About DeepLearning.AI
DeepLearning.AI is an education technology company that is empowering the global workforce to build an AI-powered future through world-class education, hands-on training, and a collaborative community. Take your generative AI skills to the next level with short courses that help you learn new skills, tools, and concepts efficiently.
About Hugging Face
Hugging Face is an AI company specializing in natural language processing (NLP) and machine learning, and is known for its open-source contributions and collaborative approach to AI research and development. The company is famous for developing the Transformers library, which offers a wide range of pretrained models and tools for a variety of NLP tasks, making it easier for researchers and developers to implement state-of-the-art AI solutions. Hugging Face also fosters a vibrant community for AI enthusiasts and professionals, providing a platform for sharing models, datasets, and research, which significantly contributes to the advancement of AI technology.
Speakers:
Lewis Tunstall, Machine Learning Engineer, Hugging Face
/ lewis-tunstall
Edward Beeching, Research Scientist, Hugging Face
/ ed-beeching-3553b468

Comments: 18

  • @PritishYuvraj (3 months ago)

    Excellent comparison of PPO and DPO! Kudos

  • @eliporter3980 (4 months ago)

    I'm learning a lot from these talks, thank you for having them.

  • @NitinPasumarthy (4 months ago)

    The best content I have seen in a while. Enjoyed both the theory and the practical notes from both speakers! Huge thanks to dlai for organizing this event.

  • @vijaybhaskar5333 (4 months ago)

    Excellent topic. Well explained. One of the best videos on this subject I have seen recently. Continue your good work 😊

  • @katie-48 (4 months ago)

    Great presentation, thank you very much!

  • @user-rx5pp3hh1x (4 months ago)

    cut to the chase - 3:30
    questions on DPO - 27:37
    practical deep-dive - 30:19
    question - 53:32

  • @amortalbeing (4 months ago)

    This was amazing, thank you everyone. One thing though, if that's possible: it would be greatly appreciated if you could record in 1080p so that the details/text on the slides are visible and easier to consume. Thanks a lot again.

  • @MatijaGrcic (4 months ago)

    Check out the notebooks and slides in the description.

  • @amortalbeing (4 months ago)

    @MatijaGrcic Thanks a lot, downloaded the slides.

  • @jeankunz5986 (4 months ago)

    Great presentation. Congratulations.

  • @PaulaLeonova (4 months ago)

    At 29:40 Lewis mentions an algorithm that requires fewer training samples. What is its name? I heard "data", but I don't think that is correct. If anyone knows, would you mind replying?

  • @user-rx5pp3hh1x (4 months ago)

    Possibly this paper: "Rethinking Data Selection for Supervised Fine-Tuning" (arxiv.org/pdf/2402.06094.pdf)

  • @ralphabrooks (4 months ago)

    I am also interested in hearing more about this "data" algorithm. Is there a link to a paper or blog on it?

  • @austinmw89 (4 months ago)

    Curious if you compared SFT on all data vs. training on completions only?

  • @TheRilwen (4 months ago)

    I'm wondering why simple techniques, such as sample boosting, increasing errors for highly ranked examples, or an attention layer, wouldn't work in place of RLHF. It seems like a very convoluted and inefficient way of doing a simple thing - which convinces me that I'm missing something :-)

  • @iseminamanim (4 months ago)

    Interested

  • @MacProUser99876 (4 months ago)

    How DPO works under the hood: kzread.info/dash/bejne/fKlh0qiDfsm1lrw.html
