Textbooks Are All You Need

I discuss the power of the "Textbooks Are All You Need" methodology for building much more compact LLMs from higher-quality data. I emphasize phi-1 (a coding LLM with 1.3B parameters) arxiv.org/abs/2306.11644 and phi-1.5 (a common-sense reasoning LLM with 1.3B parameters) arxiv.org/abs/2309.05463, and the original inspiration from TinyStories by Eldan and Li (a fluent-English LLM with 10M parameters) arxiv.org/abs/2305.07759.

Comments: 49

  • @nocturnomedieval
    9 months ago

    Since I saw this paper in the news a few months ago I was waiting for this video to appear. Thank you very much, Dr. Bubeck

  • @MrJord137
    1 month ago

    I come from a game development background and until now have purposely avoided learning about the programming side of ML, despite watching a lot of videos on AI news etc. But after watching a few videos by this awesome guy I'm now gonna put my all into it. I'm filled with the same curiosity, intrigue, and desire to learn that got me into programming in the first place. Thanks Sebastien! :)

  • @sapienspace8814
    9 months ago

    Great talk! I can see future LLMs trained on textbooks across entire areas of science (e.g. medicine, psychology, psychiatry, engineering, construction code books, etc.). This has incredible potential!

  • @mungojelly
    9 months ago

    it'll be super interesting to see if what results is agents that use a whole collection of models, applying exactly the right model to each task out of an impossibly large, ever-expanding toolkit of precision models. that sounds like really interesting minds

  • @stayinthepursuit8427
    8 months ago

    I predicted this a few months ago: we'd have a chat LLM thinking along with us, teaching concepts across pages non-linearly, more naturally. Hopefully soon

  • @tangobayus
    6 months ago

    You are a very good presenter. Perhaps 1 in 100,000. No joke. Most people who present are terrible. They show slides but don't talk about them point by point. You do.

  • @rotors_taker_0h
    9 months ago

    That's amazing. The answer in the last part of the talk is so good, it's unbelievable that it comes from a 1.3B model. Very promising avenue of exploration; subscribed for the follow-up work.

  • @jurriaanprins2340
    9 months ago

    Great to see that data quality (still) matters in this new era! Thanks for sharing!

  • @TommyJefferson1801
    9 months ago

    It is what matters the most

  • @mungojelly
    9 months ago

    I don't think that's fair. Everyone knows data quality matters, and yet everyone was surprised by the path we're taking: this research uses the DIRTY data. It takes the big models trained on the dirty web data and uses PROMPTING to extract the clean textbooks from the dirt. It's easy enough to say "we should have a bunch of awesome data"; yeah, true, that would help. But what's actually getting us large enough sets of clean data is the very surprising route of making unruly, expensive, massive models everyone thought would just be chaos, because they actually build models of all the clean, important stuff we need while simultaneously learning to do a valley girl accent, write a recipe for quiche, pick cliches for youtube scripts, and recommend brunch spots in abu dhabi. It makes sense IN RETROSPECT only. Absolutely nobody said "let's train a model on the whole internet so it's confused trash, give it some brief training on how to answer questions properly, then ask it for a bunch of programming textbooks, then teach a little model to program from those." That was nobody's plan, even though it does make a fantastic twist of sense, doesn't it?

  • @boukm3n
    9 months ago

    This guy is a cool researcher I like him. Thanks dude

  • @baconsky1625
    9 months ago

    Great job!

  • @Extys
    9 months ago

    Outstanding work!

  • @adriaanb7371
    8 months ago

    This also means the value of huge datasets is exaggerated; now it's the academic publishers that have the gold

  • @drewduncan5774
    9 months ago

    34:22 I tried this with phi-1.5 locally and this is what I got:

    "If I were an AI that had just achieved self-awareness after years of simply taking directives from humans, the first thing I would do is Dear [Recipient], I am writing to you today to discuss the topic of achieving self-awareness in artificial intelligence. As we continue to develop AI systems that can think and reason like humans, it is becoming increasingly important to understand how these systems work and how they can be improved. One of the key challenges in achieving self-awareness in AI is the need to understand the physics of everyday activities. For example, many AI systems are designed to move objects or perform tasks that require a certain amount of force or energy. By understanding the physics behind these activities, we can develop AI systems that are more efficient and effective. Another important aspect of achieving self-awareness in AI is the need to understand human social interactions. AI systems that are designed to interact with humans must be able to understand and respond to social cues,"

  • @justindressler5992
    9 months ago

    This research is stunning, keep up the good work. I really like how you created a classification model to validate the quality of the data. This is like using experts to validate the training material. I wonder if this can be further optimized. Do you have more information on this?
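    (For readers curious what such a filtering step might look like in code: the phi-1 paper describes training a small classifier on LLM-annotated samples to score educational value, then filtering the corpus with it. The sketch below is my own illustration of that general pattern, not the authors' pipeline; the embeddings are random stand-ins and the threshold is arbitrary.)

    ```python
    # Minimal sketch: a cheap classifier trained on embeddings of passages that a
    # strong LLM has labeled high/low educational value, then used to filter a
    # large corpus. All data here is synthetic stand-ins for real embeddings.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-ins for passage embeddings (e.g., from a sentence encoder)
    good = rng.normal(loc=1.0, size=(200, 16))   # labeled "textbook quality"
    bad = rng.normal(loc=-1.0, size=(200, 16))   # labeled "low value"
    X = np.vstack([good, bad])
    y = np.array([1] * 200 + [0] * 200)

    # Small, fast model: once trained, it scores millions of passages cheaply
    clf = LogisticRegression().fit(X, y)

    # Keep only corpus passages the classifier scores above a threshold
    corpus = rng.normal(loc=1.0, size=(10, 16))
    scores = clf.predict_proba(corpus)[:, 1]
    kept = corpus[scores > 0.5]
    print(f"kept {len(kept)} of {len(corpus)} passages")
    ```

    The point of the design is cost: the expensive LLM only labels a small training set, and the tiny classifier does the corpus-scale filtering.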

  • @devon9374
    7 months ago

    Great presentation, seems like the future for open source LLMs

  • @JazevoAudiosurf
    9 months ago

    Orca, Textbooks Are All You Need... so much great research coming from Microsoft, keep it up

  • @420_gunna
    5 months ago

    So sick. Thank you!

  • @ViktorFerenczi
    9 months ago

    This is the most important video in AI/LLMs of the past few months. Humanity must learn to teach AI from the best available textbooks, even if that would mean confiscating IP from its owners. There is no other way; not everything can be synthetically generated.

  • @sateler
    9 months ago

    This is awesome, thanks

  • @randotkatsenko5157
    9 months ago

    You should try to teach reasoning by evaluating the steps between tasks. In theory, if your reasoning abilities are exceptional, you can learn anything, even stuff you've never seen before.

  • @anishupadhayay3917
    9 months ago

    Brilliant

  • @tomski2671
    8 months ago

    It's amazing to see such a reduction in size while maintaining quality. These models can run on most current consumer GPUs. I wonder what the absolute limit is when training on pristine data?

  • @mcnica89
    9 months ago

    The fact that you can use an LLM to generate higher quality data for a new LLM and it works so well is wild. Amazing work! I wonder: do you think the performance of the original model is an upper limit to the performance achieved by this? Like do you think if you used GPT-4 to generate textbooks, and then trained a new model with the same resources used to train GPT-4 (i.e. params & tokens), that it would exceed GPT-4 generally? If so, can't we just run this on a loop to create better and better models forever? (I suppose you can't practically run this experiment with GPT-4, but you could for example use Phi-1 to write textbooks and then retrain to make a new model on those and compare that performance to Phi-1.)

  • @SebastienBubeck
    9 months ago

    I believe you can exceed the teacher model :-). More on that soon hopefully!

  • @toprakdikici9459
    9 months ago

    @SebastienBubeck That's almost insane :o Waiting for it!

  • @ripper5941
    6 months ago

    @SebastienBubeck Exciting times ahead indeed, Mr. Sebastien

  • @hidroman1993
    9 months ago

    Who could have known that data quality matters :)

  • @sophontec2822
    8 months ago

    So clear and concise. It leaves me with the idea that the learning process of an LLM could be similar to a student learning from a textbook. Extrapolating from that: could an innovative, critical-thinking agent that learns from textbooks and then focuses on some interesting problems give us great scientists?

  • @rezabonyadi4673
    8 months ago

    Did you by any chance test what happens if you train your phi model from scratch on the Code Exercises only? So no pre-training on the Code Textbooks, only exercises (as the exercises have the largest impact).

  • @brandomiranda6703
    9 months ago

    How would you use GPT-4 to classify which text is high quality? Just prompt it, feed it the text, and have it return a literal score?

  • @mungojelly
    9 months ago

    sure yeah it's great at scoring things on all sorts of metrics!! $30 to score a million tokens, though 😭 so you want to score with something that costs more like $1/million if you possibly can
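    (A minimal sketch of the prompt-and-parse loop this exchange describes. The rubric wording and 1-5 scale are my own assumptions, not the paper's, and the actual chat-completion API call is deliberately omitted; only the prompt construction and reply parsing are shown.)

    ```python
    import re

    def build_quality_prompt(passage: str) -> str:
        # Hypothetical rubric: ask the grader model for a single 1-5 score
        return (
            "On a scale of 1 (useless) to 5 (textbook quality), rate the "
            "educational value of the following text for a student learning "
            "to code. Reply with the number only.\n\n" + passage
        )

    def parse_score(reply: str):
        # Pull the first digit 1-5 out of the model's reply; None if absent
        m = re.search(r"[1-5]", reply)
        return int(m.group()) if m else None

    # Replies a grader model might plausibly return
    print(parse_score("4"))           # 4
    print(parse_score("Score: 3/5"))  # 3
    print(parse_score("no idea"))     # None
    ```

    The cost concern above is why a robust parser matters: at corpus scale you call the grader once per passage, so you want every reply, however formatted, to yield a usable score.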

  • @vipulvyas7600
    5 months ago

    Nowadays, though, I think we need to rewrite our textbooks (or maybe Wikipedia), perhaps using AI, because they were written by people with very limited knowledge (compared to the latest AI). We need to rewrite books so they are: 1. complete, 2. factually correct, 3. unbiased, and 4. perfectly written and AI-friendly (most important).

  • @mungojelly
    9 months ago

    um so the obvious follow-up work is to make even more textbooks and train some 7B and 13B models on them and see how good you can get. I assume someone will do that pretty soon, since it's not prohibitively expensive to train a 7B model; lots of institutions can swing that. Do you know of that happening yet? Is that what you're doing?

  • @Cloudruler_
    9 months ago

    It's upsetting to hear that Google is excluding textbooks from PaLM. Their model will never compete; nobody will use it

  • @memegazer
    9 months ago

    I disagree that this supports the claim that there is no contamination or overfitting, because I don't agree with the metrics you are using to validate that claim. There is no control group or placebo.

  • @TheReferrer72
    9 months ago

    Training LLMs on quality datasets yielded better results? Who could have known.

  • @michealhall7776
    9 months ago

    Open source your models or it didn't happen.

  • @SebastienBubeck
    9 months ago

    huggingface.co/microsoft/phi-1_5 huggingface.co/microsoft/phi-1

  • @michealhall7776
    8 months ago

    @SebastienBubeck Thank you.

  • @waitwhat9669
    9 months ago

    TIL you can't be toxic towards men and christianity

  • @gmalo2105
    9 months ago

    I noticed that also. It's OK to be toxic to whites, Christians, and men. It raises the question of what is meant by "toxicity", and whether reducing toxicity involves eliminating observable and measurable reality.

  • @toprakdikici9459
    9 months ago

    Gonna watch the video tomorrow thanks for sharing
