Generative Models Can Outperform The Experts That Train Them

Transcendence: Generative Models Can Outperform The Experts That Train Them
arxiv.org/abs/2406.11741v1

Comments: 58

  • @iansotir6777 · 9 days ago

    I find this white paper incredibly misleading. The researchers even admit that the reason training on 1500-elo data doesn't result in 2000-elo play is that 1000-elo players are just "noisy" 1500-elo players. In other words, the 1000-elo chess players have that ranking in part because they make a high frequency of major blunders. The low-temperature model effectively smooths out those major errors and simply makes fewer blunders. This is why it's not repeatable at higher elo and not a sign of any generalization outside of chess.

  • @iansotir6777 · 9 days ago

    *low level chess
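The blunder-smoothing argument above can be sketched as a toy simulation (the numbers and move names here are hypothetical, not from the paper): a noisy low-rated expert plays the best move most of the time and blunders otherwise, and taking the argmax of the learned move distribution (temperature → 0) filters the blunders out.

```python
import random
from collections import Counter

random.seed(0)

BEST, BLUNDERS = "Nf3", ["h4", "a4", "Ke2"]

def noisy_expert():
    """A hypothetical '1000-elo' expert: best move 70% of the time, a random blunder otherwise."""
    if random.random() < 0.7:
        return BEST
    return random.choice(BLUNDERS)

# "Training data": many games from noisy experts at the same position.
moves = [noisy_expert() for _ in range(10_000)]

# The model's learned move distribution is roughly the empirical frequency.
counts = Counter(moves)
dist = {m: c / len(moves) for m, c in counts.items()}

# Temperature -> 0 sampling is just argmax: the blunders are filtered out
# even though they make up ~30% of the training data.
low_temp_move = max(dist, key=dist.get)
print(low_temp_move)  # prints Nf3
```

The point of the sketch is that no single blunder comes close to the 70% mass on the best move, so low-temperature sampling never plays one.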

  • @SiiKiiN · 9 days ago

    To understand this better, consider two functions: f(n) = 3, and g(n) = 3 with 80% probability, otherwise a random value in the interval 0–10. An LLM trained on f yields a model that predicts the next token is 3 with 100% probability, whereas an LLM trained on g yields a model that predicts 3 with 80% and spreads the remaining 20% across the interval 0–10. The model trained on g therefore approaches the model trained on f as temperature approaches 0. If f performs better than g on a benchmark, you can then say the model trained on g performs better as temperature decreases.
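This can be checked numerically; a minimal sketch, assuming the 80/20 split described above and the usual temperature rescaling q_i ∝ p_i^(1/T):

```python
import numpy as np

# Hypothetical next-token distribution for a model trained on the noisy
# function g: 80% mass on the token "3" (index 0), 20% spread elsewhere.
p = np.array([0.80, 0.05, 0.05, 0.05, 0.05])

def temper(p, T):
    """Rescale a distribution to temperature T: q_i proportional to p_i^(1/T)."""
    q = p ** (1.0 / T)
    return q / q.sum()

# As T -> 0 the mass on "3" approaches 1, i.e. the model trained on g
# approaches the model trained on f.
for T in (1.0, 0.5, 0.1):
    print(f"T={T}: P(3) = {temper(p, T)[0]:.4f}")
```

At T = 1 the model reproduces g's noise; by T = 0.1 virtually all mass sits on 3.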

  • @Tunadorable · 9 days ago

    the goal here wasn’t to generalize outside of chess lmao, just to show the potential of model performance in spite of noisy data. don’t get me wrong, i certainly went a bit overboard with the clickbait-y-ness of the thumbnail, but it’s pretty easy to imagine ways in which this result may be important. for example, a foundation model trained on time-series data, a modality that usually contains an absurd amount of noise, might be able to take advantage of this phenomenon and predict far better than previous non-deep-learning methods, which mostly consist of progressively fancier versions of exponential moving averages

  • @BrianMosleyUK · 9 days ago

    Great find, great paper and great walk through! I have a feeling there's more to be found in this direction of research. Well worth following this thread in coming months. Thanks!

  • @BradleyKieser · 9 days ago

    They rediscovered Parrondo's paradox. Interesting to see it applied in this context.

  • @Tunadorable · 9 days ago

    I’ve not heard of that, I’ll give it a google

  • @goodtothinkwith · 7 days ago

    That’s very interesting. I’d love to know what connection there is between Parrondo’s paradox and grokking

  • @jakeaustria5445 · 9 days ago

    I kinda have a guess for why 1500-trained models have difficulty transcending to 2000 but the low-rated players can. I am a noob chess player myself, so I'm kinda saying this from experience. Human chess players make a bad move when they don't know what they're doing; they just try an educated guess for their next move. Since these educated guesses can be treated as a stochastic process that cancels out when averaged over an ensemble, you can recover the move of players who do know what they're doing.

    This doesn't apply above 1500, because those players do need to know specific things, and it relies more on qualities within the player. To make the best move above 1500 (I am 1300–1400), you need to be able to notice tactical patterns. This ability to notice things is more an experience problem than a knowledge problem. 1500 players have a general knowledge of chess, but I don't think most of them have enough years of play to notice tactical patterns easily.

    To summarize: a knowledge problem can be solved through transcendence, since solid knowledge is left over after guessed knowledge averages out. Experience is not the same thing. You can imagine players as having a time-played counter in their heads, and most of them have clearly less time playing chess than those in the 2000s (exceptions are prodigies, but they are a small part of the population). Averaging doesn't help, because most of the 1500 experts in your data lack the experience to notice and calculate the best moves. The best you can get is a very small part of the population who do find the best move, plus clusters of majority moves played by less experienced players.

  • @ATH42069 · 9 days ago

    thanks, boss.

  • @user-kp4sf9lc2u · 10 days ago

    Production changes are great

  • @_paixi · 9 days ago

    In business there is the second-mover advantage, where you can copy the moves of the first movers while avoiding their mistakes to outcompete them, but as a market matures the mistakes become less common and more similar, so this advantage is lost. It would be interesting to see future work train a model to exploit this phenomenon by identifying weak moves and avoiding them, showing that you can always match or beat the best trajectories trained on. I would bet, though, that exploration is required to discover better strategies when performing at the highest level.

  • @minecraftermad · 7 days ago

    So what this would be good at is pointing out mistakes, not really knowing chess. But I think pointing out mistakes is an important tool for something, eventually.

  • @marinepower · 9 days ago

    I always wondered if models like these learn to play as well as the best players (since that provides the most logical / consistent signal) and then nerf themselves to emulate lower-level players. Seems like that might very well be the case! It makes you wonder how good some models might actually be if we somehow removed the 'self-sabotage' mechanisms from within them.

  • @goodtothinkwith · 7 days ago

    Really good stuff… but they seriously need to try this with a more sophisticated approach to chess. Only having vectors of moves with no board knowledge… that's a great starting point. I was hoping they'd do more after that, though.

  • @GNARGNARHEAD · 9 days ago

    oi, audio's good.. damn that's a cool paper 👍

  • @spencerfunk6697 · 10 days ago

    Mic quality makes a world of difference idk sounds way less chaotic

  • @goodtothinkwith · 7 days ago

    This seems like good evidence that Chollet is wrong

  • @ckq · 9 days ago

    Good video, explained very well, but the underlying concept is pretty simple: That's how markets work - the average of all knowledge is better than any individual (wisdom of the crowds).
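The wisdom-of-crowds effect this comment invokes is easy to demonstrate (a toy sketch with made-up numbers): each individual estimate of a quantity carries independent noise, and the crowd average lands far closer to the truth than a typical individual.

```python
import random
import statistics

random.seed(1)

TRUE_VALUE = 100.0

def individual_estimate():
    """One person's guess: the truth plus independent Gaussian noise."""
    return TRUE_VALUE + random.gauss(0, 20)

estimates = [individual_estimate() for _ in range(1_000)]
crowd = statistics.fmean(estimates)

# Typical individual error vs. the error of the averaged crowd estimate.
individual_error = statistics.fmean(abs(e - TRUE_VALUE) for e in estimates)
crowd_error = abs(crowd - TRUE_VALUE)
print(individual_error, crowd_error)  # the crowd error is far smaller
```

With independent noise the crowd error shrinks roughly as 1/√n, which is the same averaging-out mechanism the paper relies on.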

  • @Tunadorable · 9 days ago

    love the connection to markets

  • @drdca8263 · 10 days ago

    The impressive-sounding-ness of the name of this paper seems to me not necessarily justified by the importance of the results? Like, the idea makes sense: if different experts make different random errors (or make errors on random selections of inputs), then by averaging over them and taking what is most likely, you can partially filter out that random error. This doesn't seem surprising? Like, is there a real reason to use "transcend" over "exceed" other than it sounding cool? Not that that means they shouldn't have called it what they did. I don't know what the norms should be.

    Actually, the part that sounds more surprising to me is that they proved(?!) that "if just training on a randomly selected expert, without altering the temperature, then you can't possibly do better than the best expert in the training set." Surely that should depend on the (implicit) prior over functions used when learning the behavior from the experts? Maybe it relates to assuming infinitely much training data from each expert? If the only thing being changed is the distribution over points in testing, but the set of points present is the same, then I guess "it has to just be the average of what the experts say" makes sense.

    Can one give a version of that with temperature, then? I think so. If you have a probability distribution over a discrete set, say over the natural numbers, the Boltzmann distribution at temperature T is p_i = e^{z_i / T} / (\sum_j e^{z_j / T}). If we set T = 1 and take z_i = \ln(p_i), that recovers the distribution (p_i)_i, and then changing the temperature to T gives q_i = e^{\ln(p_i) / T} / (\sum_j e^{\ln(p_j) / T}) = (p_i)^{1/T} / (\sum_j (p_j)^{1/T}). (This assumes none of the p_i are zero, but after canceling the exp and the log that's no longer a problem for T > 0, so whatever.) Ok. Yes, you can alter the temperature of a distribution regardless of where you got it.
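The last point checks out numerically; a quick sketch, writing the rescaling in the common convention q_i ∝ p_i^{1/T} (equivalently: recover logits z_i = ln p_i and apply a softmax at temperature T):

```python
import numpy as np

rng = np.random.default_rng(0)

# Any distribution, wherever it came from (here: random, all entries > 0).
p = rng.dirichlet(np.ones(6))

T = 0.7
# Route 1: recover logits z_i = ln p_i, then softmax at temperature T.
z = np.log(p)
via_logits = np.exp(z / T) / np.exp(z / T).sum()

# Route 2: temper the probabilities directly: q_i proportional to p_i^(1/T).
via_powers = p ** (1 / T) / (p ** (1 / T)).sum()

print(np.allclose(via_logits, via_powers))  # True
```

Both routes agree because exp(ln(p_i) / T) = p_i^(1/T), so tempering never needs the original logits.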

  • @Tunadorable · 9 days ago

    I think this is one of those cases where it seems more obvious in hindsight than it actually was, and I say that because I’ve argued with multiple researchers in the past about this very question (they thought that the upper bound on performance would always be equal to the average of the experts)

  • @firstnamesurname6550 · 9 days ago

    Agreed, the paper's title tends toward 'hype' BS.

  • @jakeaustria5445 · 9 days ago

    Hello, Tunadorable. I kinda stochastically stumbled onto your channel through my chaotic meandering haha. I did not expect to find a channel that reviews recent research papers and condenses them for laypeople like me. I will be going to college soon and I chose Statistics as my course. I first thought about Computer Science, but after being second to last and later on in the top half at NOI, I kinda realized I might not be that good at it, so I just chose the field I'm more skilled at. Watching this channel kinda rekindled my desire to learn more CS. Anyways, both fields are pretty close.

    I am also learning statistics because I want to make discrete neural networks that run on binary rather than floating-point numbers. I tried many methods for this, including putting floating-point input data through the Gaussian inverse transform to convert it to uniform, then binarizing that uniform. I then used Bayesian statistics to predict the output bit from the past input bits. I encountered many problems, including overfitting and the poor quality of predictions by discrete models. I haven't given up yet; I tried a Voronoi-inspired approach using Hamming distance and some weights. Anyways, your channel is a godsend for me.

    On the topic of transcendence, I read this as a wisdom-of-crowds kind of thing. What I am surprised about is that increasing randomness actually decreases transcendence. I kinda expected it to be unimodal, with an optimum temperature where transcendence is maximized, just like in the Kelly criterion (optimum fraction of capital). Kinda surprising and a bit disappointing, but amazing nonetheless.

  • @ckq · 9 days ago

    Why would more randomness ever be good? If you're talking about the Kelly criterion, that's just about betting in accordance with the likelihood to maximize the log of your bankroll. This is different: it's a discrete scenario where one move is the objective best; it's impossible to average moves and get something better than either move.

  • @firstnamesurname6550 · 9 days ago

    @@ckq Randomness is used for setting the action space and filtering during development stages ... once you get an effective and resilient connectome ... randomness will come from the user base's prompts ... then, retrain with 'prompt-space' sets ... but if you want to optimize a task that has already been achieved by previous sets of data ... it is not required to improve the system with trash data at the root level ... but because the space of prompts and/or outer interference cannot be exhaustively tested ... then, bringing in some noise could help add resilience for high-uncertainty tasks.

  • @Tunadorable · 9 days ago

    interesting point about expecting it to be unimodal, now that you mention it that’s probably what I would have predicted too. in language modeling it definitely is unimodal to a certain extent (although that may be going away as the models get larger), since when you lower temperature too far the model tends to get stuck in nonsensical loops, repeating itself

  • @jakeaustria5445 · 1 day ago

    @@Tunadorable Oops, my usage of "randomness" is a bit wrong in the context of transcendence. The term in ML is temperature, right? From what I understand, you have a distribution of answers in the output vector. Low temperature means just selecting the one with the highest "likelihood" in the distribution, while high temperature means picking the answer from the output vector in a more randomized manner, effectively sampling the full averaged distribution.

  • @jakeaustria5445 · 1 day ago

    @@Tunadorable Sorry for that, I only have basic knowledge in ML.

  • @JaredFarrer · 10 days ago

    Well, what did you expect, making a gigantic compute node dedicated to next-word prediction? It starts to predict words better than you! Lol

  • @spencerfunk6697 · 10 days ago

    I can actively imagine this technique being “the thing” that puts us on the path toward achieving superintelligence.

  • @gunaysoni6792 · 9 days ago

    This would at best mimic a human expert, right? This doesn't give the model the ability to magically do things it hasn't seen in the training data. A 1000-rated player does play at a 1500 level some of the time. So this only lets you get the "best" out of the distribution but doesn't allow reasoning out of distribution.

  • @jonnylukejs · 10 days ago

    unless you're an idiot like me and you combined them all and made a bunch of your own and mix and match them at random!

  • @pensiveintrovert4318 · 9 days ago

    Ensembles have always been known to outperform individual models. Nothing different. LLMs are essentially ensembles.

  • @Tunadorable · 9 days ago

    very interesting relation here i like it

  • @sikunowlol · 10 days ago

    oi

  • @firstnamesurname6550 · 10 days ago

    Tesla engineers: Oops, let's clean all the data of females parking a vehicle from the data set ... 😛 Woke activists begin the protest at the company's door ... Tesla PR dude dressed as a girl ... The model is agnostic ... 'the data set required diversity as a necessary condition to transcend ...' '... No gender identity bias in the system ...'

  • @kbro6618 · 10 days ago

    Are you having a stroke?

  • @garazaadsf6921 · 9 days ago

    @@kbro6618 I don't think he's having a stroke, I think he's a boomer. Boomers use the (...) operator to separate their thoughts

  • @Tunadorable · 9 days ago

    Honestly it could be a bot or a boomer, you can't tell these days, lmao. Hi @firstnamesurname6550, I'm all for humor, but for humor to work you've gotta be funny, and you came across more as an ignorant jerk. This is a warning, but in the future if you leave another unnecessary, unrelated, and rude comment and I recognize your username, your comments will be muted from the channel. Also, the call may be coming from inside the house; I'd recommend you look up what insurance companies think about male vs female driving abilities

  • @firstnamesurname6550 · 9 days ago

    @@garazaadsf6921 How do millennials separate their thoughts? 😅

  • @firstnamesurname6550 · 9 days ago

    @@Tunadorable Crystal gen?

  • @lexer_ · 5 days ago

    This is a case of "if bad players didn't blunder, they would be less-bad players". Duh. Despite all the theoretical setup, there is one fundamental assumption here that seems flawed: that mistakes are inherently random and can be smoothed out. Any halfway-decent player can tell you this is not the case; as a better player you can strategically exploit the typical 1500-elo mistakes. So it makes sense that the model would include these mistakes in its function fitting. The core model of how mistakes happen is flawed. An even more obvious way to put it: they assume that playing a 1500-elo player is like playing Stockfish with enough random moves injected to bring its evaluation down to 1500 elo. Anyone can immediately tell you this is obviously not true.

  • @Tunadorable · 5 days ago

    I responded to your comment in my most recent video; this link takes you to the exact timestamp: kzread.info/dash/bejne/rHuArMWRqK7NdpM.html

  • @lexer_ · 5 days ago

    @@Tunadorable Great reply, thanks! I probably was too specific about the chess; I get that this is not about chess in particular. I just leaned into the chess example because I thought it illustrates the general flaw in the logic in an obvious way.

    I guess at the end of the day it's just a difference of opinion about the randomness of human thought. I fully agree that humans are terrible at reasoning, but I think the bad reasoning humans do is typically not random. I think human mistakes are actually very similar to the mistakes LLMs make, where they take incorrect shortcuts and make logical leaps that are not actually supported, based on intuitions about things we feel are somehow related. I believe you could consistently reproduce most human mistakes, and even cause thinking flaws to happen, just by exposing a human to a certain set of true information that nonetheless leads to repeatable wrong assumptions about a seemingly related question. If the errors were random, then, as this paper already points out, the model should mostly average out this random error, but it doesn't. Most of the mistakes humans make, and as a consequence LLMs make as well, are very typical and recognizable in my experience, not random at all. But there is no point arguing about a fundamental difference like that; I am not certain about it. I guess time will tell. Something interesting to watch for, for sure.

    The other thing about the knowledge silos feels like a much more promising avenue, and I fully agree that these models have the potential to unlock connections across isolated knowledge silos and already do. I think this is super important for continued progress in science and engineering.

  • @roozbehrazavi5427 · 7 days ago

    Absolutely worthless paper. It's a trend in the AI field to reinvent/rediscover things and publish them, only because the authors are affiliated with a fancy institute.

  • @hamzaumair7909 · 9 days ago

    Try speaking more naturally; your sentences tend to end with the same tone variation.

  • @Tunadorable · 9 days ago

    lmaoooo that’s a permanent result of my tism-adjacent neuro flavor or whatever convoluted name we’re using these days. always been told this and never ever ever been able to hear a difference in the way i speak vs everyone else

  • @epajarjestys9981 · 9 days ago

    worthless paper