What is Q-Learning (back to basics)

Science and technology

#qlearning #qstar #rlhf
What is Q-Learning and how does it work? A brief tour through the background of Q-Learning, Markov Decision Processes, Deep Q-Networks, and other basics necessary to understand Q* ;)
OUTLINE:
0:00 - Introduction
2:00 - Reinforcement Learning
7:00 - Q-Functions
19:00 - The Bellman Equation
26:00 - How to learn the Q-Function?
38:00 - Deep Q-Learning
42:30 - Summary
Paper: arxiv.org/abs/1312.5602
My old video on DQN: • [Classic] Playing Atar...
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 179

  • @raunaquepatra3966
    @raunaquepatra39666 ай бұрын

    this is how you leverage the hype like a true gentleman 😎

  • @MideoKuze
    @MideoKuze6 ай бұрын

    Yes! Jumping on the hype train like a sir. One upvote from me, you're welcome for the gold.

  • @EdFormer
    @EdFormer6 ай бұрын

    I have no time for the hype, but I have all the time in the world for a classic Yannic Kilcher paper explanation video

  • @guidaditi
    @guidaditi6 ай бұрын

    Thank Q!

  • @changtimwu
    @changtimwu6 ай бұрын

    Thanks for such a solid, fundamental introduction to Q-learning, especially at a time when many are really excited about Q-star but few seem to try to understand its basic principles.

  • @Alilinpow2
    @Alilinpow26 ай бұрын

    Thank you Yannic your style of surfing the hype is the best!!!

  • @OrdniformicRhetoric
    @OrdniformicRhetoric6 ай бұрын

    I would be very interested in seeing a series of paper/concept reviews such as this focusing on the state of the art in RL

  • @qwerty123443wifi
    @qwerty123443wifi6 ай бұрын

    Love these paper videos, the reason I subscribed to the channel :)

  • @K.F-R
    @K.F-R6 ай бұрын

    This was very informative. Thank you so much for sharing.

  • @agenticmark
    @agenticmark5 ай бұрын

    Another awesome video from you Yannic! Gold material on this channel.

  • @vorushin
    @vorushin6 ай бұрын

    18:00 In chess terms, 'Reason 1' can be likened to: 1) Choosing a1 means you won't capture any of your opponent's pieces. 2) Opting for a2 allows you to swiftly capture a substantial piece.

  • @michaelbondarenko4650
    @michaelbondarenko46506 ай бұрын

    Will this be a series?

  • @ceezar
    @ceezar6 ай бұрын

    I did deep q learning for my cs bachelors thesis way back. Thank you so much for reminding me of those memories.

  • @clray123
    @clray1236 ай бұрын

    of those terrible memories ;)

  • @abnormal010
    @abnormal0105 ай бұрын

    What are you currently doing?

  • @travisporco
    @travisporco6 ай бұрын

    thanks for posting this; good to see some real content

  • @ProblematicBitch
    @ProblematicBitch6 ай бұрын

    I need someone to upload the Q function to my brain so my life choices start making sense

  • @banknote501
    @banknote5016 ай бұрын

    Maybe just try to lower the epsilon to make less random choices in life?

  • @2ndfloorsongs
    @2ndfloorsongs6 ай бұрын

    Your brain comes preloaded with a Q function and it's following it. Make some popcorn and enjoy the show.

  • @DeltafangEX
    @DeltafangEX6 ай бұрын

    It's possible they'll make sense in retrospect as the most optimal path far in the future. Or it could be that your most optimal path will always suck from your perspective but in fact provides the least amount of suffering possible. So...look on the bright side?

  • @33markiss
    @33markiss6 ай бұрын

    @user-dt7px5xp6z That’s called Natural Selection, another algorithm of “nature”.

  • @draken5379
    @draken53796 ай бұрын

    A good example for what you were talking about just before the Bellman equation: Move B (10 reward) will help take a chess piece in the future, whereas Move A moves away from that line, or may even let the opponent take the piece, so the 'next move' the policy would want is no longer possible.

  • @maxbretschneider6521
    @maxbretschneider65213 ай бұрын

    By far the best video on the topic

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q6 ай бұрын

    With regards to that future discounting, it's not just that you'd "like" to have it right now. It's that it's more useful right now, and so $100 now is more useful than $100 tomorrow, if only because I could invest the $100 I get today and have $100 plus interest tomorrow. Economics formally defines these things with concepts like net present value.
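    To make the parallel concrete, here is a minimal sketch (plain Python, illustrative numbers only) of the discounted return used in RL next to a textbook present-value calculation:

        # Discounted return in RL: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
        def discounted_return(rewards, gamma=0.95):
            return sum((gamma ** t) * r for t, r in enumerate(rewards))

        # Present value of a single future payment: the same geometric weighting,
        # with gamma playing the role of 1 / (1 + interest_rate).
        def present_value(amount, years, interest_rate=0.05):
            return amount / ((1 + interest_rate) ** years)

        print(discounted_return([100, 100, 100]))               # 285.25: later rewards count for less
        print(present_value(100, years=1, interest_rate=0.05))  # ~95.24: $100 tomorrow is worth less today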

  • @nickd717
    @nickd7176 ай бұрын

    This is great. You’re a true wizard in explaining Q, and I love the anonymous look with the sunglasses. You’re a regular Q-anon shaman.

  • @matskjr5425
    @matskjr54256 ай бұрын

    By far the most effective way of learning. Hacking at the essence, in a chain of thought manner.

  • @Dron008
    @Dron0086 ай бұрын

    Thank you, great explanation!

  • @user-tg6lv6hv4r
    @user-tg6lv6hv4r6 ай бұрын

    I realize that I read this paper ten years ago. Now I'm ten years older omg.

  • @cezary_dmowski
    @cezary_dmowski6 ай бұрын

    perfect for my sunday. appreciated!

  • @Alberto_Cavalcante
    @Alberto_Cavalcante6 ай бұрын

    Thanks for this explanation!

  • @drdca8263
    @drdca82636 ай бұрын

    I will make sure to stay hydrated, thank you

  • @drhilm
    @drhilm6 ай бұрын

    Old paper review - yeh! we missed that.

  • @jackschultz23
    @jackschultz236 ай бұрын

    My dude, that point you mention at 45:05, right at the end, about having state and actions being the input is exactly the question I've been trying to find an answer to. To see and hear it mentioned twice but each time you said you're not going to talk about it felt like knife in heart. If you don't do a video on it, do you have papers that talk through how this has been done? Great stuff either way, able to learn a bunch.

  • @JuanColonna
    @JuanColonna6 ай бұрын

    Great explanation

  • @neocrz
    @neocrz6 ай бұрын

    Very informative.

  • @user9924
    @user99246 ай бұрын

    Thanks man for the explanation

  • @Seehart
    @Seehart6 ай бұрын

    15:30 You still tend to want a discount < 1 in things like chess with a terminal reward. All other things being equal, you want to win sooner than later. Otherwise, with discount=1, you might forego a mate in one if a different mate is within your horizon, and that could go on forever (or perhaps for 49 unnecessary moves). I use 0.9999 for that kind of scenario, which is sufficient.
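    A quick numerical check of that point (a sketch; the only assumption is a single terminal reward of 1 for winning):

        # Value today of a forced win that arrives n moves from now, terminal reward 1.
        def value_of_win_in(n_moves, gamma):
            return gamma ** n_moves

        for gamma in (1.0, 0.9999):
            print(gamma, value_of_win_in(1, gamma), value_of_win_in(50, gamma))
        # gamma = 1.0    -> 1.0 vs 1.0       (mate in 1 and mate in 50 look identical)
        # gamma = 0.9999 -> 0.9999 vs ~0.995 (the sooner mate is strictly preferred)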

  • @therainman7777
    @therainman77776 ай бұрын

    Good point, thanks for the info.

  • @EnricoGolfettoMasella
    @EnricoGolfettoMasella6 ай бұрын

    During your explanation, Dijkstra's algorithm came to mind. They say that this Q* can increase the processing needs some 1000 times. You check all the paths in your graph and choose the ideal one.

  • @ericbabich
    @ericbabich6 ай бұрын

    Maybe not, if you consider that a non-reward end means you have to run the whole process again, and perhaps the only way to reach a satisfactory answer is to employ a checking mechanism that reduces the chance of failure for some questions.

  • @notu483
    @notu4836 ай бұрын

    Yes, and A* is even better than Dijkstra for pathfinding.

  • @clray123
    @clray1236 ай бұрын

    And what pray tell might be the heuristic or reward function when it comes to next token generation? It seems it all hinges on the most important issue of first having to solve the problem which you are aiming to solve by your wonderful search algorithm.

  • @dr.mikeybee
    @dr.mikeybee6 ай бұрын

    Good job!

  • @tchlux
    @tchlux6 ай бұрын

    Yeah I guess that Q-star will run multiple completions for each prompt with the large language model and then model the cumulative probability of the next token over the different completions. To trim the search space they probably do one full response at a temperature of 0 (only pick the highest-likelihood next tokens), then pick the few places in the response where it was closest to picking a different token and explore the graph of alternative responses that way, similar to the greedy A-star search for a best path. Alternatively they could just generate a few responses with a small temperature. If they generate a bunch of completions that way then they could create a Q estimator to improve the selection of tokens at the current step of the response for longer time horizon "correctness". At runtime they could use that Q estimator and an A* approach (greedy after adding in heuristic) to pick next tokens, which encourages the model to "think ahead" better than current approaches. Without the ability to reassess final responses, and "go back and change its mind" (which would be a lot more computationally expensive), I suspect we'll still see lots of examples of the large language models being confidently wrong, but I guess we'll find out soon!
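    None of this is confirmed, but as a purely illustrative sketch of the kind of value-guided decoding described above (every name here is a hypothetical stand-in, not any real API):

        import heapq

        def value_guided_decode(language_model, q_estimator, prompt_tokens, beam=3, max_len=50):
            # Best-first search over partial completions, in the spirit of A*:
            # score = immediate likelihood + an estimate of long-horizon "correctness".
            # `language_model(tokens)` is assumed to return (token, logprob) candidates,
            # `q_estimator(tokens)` a scalar score for a partial completion.
            frontier = [(-q_estimator(prompt_tokens), prompt_tokens)]
            while frontier:
                neg_score, tokens = heapq.heappop(frontier)
                if tokens and (tokens[-1] == "<eos>" or len(tokens) >= max_len):
                    return tokens
                for token, logprob in language_model(tokens)[:beam]:
                    child = tokens + [token]
                    heapq.heappush(frontier, (-(logprob + q_estimator(child)), child))
            return prompt_tokens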

  • @yohanhamilton7149
    @yohanhamilton71496 ай бұрын

    Quote: 'I suspect we'll still see lots of examples of the large language models being confidently wrong.' It's like how we humans think twice about what we are going to say before speaking. It's always a good habit (though more computationally expensive) to do so. The same goes for an LLM: Q* might be mimicking that human habit by forecasting the reward of saying A instead of B using some "smart" heuristics (just like A*'s distance heuristic when deciding which state to explore next).

  • @clray123
    @clray1236 ай бұрын

    The problem with your "approach" is that you "forgot" to define the reward function.

  • @alexd.3905
    @alexd.39054 ай бұрын

    very good explanation video

  • @AncientSlugThrower
    @AncientSlugThrower6 ай бұрын

    Great video.

  • @MichaelScharf
    @MichaelScharf6 ай бұрын

    Great video

  • @Rizhiy13
    @Rizhiy136 ай бұрын

    40:50 Why limit the possible actions only to best and random? Why not sample according to something like softmax of Q?
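    Sampling proportionally to a softmax over the Q-values is indeed a standard alternative (Boltzmann exploration). A minimal sketch of both, assuming a 1-D array of Q-values for the current state:

        import numpy as np

        def epsilon_greedy(q_values, epsilon=0.1):
            # With probability epsilon pick a uniformly random action, otherwise the greedy one.
            if np.random.rand() < epsilon:
                return np.random.randint(len(q_values))
            return int(np.argmax(q_values))

        def boltzmann(q_values, temperature=1.0):
            # Sample each action with probability proportional to exp(Q / temperature).
            prefs = np.exp((np.asarray(q_values) - np.max(q_values)) / temperature)  # shift for stability
            return int(np.random.choice(len(q_values), p=prefs / prefs.sum()))

    One practical caveat: Boltzmann exploration is sensitive to the scale of the Q-values and to the temperature schedule, whereas epsilon-greedy is not, which is one reason epsilon-greedy is the common default.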

  • @sultanzabu
    @sultanzabu6 ай бұрын

    great explanation

  • @dreamphoenix
    @dreamphoenix6 ай бұрын

    Thank you.

  • @hilmiterzi3847
    @hilmiterzi38476 ай бұрын

    Nice explanation G

  • @jurischaber6935
    @jurischaber69356 ай бұрын

    Thanks again.😊

  • @oraz.
    @oraz.6 ай бұрын

    Is the update done at each step, or do you actually have to recurse to the end to get R?
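    For what it's worth, in plain Q-learning the update happens at every step; you never have to roll out to the end of the episode, because the future is approximated by the current estimate (bootstrapping). A tabular sketch, assuming a dict-based Q-table:

        def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
            # One-step Q-learning: target = r + gamma * max_a' Q(s', a'),
            # then move the current estimate a little toward that target.
            # (At a terminal next_state you would drop the bootstrap term.)
            best_next = max(Q.get((next_state, a), 0.0) for a in actions)
            target = reward + gamma * best_next
            current = Q.get((state, action), 0.0)
            Q[(state, action)] = current + alpha * (target - current)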

  • @luckyrand66
    @luckyrand666 ай бұрын

    nice video!

  • @2ndfloorsongs
    @2ndfloorsongs6 ай бұрын

    When things get hard for me to understand, I find myself blankly staring at your sunglasses. I like to tell myself it's some sort of behavioral adaptation that provides a survival advantage of some sort. After staring at your sunglasses for a few minutes, I find I can detach myself enough to get up and make popcorn. All this evolution stuff usually comes down to food.

  • @nisenobody8273
    @nisenobody82736 ай бұрын

    same

  • @clray123
    @clray1236 ай бұрын

    It seems you've already mastered your Q function, what else is there to learn?

  • @warsin8641
    @warsin86416 ай бұрын

    Ty

  • @Eric-eo1dp
    @Eric-eo1dp6 ай бұрын

    It's primal-dual optimization on neural networks. Currently, researchers use infinite-network theory to approach the global solution in neural networks. Primal-dual methods achieve the same goal by transforming the space of the neural network.

  • @pi5549
    @pi55496 ай бұрын

    Yannic, when you invent time-travel can you go back to 2015 and re-upload? This will save me from struggling through Sutton's book.

  • @fixfaxerify
    @fixfaxerify6 ай бұрын

    Hmm.. in classic chess algos you have board eval / positional analysis functions, should be useable as a reward function, no?

  • @visuality2541
    @visuality25416 ай бұрын

    Could you also explain the A* algorithm in detail?

  • @NelsLindahl
    @NelsLindahl6 ай бұрын

    My favorite videos are the ones where Yannic draws everywhere... please build that into the future bot Yannic constitution...

  • @jimmy21584
    @jimmy215846 ай бұрын

    Reminds me of the minmax algorithm that I used for a Game Boy Advance board game back in the day.

  • @skipintro9988
    @skipintro99886 ай бұрын

    Yannic is the best

  • @JohnSmith-he5xg
    @JohnSmith-he5xg6 ай бұрын

    I'm a big fan. I'm impressed by you being able to speak/draw this off the cuff, but it might be better next time to refer to the printed out equations. It got a bit messy.

  • @visuality2541
    @visuality25416 ай бұрын

    Lovely

  • @Halopend
    @Halopend6 ай бұрын

    Nice explanation. Very clear. It's only at the end, with gradient descent, where I got lost as to what the comparison is between. Total reward vs ____________? I normally think of gradient descent as y vs y', actual vs measured or current vs next, updating knowledge based on the diff. I'm not seeing the y' here. If I need a different analogy here, let me know.
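    The missing y' is the bootstrapped TD target: the network's own estimate of the future, r + gamma * max_a' Q(s', a'), usually computed with a frozen copy of the network. A minimal sketch of the DQN-style loss, assuming PyTorch and pre-built batch tensors:

        import torch
        import torch.nn.functional as F

        def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
            # y: the network's current prediction for the actions actually taken
            q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            # y': the TD target, with the bootstrap term cut off at episode ends
            with torch.no_grad():
                next_max = target_net(next_states).max(dim=1).values
                q_target = rewards + gamma * (1 - dones) * next_max
            # Gradient descent then pushes q_pred toward q_target
            return F.mse_loss(q_pred, q_target)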

  • @gianpierocea
    @gianpierocea6 ай бұрын

    Being overly pedantic, but in terms of notation at 24:27 you want argmax, not max: this is a policy so it should spit out an action a, right? Very clear exposition, nice video :)
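    Right, and the distinction is easy to state in code: the policy returns an action (argmax), while the value of a state is a number (max). A tiny tabular sketch:

        def greedy_policy(Q, state, actions):
            # pi(s) = argmax_a Q(s, a)  -> returns an action
            return max(actions, key=lambda a: Q[(state, a)])

        def state_value(Q, state, actions):
            # V(s) = max_a Q(s, a)      -> returns a number
            return max(Q[(state, a)] for a in actions)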

  • @JonathanYankovich
    @JonathanYankovich6 ай бұрын

    This is great, thank you. Have some engagement.

  • @KolTregaskes
    @KolTregaskes6 ай бұрын

    36:00 Yannic goes Geordie on us and starts repeating way aye over and over, hehe.

  • @idiomaxiom
    @idiomaxiom6 ай бұрын

    So ChatGPT would try to guess what my next question will be after its response and optimize for that, effectively learning to read my mind?

  • @EdanMeyer
    @EdanMeyer6 ай бұрын

    The timing on this is too good lmao

  • @davidbell304
    @davidbell3046 ай бұрын

    Hey Yannick. Thought you might like to know, Juergen Schmidhuber is claiming Q* as his invention 😀

  • @SLAM2977
    @SLAM29776 ай бұрын

    Teaching AGI already :)

  • @UCs6ktlulE5BEeb3vBBOu6DQ
    @UCs6ktlulE5BEeb3vBBOu6DQ6 ай бұрын

    👀

  • @wolffischer3597
    @wolffischer35976 ай бұрын

    Thanks a lot, really good video! What I am wondering: how exactly could that translate to an LLM? You said that the possible actions would depend on the token space. So how would the Q function then know that it had reached the final state with the highest reward possible? I understand how that works for chess (checkmate), but what about a natural language prompt that states some ambiguous problem or statement? Maybe that is why the rumours said this works for grade-school-level math problems, which is a very specific subset of the whole space out there (and it is still large). I can't yet imagine how to make this work properly for something like GPT-4 or Bard or..., especially not for a customer-grade solution.

  • @therainman7777
    @therainman77776 ай бұрын

    I think the idea is that you seed the whole process with human-provided feedback, that is given at the step-by-step level after instructing the model to “reason step by step.” That’s what the “Let’s Verify Step by Step” paper is all about. Rather than simply checking whether the model got the final answer right, humans grade every individual step of the model’s reasoning. A reward model is then trained on this human feedback data, and learns to mimic a human grader at assessing the quality of a reasoning step. Once you have this reward model, you can grade any string that is meant to be an expression of reasoning, and you can do this one token at a time, so that you can test out the search space in a tree-like fashion, and choose the optimal next token in terms of maximizing the expected value of the finished string. Does that make sense? You’re starting with (probably expert) human feedback as a seed, training a model to emulate such human feedback, and then using reinforcement learning to search the space and improve both the token selection AND the reward model that you initially trained on human feedback. The fact that you’re improving both the next-token prediction (the “policy”) AND the reward model (the “Q table”) at the same time is critical, as this is what makes this approach similar to self-play systems such as AlphaGo. Both sides are being optimized simultaneously, so that the model is teaching itself. This is how these systems can ultimately achieve superhuman performance, even though they were initially seeded with human data.
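    To make that pipeline a bit more tangible, here is a purely illustrative sketch of the "grade each reasoning step" idea (all names are hypothetical stand-ins; this mirrors the description above of the Let's Verify Step by Step setup, not any confirmed detail of Q*):

        def best_next_step(policy_model, step_reward_model, question, steps_so_far, n_candidates=8):
            # Sample several candidate next reasoning steps from the policy, score each partial
            # solution with a reward model trained on per-step human labels, keep the best one.
            candidates = [policy_model.sample(question, steps_so_far) for _ in range(n_candidates)]
            scored = [(step_reward_model.score(question, steps_so_far + [c]), c) for c in candidates]
            return max(scored, key=lambda pair: pair[0])[1]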

  • @clray123
    @clray1236 ай бұрын

    The answer is nobody knows, probably not even OpenAI's marketing team.

  • @wolffischer3597
    @wolffischer35976 ай бұрын

    @@therainman7777 hm, I sadly am not an expert on AI or LLMs, just an enthusiastic amateur. From my point of view it could make sense, but I still can't picture how you would scale that to large and unknown domains. And in the end it probably still is not "reasoning" but "just" recall of trained associations/patterns in a known domain, and as soon as you deviate from that domain your results start to get worse and worse... But let's see. I am not too unhappy if it takes more time until the singularity ;)

  • @therainman7777
    @therainman77776 ай бұрын

    @@wolffischer3597 No, the point of reinforcement learning is specifically that it is NOT just recalling patterns or learned associations that it saw in the training data. With reinforcement learning, the model is free to explore the _entire_ search space, meaning it can (and usually does) stumble onto solutions and methods that no human being has ever thought of before. This happened with AlphaGo, along with countless other RL systems.

  • @wolffischer3597
    @wolffischer35976 ай бұрын

    @@therainman7777 hm yes, but it is still Go, right? And it is still pattern matching, although highly complex patterns, combined with tree search and probably some temperature for randomisation..?

  • @adamrak7560
    @adamrak75606 ай бұрын

    it is shockingly simple, compared to how powerful it is at solving problems.

  • @lincolt
    @lincolt6 ай бұрын

    10:12 my favorite type of pie

  • @LostMekkaSoft
    @LostMekkaSoft5 ай бұрын

    Wow, turns out I accidentally invented Q-learning myself when I started university (sorry for the Schmidhuber vibes lol). I didn't have the mathematical background, but I knew how neural networks work in theory. The way I thought about it was this: suppose you have the complete state tree of a game, so it starts at the starting state and the tree contains all actions and all resulting states, and therefore also all the terminal states. I only know the reward for the terminal states, but I can play a random (or semi-random) game, and this gives me a path from the starting state to one of the terminal states. Then I can take the value of that terminal state and kind of "smear" it backwards, with the reward value of every state on the path getting nudged a bit in the direction of the known reward of the terminal state. And I imagined that a neural net could be trained in a way that organically "smears" all the useful known reward values from the terminal states backwards, so that after a bit of training time it would have a good reward value for any given state. When I came up with this I was super excited and started to implement the thing, but I wasn't that experienced yet, so every attempt to build my own neural network stuff just resulted in the weights diverging xDD But I'm super happy to know now that my idea was spot on at least ^^
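    That "smearing backwards" intuition maps almost one-to-one onto a Monte-Carlo-style value update. A minimal sketch of the idea as described (a dict-based value table is assumed):

        def backup_playout(values, visited_states, terminal_reward, step_size=0.1):
            # After playing one (semi-)random game, nudge the value of every state on the
            # path a little toward the final outcome; repeated playouts spread the terminal
            # reward backwards through the tree.
            for state in visited_states:
                old = values.get(state, 0.0)
                values[state] = old + step_size * (terminal_reward - old)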

  • @user-hw2bb9jy5l
    @user-hw2bb9jy5l6 ай бұрын

    Based simply on the name alone, and not having watched the video, I would assume the Q stands for quantum and the algorithm is positive-reward reinforcement, also based on a tree of thoughts that calculates the highest probability for the best outcome, and it would probably decide finally based on a principle like Occam's razor.

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q6 ай бұрын

    I think what's missing here is model regression and simulated playback (dreaming).

  • @sagetmaster4
    @sagetmaster46 ай бұрын

    I get your skepticism, but names mean something; programmer types especially tend to pick sensible names. Whether this thing is just conceptually similar or actually shares similarities in the architecture, I think at least one of these lines of speculation is pretty close to the real Q*.

  • @ohadgivaty2366
    @ohadgivaty23665 ай бұрын

    Such an algorithm could turn an LLM from something that simply answers questions into something that has a clear goal, like a salesman, or that convinces people to vote for a certain candidate.

  • @clray123
    @clray1236 ай бұрын

    14:44 Actually, the lack of certainty about the future is the only reason why we've evolved to be impatient and greedy about getting our rewards. If there were a guarantee that every future promise would be fulfilled, and that our life (and health and other circumstances related to the success of consuming the reward) would be extended so as to be exactly the same later as it is now, there would be no reason to hurry at all. P.S. This is also why "today or tomorrow" is an unconvincing example for time preference. Make it "today or in a hundred years" and everyone will understand and agree.

  • @jaymee_
    @jaymee_6 ай бұрын

    Ok so effectively it's just comparing two different policies, that might actually be the same policy but with a single check before taking the following step? Like playing Tetris knowing what the next piece is going to be?

  • @thorcook
    @thorcook6 ай бұрын

    Sort of... but I think it's actually _recursively_ comparing policies to 'itself' (the 'composed' or embedded [prior] policy) to iterate through steps. At least that's how Q [reinforcement] learning works.

  • @JinKee
    @JinKee6 ай бұрын

    I wonder if the star in Q* is a reference to A* pathfinding

  • @awillingham
    @awillingham6 ай бұрын

    You can build out a graph of states with actions connecting them, and then use the Q function as the heuristic you need as input to A*, and you can effectively search the problem space to find an optimal path to the solution (for your input Q function). I think this technique would let you efficiently search complex problem spaces
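    For the curious, a bare-bones sketch of that combination (illustrative only): a generic A* loop where the heuristic could be supplied by a learned Q-function, e.g. heuristic(s) = -max_a Q(s, a), turning "estimated remaining reward" into an optimistic remaining-cost estimate.

        import heapq
        from itertools import count

        def a_star(start, goal_test, successors, heuristic):
            # successors(state) is assumed to yield (action, next_state, step_cost) triples.
            tie = count()  # tie-breaker so the heap never compares states directly
            frontier = [(heuristic(start), 0.0, next(tie), start, [])]  # (f = g + h, g, tie, state, actions so far)
            seen = set()
            while frontier:
                _, g, _, state, path = heapq.heappop(frontier)
                if goal_test(state):
                    return path
                if state in seen:
                    continue
                seen.add(state)
                for action, next_state, step_cost in successors(state):
                    g_next = g + step_cost
                    heapq.heappush(frontier, (g_next + heuristic(next_state), g_next, next(tie), next_state, path + [action]))
            return None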

  • @oraz.
    @oraz.6 ай бұрын

    Many people are saying this!

  • @JinKee
    @JinKee6 ай бұрын

    @@awillingham winston churchill once said "you can always count on the united states to do the right thing, once they've tried everything else." Sounds like our government needs to implement Q*

  • @bebeperrunocanino2337
    @bebeperrunocanino23374 ай бұрын

    I learned the Q-learning algorithm with the help of an AI; the AI taught me how, and I was able to do it.

  • @keypey8256
    @keypey82566 ай бұрын

    I'm just at 26:09 and so far it has been just min-max for singleplayer. Edit: It's funny that I don't know much about machine learning but have already seen all of these ideas in other fields. This kind of shows that the ideas from papers make their way into other fields as well.

  • @abdelkaioumbouaicha
    @abdelkaioumbouaicha6 ай бұрын

    📝 Summary of Key Points:
    📌 Q-learning is a concept in reinforcement learning where an agent interacts with an environment, receiving observations and taking actions based on those observations.
    🧐 The Q function is used in Q-learning to predict the total reward that would be obtained if a proposed action is taken in a given state. It helps the agent make decisions about which actions to take.
    🚀 The Markov decision process assumes that observations are equivalent to states, and discounting future rewards is important in reinforcement learning.
    🚀 The Bellman equation describes the relationship between the Q value of a state-action pair and the immediate reward plus the discounted future reward.
    🚀 Q-learning can be used to estimate the Q function by iteratively updating the Q values based on observed rewards and future Q values.
    🚀 Neural networks, particularly in Deep Q-learning for playing Atari games, can be used in Q-learning. Experience replay, where transitions are stored and sampled to train the Q function, is also mentioned.
    💡 Additional Insights and Observations:
    💬 "The Q function predicts the total reward that would be obtained if a proposed action is taken in a given state."
    📊 No specific data or statistics were mentioned in the video.
    🌐 No specific references or sources were mentioned in the video.
    📣 Concluding Remarks:
    This video provides a clear introduction to Q-learning and its application in reinforcement learning. It explains the concept of the Q function, the Markov decision process, the Bellman equation, and the iterative process of updating Q values. The video also touches on the use of neural networks and experience replay in Q-learning. Overall, it provides a solid foundation for understanding Q-learning and its role in decision-making and learning optimal policies.
    Generated using Talkbud (Browser Extension)
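    Since the summary mentions experience replay, here is a minimal sketch of that component (illustrative, not the paper's exact implementation):

        import random
        from collections import deque

        class ReplayBuffer:
            # Store transitions and sample random minibatches, which breaks the correlation
            # between consecutive steps that otherwise destabilizes Q-learning with neural nets.
            def __init__(self, capacity=100_000):
                self.buffer = deque(maxlen=capacity)

            def push(self, state, action, reward, next_state, done):
                self.buffer.append((state, action, reward, next_state, done))

            def sample(self, batch_size=32):
                return random.sample(self.buffer, batch_size)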

  • @torikapotat977
    @torikapotat9775 ай бұрын

    tks u so much

  • @therealjezzyc6209
    @therealjezzyc62093 ай бұрын

    This looks a lot like dynamic programming and the Bellman Equation

  • @serta5727
    @serta57276 ай бұрын

    Q-ute star 🌟😊

  • @pensiveintrovert4318
    @pensiveintrovert43186 ай бұрын

    I am speculating that it is named after James Bond's nerd sidekick Q. As good as your speculation.

  • @EdFormer
    @EdFormer6 ай бұрын

    What speculation?

  • @2ndfloorsongs
    @2ndfloorsongs6 ай бұрын

    @@EdFormer Speculation as to what OpenAI meant by Q*.

  • @EdFormer
    @EdFormer6 ай бұрын

    @@2ndfloorsongs when did he speculate about what OpenAI meant by Q*?

  • @2ndfloorsongs
    @2ndfloorsongs6 ай бұрын

    @@EdFormer He alluded to it in the first few minutes. It was a continuation of his last video updating the AI debacle. That's pretty much the whole reason he was giving this tutorial on Q.

  • @EdFormer
    @EdFormer6 ай бұрын

    @@2ndfloorsongs You didn't sense the sarcasm, even after conditioning yourself on his views in the previous video? I.e. the one where he ridiculed those speculating that Q* is an AGI that combines Q-learning with A*, joked that it could just have come from someone holding the shift key and mashing the left-hand side of the keyboard, and called everything going on with OpenAI a clown car?

  • @kinwong8618
    @kinwong86186 ай бұрын

    I thought the discount factor is a constant.

  • @indikom
    @indikom6 ай бұрын

    A policy is a strategy for which action to choose; for example, you can choose to be more greedy.

  • @washedtoohot
    @washedtoohot6 ай бұрын

    Is this true? Iirc greediness is a parameter

  • @clray123
    @clray1236 ай бұрын

    Policy/strategy is a really stupid word for a function which produces an action given a state. But since the term was chosen so multiple decades ago, we have to suffer and live with that. "Action function" might have been less pompous and misleading, but in case you haven't noticed yet, AI people really like to pretend their inventions are smarter than actual, and this has remained true throughout history of the field.

  • @indikom
    @indikom6 ай бұрын

    @clray123 I don't agree. The term "policy" concisely describes a fundamental concept: the decision-making strategy of an agent.

  • @clray123
    @clray1236 ай бұрын

    @@indikom The problem is that "policy" / "strategy" in colloquial use are both much broader concepts, and are understood as some abstract considerations that GUIDE decision-making, not a definite mapping of which action to take given a particular state of affairs. Law makers who devise policies or managers who devise strategies do not continuously stalk every citizen/employee to prescribe them what to do at every possible decision point. But this is what the "policy" in AI accomplishes. So I repeat, this is a bad term which sows confusion through analogy to real-life uses of the same term. But theoretical sciences, including mathematics, are full of such weird misnomers, perhaps stemming from the fact that researchers who work in them have no clue about how real life operates beside them.

  • @dullyvampir83
    @dullyvampir836 ай бұрын

    So this wouldn't work well, if there is no immediate reward for a move like in chess?

  • @clray123
    @clray1236 ай бұрын

    No, the whole point is that it works even with no immediate reward like in chess (because of the discounted future reward component which kinda transports information about the final reward back across all the time steps preceding it). But having (additional and correct) immediate rewards/penalties along the path helps guide the algorithm to converge on the optimal solution faster.

  • @dullyvampir83
    @dullyvampir836 ай бұрын

    @@clray123 And how do you get these immediate rewards for chess?

  • @clray123
    @clray1236 ай бұрын

    @@dullyvampir83 Arbitrarily - for example, you could introduce an immediate penalty on each move, to limit the number of moves per game. Or you could get statistics from records of games of successful real-world players, providing pressure to avoid certain configurations of player pieces that occur more often in losers' games than in those of the winners.

  • @alleycatsphinx
    @alleycatsphinx5 ай бұрын

    You should really go the other direction with this video - instead of starting at succession, ask what a number is (binary enumeration) and then delve into how and why binary succession works. From there you could go into binary addition (perhaps looking into how carry works,) multiplication, division, exponentiation, etc... It isn't trivial that a shift in binary is the equivalent of multiplication by two - there's a good video in all this. : )

  • @mikebarnacle1469
    @mikebarnacle14696 ай бұрын

    The "call Magnus Carlsen" example is funny. I'm just imagining ChatGPT figured out long ago that it can call humans and write back what they say, and we had no idea this was happening all along and were actually talking to people through an intermediary. Would explain why I have been getting so many random calls asking for medical advice lately. I always just say go see a doctor.

  • @TheDukeGreat
    @TheDukeGreat6 ай бұрын

    Ah shit, here we go again

  • @Summersault666
    @Summersault6666 ай бұрын

    Q* = Bellman + A* search ?

  • @watcher8582
    @watcher85826 ай бұрын

    I've seen people mention A* search, but is there a hint for that? Already in Q-learning you have objects named Q*. I mean, in optimization, adding a star usually just means "the solution".

  • @Summersault666
    @Summersault6666 ай бұрын

    @@watcher8582 maybe changing beam search to A* search using reward as distance in a vector knowledge graph database instead of full reinforcement learning?

  • @seidtgeist
    @seidtgeist6 ай бұрын

    🧐What if Q* is a very believable Star (*!) Trek reference and, basically, the biggest and most Q-esque diversion troll ever? 🧐

  • @Henry_Okinawa
    @Henry_Okinawa6 ай бұрын

    I really think that all this drama around Altman is fictional, to advertise a new product they have prepared.

  • @clray123
    @clray1236 ай бұрын

    You got it wrong, the drama is real, the product is fictional.

  • @OperationDarkside
    @OperationDarkside6 ай бұрын

    I think, I got like 30% - 40%. I definitely lack pre-existing knowledge.

  • @garyyakamoto2648
    @garyyakamoto26482 ай бұрын

    Thanks. I wish you didn't have to go to such a low bass in your voice; it's like continuous drilling in the brain.

  • @14types
    @14types6 ай бұрын

    Is this a madman from a mental hospital who makes up formulas on the fly?

  • @drdca8263
    @drdca82636 ай бұрын

    He’s summarizing a well-known technique.

  • @amansinghal5908
    @amansinghal59083 ай бұрын

    Amazing! Are you open to constructive feedback?

  • @clray123
    @clray1236 ай бұрын

    Reinforcement learning is how most businesses are run. Investors to management: make us munnnnniiies, and up to you to figure out how. Management to employees: make us munnnniesssss and you go figure out how. Employees: .

  • @barni_7762
    @barni_77626 ай бұрын

    Nice

  • @alan2here
    @alan2here6 ай бұрын

    While I love GPT-4, remember that OpenAI thinks very highly of OpenAI; never underestimate this. "We are amazing, everything's changing, major but entirely vague advancements, give us research money!111"

  • @therainman7777
    @therainman77776 ай бұрын

    Their opinion of themselves is accurate and justified by their output. No one has shipped more groundbreaking AI developments than they have over the past few years, despite the fact that much larger and much richer companies (Google, Meta, etc.) are trying as hard as they can. They deserve every penny of research money they've gotten, and have clearly spent it prudently, given the results they've gotten are far better than those of other companies who have spent far more than them. It's so easy to criticize and try to impugn the motives or reputation of people who are actually productive and creative; it's much harder to produce and create yourself.

  • @clray123
    @clray1236 ай бұрын

    @@therainman7777 In case of OpenAI it's really hard to say whether the "ground-breaking developments" are based on brilliance or whether they have to do with having the first mover advantage (most pertinent training data to their application because of capturing the user base). In any case, we know that OpenAI did NOT invent the fundamental LLM algorithm (the transformer architecture) - and the famously cited transformer paper did not either. The cornerstone "attention" mechanism was proposed by Bahdanau from University of Bremen. So (as usual) you have to be careful where you assign credit for any ground-breaking inventions...

  • @therainman7777
    @therainman77776 ай бұрын

    @@clray123 No, you really don’t understand. I don’t mean to be condescending. But I’m an AI researcher and have been in this field for nearly 20 years. I did not ever claim OpenAI invented the Transformer architecture. Everyone always throws that out as a strawman/red herring. Literally no one is saying they invented the Transformer. What we are saying is that the whole world has known about Transformers for six years now, yet no one has managed to do what OpenAI has done. They got there first, both with GPT-4 and several of their other best-in-class models, and they consistently ship groundbreaking PRODUCTS, ahead of everyone else, while many of their competitors are still trying to ship a single useful product. Including companies like Google who have orders of magnitude more resources. I never said anything about who invented the Transformer. I said they’ve invented an incredibly useful PRODUCT that no one else has been able to match, and actually they’ve pulled that off multiple times now, despite being a relatively small firm for much of their existence. They also have invented a number of additional techniques and components, and figured out clever ways to assemble others together, to get the result that they got, which to this day no one else has come close to in terms of performance on benchmarks, consumer adoption, business sector adoption, or anything else.

  • @DanFrederiksen
    @DanFrederiksen6 ай бұрын

    If you have the formulas in print beforehand it's much faster to convey. This classic blackboard approach of writing it out during lecture is very time inefficient.

  • @clray123
    @clray1236 ай бұрын

    But you have to incorporate entertainment value into your reward function.

  • @DanFrederiksen
    @DanFrederiksen6 ай бұрын

    @@clray123 that's a little meta :)

  • @rogerc7960
    @rogerc79606 ай бұрын

    Elon: great, I'll use my time machine to go back to before Google bought DeepMind and invest in the start-up, and fork a copy of human reinforcement learning. Perfect for Tesla to learn how to drive...

  • @scorber23
    @scorber236 ай бұрын

    Q + .. Quantum Intelligence changes everything 💫 things are lining up / stacking up / leveling up

  • @hadsaadat8283
    @hadsaadat82836 ай бұрын

    Use the DARK MODE mf

  • @Pierluigi_Di_Lorenzo
    @Pierluigi_Di_Lorenzo6 ай бұрын

    Q*anon. Doesn't exist, including the letter about it.
