Actor Critic Algorithms

Reinforcement learning is hot right now! Policy gradients and deep Q-learning can only get us so far, but what if we used two networks to help train an AI instead of one? That's the idea behind actor critic algorithms. I'll explain how they work in this video using the 'Doom' shooting game as an example.
Code for this video:
github.com/llSourcell/actor_c...
i-Nickk's winning code:
github.com/I-NicKK/Tic-Tac-Toe
Vignesh's runner up code:
github.com/tj27-vkr/Q-learnin...
Taryn's Twitter:
/ tarynsouthern
More learning resources:
papers.nips.cc/paper/1786-act...
rll.berkeley.edu/deeprlcourse/...
web.mit.edu/jnt/www/Papers/J09...
mlg.eng.cam.ac.uk/rowan/files/...
mi.eng.cam.ac.uk/~mg436/Lectur...
Please Subscribe! And like. And comment. That's what keeps me going.
Want more inspiration & education? Connect with me:
Twitter: / sirajraval
Facebook: / sirajology
Join us in the Wizards Slack channel:
wizards.herokuapp.com/
And please support me on Patreon:
www.patreon.com/user?u=3191693
Instagram: / sirajraval
Signup for my newsletter for exciting updates in the field of AI:
goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: chatgptschool.io/
Sign up for my AI sports betting bot, WagerGPT! (500 spots available):
www.wagergpt.co

Comments: 114

  • @robertotomas (3 years ago)

    Wow, this is seriously a fantastic introduction motivating AC methods.

  • @sophieg.9272 (3 years ago)

    You saved my life with this video. Thanks! I have to write a paper that covers this topic, and I struggled for so long to understand it, but now it seems so easy.

  • @Ronnypetson (6 years ago)

    Siraj is definitely very important for the dissemination of AI knowledge. I myself owe Siraj many thanks for this incredible channel!!

  • @kushalpatel8939 (4 years ago)

    Amazing video. Nicely Explained.

  • @chicken6180 (6 years ago)

    Out of all the channels I'm subbed to, this is the only one I have notifs on, cuz it's good.

  • @SirajRaval (6 years ago)

    Thanks spark, also tell me what vid topic you'd love to see.

  • @VigneshKumar-xd7xi (6 years ago)

    Thanks for the recognition @Siraj. Looking forward to your upcoming works on the channel. A Halite 2 AI bot perhaps.

  • @davidm.johnston8994 (6 years ago)

    Very interesting video as usual, thank you! :-)

  • @unicornAGI (6 years ago)

    Hey Siraj! I got a chance to implement one of the NIPS 2017 papers, and I have selected the reinforcement learning field. How hard will it be, and what is the procedure for implementing the paper?

  • @cryptomustache9921 (5 years ago)

    Is there a specific reason this is being applied to Doom, or given time will it work on any FPS game? Does it train with the game rendering, or just the code running at super speed, able to play multiple games? Thanks for your videos.

  • @larryteslaspacexboringlawr739 (6 years ago)

    Thank you for the actor critic video.

  • @matthewdaly8879 (6 years ago)

    So is the actor's predicted best choice then optimized with gradient ascent based on the critic's Q values?

  • @vornamenachname906 (2 years ago)

    no.

  • @luck3949 (6 years ago)

    Hi Siraj! Can you please make a video on program synthesis? Please please please, I beg you! To me it seems like the straightest way to get a skynet-level AI, but it is so underhyped that I didn't even know the word until I googled the idea behind it. I have no idea why nobody talks about that topic. I have no idea why they don't use neural networks. It seems that AlphaGo suits that task almost perfectly (it is also a search in a tree), but I haven't heard about any revolution in that area.

  • @dewinmoonl (5 years ago)

    Program synthesis doesn't use AI because the pattern is too complicated and the data is too sparse. But if you want to watch synthesis, I stream on Twitch under "evanthebouncy".

  • @Belowzeroism (5 years ago)

    Creating programs requires AGI, which is far, far beyond our reach for now.

  • @vladimirblagojevic1950 (6 years ago)

    Can you please make a video about proximal policy optimization as well?

  • @NolePTR (6 years ago)

    The way AlphaZero did it, if I understand right, is that it critiques the current state, not the future state given an action. So all you have to put in is S to receive the fitness (and policy vector). It's more of a fitness value than a reward, due to context. This is possible since chess has a finite number of positions the pieces can be in. The best output from the policy network is simulated and then passed back through the NN. State transition predictions are actually hardcoded (it always returns the ACTUAL state that would occur given an action, not a prediction of the actual state from a simulate_move function). So if my understanding is right, this is used so that instead of hardcoding the state transitions for simulation, it uses an NN to predict the outcome state?

  • @jeffpeng1118 (3 years ago)

    How does the critic know what the action score is?

  • @spenhouet (6 years ago)

    Cool technique!

  • @SirajRaval (6 years ago)

    thanks Sebastian!

  • @alexlevine78 (6 years ago)

    Is it possible to use multiple agents? My game is a first person shooter and multiple agents are allies going against an enemy. Is using the same critic neural net for all agents, but a separate actor net per agent, possible? I want to increase efficiency and make it decentralized. Feel free to PM me. A collaborator might be useful.

  • @chiragshahckshhh9696 (6 years ago)

    Nice..!

  • @adrianjaoszewski2631 (6 years ago)

    Did anybody actually try to run the source code? I've seen the same code snippet in two different places and neither of them worked. Frankly, not only does it not work, it also has a lot of redundancy (many unused variables and errors) and typos that make the code behave incorrectly but are never spotted, because the update methods are dead code that is never called. Basically the whole example is doomed by the fact that it's just a single run through the environment, and it usually ends with the pendulum just hanging down. After fixing this it still does not work, because the update function is never called. If you call the update function at the end of the train method, it has runtime errors because of typos and wrong model use (trying to assign critic weights to the actor). And to be honest, even the neural nets are wrong: both have ReLUs as output layers, but the outputs need to be able to go negative (impossible with ReLU) and the Q-values should be mostly negative (most of the rewards are negative).
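
    A minimal sketch of the output-layer fix described above, assuming the Pendulum environment from the linked repo (the layer sizes and function names here are illustrative, not the video's code): give the critic a linear head so Q-values can go negative, and give the actor a tanh head scaled to the action bound instead of a ReLU.

    import tensorflow as tf
    from tensorflow.keras import layers

    state_dim, action_dim, action_bound = 3, 1, 2.0   # Pendulum-v1 shapes (assumed)

    def build_actor():
        s = layers.Input(shape=(state_dim,))
        h = layers.Dense(64, activation="relu")(s)
        a = layers.Dense(action_dim, activation="tanh")(h)       # output in [-1, 1], unlike ReLU
        a = layers.Lambda(lambda x: x * action_bound)(a)          # scale torque to [-2, 2]
        return tf.keras.Model(s, a)

    def build_critic():
        s = layers.Input(shape=(state_dim,))
        a = layers.Input(shape=(action_dim,))
        h = layers.Dense(64, activation="relu")(layers.Concatenate()([s, a]))
        q = layers.Dense(1, activation="linear")(h)               # linear head: Q can be negative
        return tf.keras.Model([s, a], q)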

  • @LunnarisLP (6 years ago)

    It's usually just sample code, because going through the whole code would often require explaining the libraries used and so on. Google did the same with their policy gradient video with TensorFlow :D

  • @user-ll7mt9wx1i (6 years ago)

    I love your videos, they are all helpful for me. But this video doesn't have subtitles, so it's difficult for me. T_T

  • @toxicdesire8811 (6 years ago)

    Are you in India right now? Because the upload time is different this time.

  • @ionmosnoi (6 years ago)

    the source code is not working, the target weights are not updated!

  • @the007apocalypse (3 years ago)

    Apparently code wasn't the only thing he plagiarised. "Imagine this as a playground with a kid (the “actor”) and her parent (the “critic”). The kid is looking around, exploring all the possible options in this environment, such as sliding up a slide, swinging on a swing, and pulling grass from the ground. The parent will look at the kid, and either criticize or compliment her based on what she did." towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69

  • @chaitanyayanamala845 (6 years ago)

    My virtual teacher Siraj

  • @dustinandrews89019 (6 years ago)

    Perfect timing. I am creating an AC on a toy grid-world problem and struggling with using the Q value to update the actor (output softmax((4,))). I'll check out the code.

  • @dustinandrews89019 (6 years ago)

    Siraj, it would be great if you could zoom in on how you use the gradients from the critic to update the actor. I know it's the chain rule, but a simplified example walk-through would be awesome.

  • @tomw4688 (3 years ago)

    He goes so fast. It's like he's talking to someone who already understands it.

  • @timothyquill889 (3 years ago)

    Think he's more interested in showing off his knowledge than actually helping anyone

  • @kaushikdr (3 years ago)

    Great video! One question: Why do we need a "model" to act as a critic? Don't we just need to maximize our reward? Also, how can we know if we have chosen the "best" action if we don't know all the rewards of an infinite input space? (Of course, in chess there is a finite input space.)

  • @chas7618 (2 years ago)

    The actor critic algorithm is a two part algorithm: it has both a policy model, which takes the actual action, and a value function that tells the policy model how good the action was. Applying RL to real-world problems means the action space is often continuous, and value based RL methods such as deep Q simply cannot function in highly continuous action spaces. Therefore we need policy gradient based approaches. To improve policy gradient methods further we need value functions. Hence we need a combination of both value iteration and policy gradient based approaches, which is why we need actor critic RL algorithms.
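
    As a rough illustration of that two-part setup, here is a minimal one-step advantage actor-critic update in Keras. This is a generic sketch, not the code from the video: the discrete softmax policy, layer sizes, and learning rates are all assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    n_states, n_actions, gamma = 4, 2, 0.99   # toy sizes, purely illustrative

    actor = tf.keras.Sequential([layers.Input(shape=(n_states,)),
                                 layers.Dense(32, activation="relu"),
                                 layers.Dense(n_actions, activation="softmax")])   # policy pi(a|s)
    critic = tf.keras.Sequential([layers.Input(shape=(n_states,)),
                                  layers.Dense(32, activation="relu"),
                                  layers.Dense(1)])                                # state value V(s)
    actor_opt = tf.keras.optimizers.Adam(1e-3)
    critic_opt = tf.keras.optimizers.Adam(1e-3)

    def train_step(state, action, reward, next_state, done):
        """One transition in, one gradient step each for actor and critic (numpy inputs)."""
        state, next_state = state[None, :], next_state[None, :]
        with tf.GradientTape(persistent=True) as tape:
            v = critic(state)[0, 0]
            v_next = critic(next_state)[0, 0]
            target = reward + gamma * v_next * (1.0 - float(done))   # critic's one-step target
            advantage = target - v                                   # was the action better than expected?
            critic_loss = (tf.stop_gradient(target) - v) ** 2        # move V(s) toward the target
            log_prob = tf.math.log(actor(state)[0, action] + 1e-8)
            actor_loss = -tf.stop_gradient(advantage) * log_prob     # policy gradient step
        actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                      actor.trainable_variables))
        critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                       critic.trainable_variables))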

  • @chas7618 (2 years ago)

    In RL there is constant uncertainty about whether we have found the best action to take at a particular state; this is the problem of exploitation and exploration. Optimizing the RL agent means tweaking the weights of the policy or value function network until we converge on the best possible actions to take at each state. Studying multi armed bandit problems teaches the exploration and exploitation problem in great detail.

  • @G12GilbertProduction (6 years ago)

    But how this Q-net archie network goes spinal?

  • @deepaks.m.6709 (6 years ago)

    Finally you've controlled your speed. Love you bro :)

  • @zakarie (6 years ago)

    Great

  • @tonycatman (6 years ago)

    I watched a demo from NVIDIA this week in which they played a John Williams type of music score. It was unbelievably good. It'll be interesting to see what people come up with. A new Christmas Carol ?

  • @SirajRaval (6 years ago)

    That's dope! Hans Zimmer AI next.

  • @dshoulders (6 years ago)

    Where can I find this demo?

  • @tonycatman (6 years ago)

    Here : kzread.info/dash/bejne/l5t-krKNe7TWZLg.html. Starts at about 02:00. I'm not sure how much licence the orchestra had.

  • @diegoantoniorosariopalomin4977 (6 years ago)

    So, learning from human preferences is an actor critic model?

  • @adamduvick (5 years ago)

    This video is just about this article: towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69

  • @himanshujat3658 (6 years ago)

    Wizard of this week, thank you siraj!!😇

  • @dustinandrews89019 (6 years ago)

    This method "Q-Prop" from earlier this year seems like an improvement on this A-C method, but I don't see much about it online. arxiv.org/pdf/1611.02247.pdf Shixiang Gu, Timothy Lillicrap , Zoubin Ghahramani, Richard E. Turner, Sergey Levine. Has it been overlooked or superseded?

  • @rajathshetty325 (6 years ago)

    I understood some of those words..

  • @cybrhckr (6 years ago)

    Is this just an oversimplification, or is this just Q-learning with multiprocessing?

  • @somekid338 (6 years ago)

    no, it works by replacing the advantage of a policy gradient method with an estimation of future rewards, given by the critic network in the form of q-values. Berkeley's deep rl bootcamp, lecture 4a, has a pretty good explanation of it.
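
    In symbols (a standard textbook formulation, not something shown in the video), the REINFORCE gradient weights the log-probability of each action by the sampled return, and the actor-critic variant replaces that return with the critic's Q estimate:

    \nabla_\theta J(\theta) \;\approx\; \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]
    \quad\longrightarrow\quad
    \nabla_\theta J(\theta) \;\approx\; \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)\big]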

  • @notaras1985 (6 years ago)

    I have a question please. When it learned to play chess by itself, was it given the piece and pawn movements? Or did it lack even that?

  • @onewhoraisesvoice (6 years ago)

    Yay!

  • @underlecht (1 year ago)

    Most interactive and most unclear/inaccurate video on actor-critic. Thank you!

  • @lordphu (5 years ago)

    The correct term for finding the derivative is to "differentiate", not "derive".

  • @allenday4273 (6 years ago)

    Good stuff!

  • @SLR_96 (4 years ago)

    Suggestion: In videos where you're trying to explain an idea or a method in a general form, try to simplify it as much as possible and don't go into much detail... Also definitely try examples and simple analogies as much as you can, because as we all know the process of learning works best with more examples

  • @FabianAmran (2 years ago)

    I agree

  • @davidmoser1103 (6 years ago)

    The linked source is for playing a pendulum game, not Doom, which is much more complex. Honestly, I don't think you ever wrote a bot for playing Doom; that's why you only show 5s of Doom being played. To prove me wrong, link the source code for the Doom bot.

  • @somekid338 (6 years ago)

    Gebregl, I believe Arthur Juliani made source code for a Doom bot using this method. I would recommend checking out his explanation instead.

  • @LunnarisLP (6 years ago)

    GJ Sherlock. Since Siraj is mainly making youtube tutorials for noobs like us, he probably doesn't code many major projects like the Doom one would be, which was probably created by a whole team, like most of those major projects. Not only that, but the Doom bot was probably trained for multiple days on really powerful machines. So GJ on spotting that he didn't code the Doom bot himself :D

  • @siriusblack9999 (6 years ago)

    but... how does the critic network learn what actions/states to give high q values and which to give low ones?

  • @toxicdesire8811 (6 years ago)

    Sirius Black, I think it will depend on the boundary conditions of the actions taken by the actor.

  • @siriusblack9999 (6 years ago)

    I meant more generally: what purpose does the critic have vs. just rewarding the actor directly with whatever you would otherwise reward the critic with? Or is the critic's only purpose to "interpolate" intermittent rewards? I.e. you have 1 reward every 50 generations, and the critic attempts to learn how the other 49 generations should be rewarded to get to that final reward? And if that IS the purpose, why not just use synthetic gradients instead? Or is this just another case of "let's give the same thing two different names just to confuse people", just like how "perceptron" and "neural network layer" sound like completely unrelated topics but are actually the exact same thing, except that you normally don't care about input gradients in a perceptron because it's only one layer (and you therefore normally don't implement them, even though you could and it would still be a perceptron, and you could then also use the same exact implementation as a hidden layer in a neural network)?

  • @neilslater8223 (6 years ago)

    In simple policy gradient methods, you would train the actor to maximise total return. But without a critic you cannot predict the return - you have to run the actor to the end of each episode before you can train it a single step. The critic, by *predicting* the final return on each step allows you to bootstrap and train the actor on each step. It is this bootstrapping process (from temporal difference learning approach) that makes Actor Critic a faster learner than, say REINFORCE (a pure policy gradient method).
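
    A tiny sketch of the difference described above, with made-up numbers (nothing here is from the video's code): REINFORCE has to wait for the episode to finish before it can compute the return it trains on, while an actor-critic bootstraps a target from the critic's prediction at every step.

    gamma = 0.99

    def monte_carlo_return(rewards, t):
        """REINFORCE-style target: the full discounted return from step t to the episode's end."""
        return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

    def td_target(reward, next_value, done):
        """Actor-critic target: a one-step bootstrap using the critic's estimate V(s')."""
        return reward + (0.0 if done else gamma * next_value)

    rewards = [0.0, 0.0, 0.0, 1.0]                # a sparse-reward episode
    print(monte_carlo_return(rewards, 0))         # 0.970299 -- needs the whole episode first
    print(td_target(rewards[0], 0.9, False))      # 0.891    -- usable immediately, given V(s') ~ 0.9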

  • @siriusblack9999 (6 years ago)

    so it's the exact same thing as a synthetic gradient

  • @sikor02 (6 years ago)

    I'm wondering the same: how is the critic being trained? I still can't figure it out. Looking at the code it seems like the critic predicts the Q value and then uses the fit function to ... fit what it predicted multiplied by the gamma factor? I can't understand this part.

  • @fabdenur (6 years ago)

    Hey Siraj, I'm a huge fan and watch the great majority of your videos. Having said that, let me repeat a bit of constructive criticism: you explain the concepts really well, but often only flash by the actual results. For instance, in this video there are only 5 seconds (from 8mins1secs to 8mins5secs) of the Doom bot playing. It would be much more satisfying if you showed it playing for let's say 15 or 20 seconds. This would only add 10 to 15 seconds to the length of the whole video, but the audience would get to appreciate the results a lot better. best and keep up the great work! :)

  • @davidmoser1103 (6 years ago)

    Yes, footage of the Doom bot initially and after some learning would be very interesting to see. But he didn't write a Doom bot; the code is for a simple pendulum game. Very disingenuous.

  • @fabdenur (6 years ago)

    Wow. He didn't make that very clear, did he? Not cool

  • @julienmercier7790 (5 years ago)

    He didn't write the Doom bot. That's the hard truth.

  • @rajroy2426 (3 years ago)

    Just saying, in RL you need to reward it if it wins so it knows what winning means.

  • @richardteubner7364 (6 years ago)

    This code has nothing to do with Doom.

  • @rishabhagarwal7540 (6 years ago)

    It would be helpful to include the relevant blog post in video description (in addition to github): towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69

  • @anteckningar (6 years ago)

    It feels like he is ripping off that blog post a bit too much...

  • @andreasv9472 (6 years ago)

    Richard Löwenström At least he should give it credit.

  • @davidmoser1103 (6 years ago)

    Not only did he not give proper credit (mentioning it in the video, and video description), the code linked is for a pendulum game. And people commenting on that code say it doesn't work, or not well. So, no doom bot to be found anywhere. Such a pity, he explains things so well, but then lies about the results.

  • @LemurDrengene (6 years ago)

    This is what he does in many of his videos. Sometimes I wonder if he even understands what he "teaches" or if he is just reading other people's work. It's down to the smallest detail, even the playground analogy and the controller with infinite buttons. It's disgusting to earn money like this off other people's work.

  • @Mirandorl (6 years ago)

    0:07 how many people checked for slack notifications

  • @silentgrove7670 (4 years ago)

    I am playing a game without a rule book or an end goal.

  • @jra5434 (6 years ago)

    I made some songs on Amper but I suck at connecting APIs and other things to python. I use spyder and always get errors when I try to connect them together.

  • @MrYashpaunikar (6 years ago)

    Are u in Delhi?

  • @shivashishsingh5915 (6 years ago)

    Yash Paunikar he was in Delhi in September

  • @jinxblaze (6 years ago)

    notification squad hit like

  • @vladomie (6 years ago)

    Wow! It appears that AIs now have that critical voice in their head like the one described in Taryn's song kzread.info/dash/bejne/mGGIz5OyiJmTcrw.html

  • @sarangs8441 (6 years ago)

    Are u in Delhi?

  • @vornamenachname906 (2 years ago)

    This is a bad explanation of why the critic model is important. The motivation for this second network was an issue: what if you get your reward only after a long series of steps, and you need to update all steps with this one reward? Maybe there were some good and some bad moves; you get a lot of noise if you apply, for example, only "win" and "lose" to all of these steps. The critic model helps you calculate the loss for every step, so you get a high loss at really bad steps that led to a loss and can say "nah, OK, you lost, but that move was not that bad".
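
    A toy illustration of that credit-assignment point, with made-up numbers (the value estimates below are hypothetical): blaming every move with the same terminal "lose" signal versus scoring each move with a per-step TD error computed from a critic's value estimates.

    gamma = 1.0
    rewards = [0.0, 0.0, 0.0, -1.0]            # only a terminal "you lost" signal
    values = [0.2, 0.1, -0.4, -0.8, 0.0]       # critic's V(s_t) for each visited state, terminal = 0

    naive_credit = [sum(rewards[t:]) for t in range(4)]                             # same blame for every move
    td_credit = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(4)]  # per-step TD error

    print(naive_credit)   # [-1.0, -1.0, -1.0, -1.0]
    print(td_credit)      # roughly [-0.1, -0.5, -0.4, -0.2] -> each move gets its own score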

  • @gautamjoshi3143 (6 years ago)

    Are you in India?

  • @Iskuhsama (6 years ago)

    hello, world

  • @ishantpundir9747 (6 years ago)

    Hey Siraj, I am Ishant. I am 16 and I have dropped out just to work on AI and robotics 24x7. You are a really big inspiration. When are you coming back to India? I would love to meet you.

  • @Suro_One (6 years ago)

    I can't form a better comment than "Amazing". Anyone agree with me? The high level description of the model seems simple, but it's very complex if you dive deeper. What are your preferred methods of learning things like this?

  • @Donaldo (6 years ago)

    sfx :/

  • @KunwarPratapSingh41951 (6 years ago)

    Zeroth comment.. btw Love for Siraj brother

  • @debadarshee (3 years ago)

    Rewarding actors to create AI tasks..

  • @daksh6752 (6 years ago)

    Very good explanation, but the code could really be improved.

  • @Kaixo (6 years ago)

    Isn't it easier to learn chess by evolution instead of CNNs? I just made Snake with evolution and it works better than when I did it with a neural network. The only problem I need to fix is that in the end all snakes just have the same tactic, but I think that'll be easily fixable. I'm now going to make a 4-in-a-row with evolution, I hope it works out!

  • @dippatel1739 (6 years ago)

    Kaixo Music, evolution is good but it's a bit problematic depending on the fitness function.

  • @MegaGippie (4 years ago)

    Dude, the explanation is awesome. I learned a lot about the topic. But the sounds you lay over the animation of nearly every image are annoying... This distracts me a lot.

  • @thunder852za (6 years ago)

    flow diagram to explain training rather than shit code

  • @danny-bw8tu (6 years ago)

    damn, the girl is hot.

  • @meeravalinawab9372 (6 years ago)

    First comment

  • @user-hf3fu2xt2j (3 years ago)

    Why is it that every time it comes to some RL algorithm, the concept is predictable af?