Policy Gradient Methods | Reinforcement Learning Part 6

The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
Policy Gradient Methods are among the most effective techniques in Reinforcement Learning. In this video, we'll motivate their design, observe their behavior and understand their background theory.
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Github: github.com/Duane321
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
SOURCES FOR THE FULL SERIES
[1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
[2] H. Hasselt, et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL RL Lect...
[3] J. Achiam. Spinning Up in Deep Reinforcement Learning, OpenAI, 2018
ADDITIONAL SOURCES FOR THIS VIDEO
[4] J. Achiam, Spinning Up in Deep Reinforcement Learning: Intro to Policy Optimization, OpenAI, 2018, spinningup.openai.com/en/late...
[5] D. Silver, Lecture 7: Policy Gradient Methods, DeepMind, 2015, • RL Course by David Sil...
TIMESTAMPS
0:00 Introduction
0:50 Basic Idea of Policy Gradient Methods
2:30 A Familiar Shape
4:23 Motivating the Update Rule
10:51 Fixing the Update Rule
12:55 Example: Windy Highway
16:47 A Problem with Naive PGMs
19:43 Reinforce with Baseline
21:42 The Policy Gradient Theorem
25:20 General Comments
28:02 Thanking The Sources
LINKS
Windy Highway: github.com/Duane321/mutual_in...
NOTES
[1] When motivating the update rule with the animation of protopoints and theta bars, I don't specify alpha. That's because the lengths of the gradient arrows can only be interpreted on a relative basis. Their absolute numeric values can't be deduced from the animation because there was some unmentioned scaling done to make the animation look natural. Mentioning alpha would have made this calculation possible to attempt, so I avoided it.
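For readers who want to connect the animation to code: below is a minimal, hypothetical sketch of the REINFORCE-style update theta <- theta + alpha * G * grad log pi(a | theta) on a toy 3-action problem. The reward values, noise scale, and alpha here are made up for illustration; they are not taken from the video.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # one logit per action
alpha = 0.1                                # step size; the video deliberately leaves alpha unspecified
true_rewards = np.array([1.0, 0.0, -1.0])  # made-up per-action rewards

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                   # sample an action from the current policy
    G = true_rewards[a] + rng.normal(scale=0.1)  # noisy return for that action
    grad_log_pi = -probs                         # gradient of log softmax: one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta = theta + alpha * G * grad_log_pi      # the REINFORCE update

print(softmax(theta))  # the policy should now strongly prefer action 0
```

The point the animation makes survives here: only the relative sizes of the updates matter for the final policy; rescaling alpha mostly changes how fast the logits move.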

Comments: 73

  • @maximilianpowers9785
    @maximilianpowers9785 1 year ago

    My RL exam is in 2 weeks, you’re a life saver. I’m studying at UCL and the lectures lack a bit of that much needed visual intuition!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Exactly what I'm going for. When I read the update rule, it didn't click until I ultimately landed on a visual like this. Happy it clicked for you too

  • @johnshim6727
    @johnshim6727 9 months ago

    This was a great video as a beginner in RL to grasp concepts. Appreciate your effort and time for making this!!

  • @stemfolk
    @stemfolk 1 year ago

    This is one of the best channels on the site by a long way. Very grateful that you never compromise on the quality of the videos. Excellent work!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Thank you - and I'm not slowing down!

  • @rajatkumar.j
    @rajatkumar.j 24 days ago

    Finally, after watching this 3 times I got the intuition of this method. Thank you for uploading a great series!

  • @shadowdragon2484
    @shadowdragon2484 1 month ago

    Genuinely such an amazing series its changed the way I look at optimization problems as a whole moving forward

  • @Mutual_Information
    @Mutual_Information 1 month ago

    Thank you for appreciating this one too. It's much less viewed

  • @derickd6150
    @derickd6150 1 year ago

    Wow I'm so lucky. You only upload every few months and I just watched your previous videos yesterday. Love the series!! Thank you so much!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    I intend to upload more frequently in fact. Just need to shorten the videos a bit. This one was a monster.

  • @quachthetruong
    @quachthetruong 1 year ago

    Your channel makes me love statistical probability and its applications more. You fully explain math without turning it into a tough academic lecture! Thank you so much!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Excellent - exactly the impact I'm hoping for

  • @asdf56790
    @asdf56790 7 months ago

    A huuuge thank you for making this series! It was extremely well explained, and it would've taken me many, many more hours to learn this from a book or other resources. It is a very dense course, so I had to spend quite some time rewatching, but it's absolutely worth it and I like the great coverage. Amazing!

  • @Mutual_Information
    @Mutual_Information 7 months ago

    And you are in exactly the circumstance I was aiming for. After I read the book, I thought.. damn that just takes way too long to learn. If I could lower the cost of learning, people would appreciate it, just like I would have. So I'm glad it worked for you!

  • @rolandbertin-johannet5270
    @rolandbertin-johannet5270 1 year ago

    So grateful for this channel, any topic you cover is understood 10x faster than through other media

  • @Mutual_Information
    @Mutual_Information 1 year ago

    That's what I'm going for - I'm here for the quick learners ;)

  • @chiwaiwan2484
    @chiwaiwan2484 6 months ago

    and 10x quicker than my fucking lectures

  • @joshithmurthy6209
    @joshithmurthy6209 1 year ago

    This video literally came a day before my test. Thanks for uploading before it; even if you had uploaded after my test, I would have watched it. These videos are so good.

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Excellent! Love to hear it

  • @bonettimauricio
    @bonettimauricio 6 months ago

    Really excited to go into this RL journey with you. I was reading the book and watching these lessons, now it is the end (hopefully just the beginning), thank you so much for this!

  • @Mutual_Information
    @Mutual_Information 6 months ago

    Happy to hear you consider it a journey, and you're welcome!

  • @wenkanglee9596
    @wenkanglee9596 6 months ago

    I just want to express my gratitude to you. As a total newbie to AI and ML, IMO this series might be the best-explained set of videos in RL. Please keep up the good work. :)

  • @Mutual_Information
    @Mutual_Information 6 months ago

    Thank you, especially when it's said on this video - part 6 of the RL series. This series took me a really long time and it's only appreciated by a small, studious bunch - so it's great to hear from them. Thanks again!

  • @DoGyKG
    @DoGyKG 10 months ago

    Damn thank you for making this video. The visualization of algorithm is beyond any other explanations

  • @Mutual_Information
    @Mutual_Information 10 months ago

    And thank you for watching this one! It took forever to create and is largely at the very end of the series, so not that many people make it to it

  • @awaisahmad5908
    @awaisahmad5908 2 months ago

    Thank You So Much. I wish we had teachers like You in our universities.

  • @dasyud
    @dasyud 3 months ago

    I was struggling to get into reading RL material because of the lack of intuition and this was exactly what I needed. Thanks a ton! I can now easily build upon the fundamentals you've taught me. I'm gonna binge watch every video on your channel since they are all on topics I find very interesting and want to learn about. I hope you put out more videos! Cheers! 🎉

  • @Mutual_Information
    @Mutual_Information 3 months ago

    Glad they're working for you. And yea I'm cooking a big one as we speak.

  • @user-bj8wg8vq8h
    @user-bj8wg8vq8h 10 months ago

    Very well done video, thank you

  • @siddharthbisht1287
    @siddharthbisht1287 1 year ago

    You are going to heaven sir, with the kind of work you are producing. The interesting thing is that one can watch these with popcorn, with a notebook, or listen while working on another task, and it just works. Your explanations are clear, simple and straightforward, which reflects your understanding. Also, thanks for sharing the sources; it genuinely helps a lot. Keep up the great work.

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Siddharth you're too nice! If I'm in heaven (hopefully in a long time), I'm sure I'll see you there

  • @onurrrrr77
    @onurrrrr77 5 months ago

    I am afraid of a man who watches this series with popcorn.

  • @user-fh7hj7du2f
    @user-fh7hj7du2f 5 months ago

    Thank you for explaining so well this complicated topic.

  • @mCoding
    @mCoding 1 year ago

    Great series! I have a question about the weighting of proto points. Do you do a simple distance weighted average over all proto points, and if so is there a variation that only weights K nearest neighbors? I ask because if the landscape is very non-monotonic, weights from far points might tug in the wrong direction even if locally the nearby proto points give good information, e.g. imagine a gradient landscape that is like a maze.

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Thank you! To answer your Q: in this very simple case, I'm doing a distance-weighted average over all protopoints, but there is a hyperparameter I didn't mention in the video, which scales distances prior to the conversion into normalized weights. The effect is: if the scale is very small, the model effectively becomes identical to K=1 nearest neighbors, so any state uses only the single closest protopoint to determine its action probabilities. In that case, the problematic tugging you mention doesn't exist. If the scale is very large, the model sees all protopoints as almost equally far away, and for any state the action probabilities are essentially a simple average of all proto-action-probabilities. So somewhere between these extremes, we balance the tugging problem against the benefit of averaging data from nearby states. In larger, high-dimensional problems, the way I've done things doesn't work well - we can't tile the space without blowing up the parameter count. So there are a variety of approaches: selecting protopoints that partition the regions where data is observed, reducing the original space into a more manageable latent space, or abandoning protopoints entirely and just using deep nets! In fact, I don't believe I've seen any large RL model that is nearest-neighbors based.
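    For concreteness, here's a hypothetical sketch of the distance-scaled weighting described above. The actual Windy Highway implementation is in the linked repo; the function names and the exponential weighting here are assumptions for illustration.

```python
import numpy as np

def proto_action_probs(state, proto_states, proto_probs, scale=1.0):
    """Blend each protopoint's action probabilities by scaled distance to `state`."""
    d = np.linalg.norm(proto_states - state, axis=1) / scale  # scaled distances
    w = np.exp(-d)            # closer protopoints get larger weight
    w = w / w.sum()           # normalize into weights
    return w @ proto_probs    # weighted average of the rows

# Three protopoints on a 1-D state space, each with probabilities over 2 actions.
proto_states = np.array([[0.0], [1.0], [2.0]])
proto_probs = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.1, 0.9]])

query = np.array([0.05])
near_first = proto_action_probs(query, proto_states, proto_probs, scale=0.01)  # ~K=1 nearest neighbor
blended = proto_action_probs(query, proto_states, proto_probs, scale=100.0)    # ~simple average
```

    With a tiny scale the result tracks the nearest protopoint's probabilities; with a huge scale it collapses toward the simple average over all protopoints, matching the two extremes described in the reply.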

  • @TallMonkey
    @TallMonkey 4 months ago

    Thank you man. You're the best. Really helped me study for my upcoming exam. Most importantly helping me understand the intuition behind it all

  • @Mutual_Information
    @Mutual_Information 4 months ago

    My goal exactly!

  • @mberoakoko24
    @mberoakoko24 1 year ago

    Aye , you are back. I'll come back to watch this with a notebook

  • @marcegger7411
    @marcegger7411 1 year ago

    My favorite youtube channel is back!!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Thank you Marc :)

  • @MathVisualProofs
    @MathVisualProofs 1 year ago

    👍So nicely done.

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Thank you MVP - that means something big coming from you

  • @akritiupreti6974
    @akritiupreti6974 6 months ago

    Work of art!

  • @dhinas9444
    @dhinas9444 1 month ago

    Thank you man for maxing out our mutual information!

  • @Mutual_Information
    @Mutual_Information 1 month ago

    YES! Someone finally said it! lol that's honestly exactly what I had in my mind when naming this silly channel

  • @timothytyree5211
    @timothytyree5211 1 year ago

    Encore! Encore! Thou art the mac daddy of RL! I will stay tuned! Couldst thou pretty please consider developing a follow up video on more sophisticated PPO methods?

  • @Mutual_Information
    @Mutual_Information 1 year ago

    PPO would be the next topic. I can't say that's the next thing on the menu, but if this series gathers some attention, I can come back with an encore in due time. Thanks for the love!

  • @maximechopin2600
    @maximechopin2600 5 months ago

    I wanted to ask how you make your animations, they are very clear and concise , thanks for the great content

  • @kimchi_taco
    @kimchi_taco 10 months ago

    salute!

  • @Mutual_Information
    @Mutual_Information 10 months ago

    Thanks for watching these more intense videos. This one took a long ass time!

  • @dhlee8594
    @dhlee8594 1 year ago

    What is your background current job? Your videos are of really high-quality and cover advanced topics

  • @Mutual_Information
    @Mutual_Information 1 year ago

    I used to be in quantitative finance. Now I'm a data scientist at Lyft. And thank you - the advanced topics are where the action is!

  • @zenchiassassin283
    @zenchiassassin283 1 year ago

    Hi, thanks a lot for your videos ! Do you plan to make some videos on other reinforcement learning policy gradient methods ?

  • @Mutual_Information
    @Mutual_Information 1 year ago

    RL videos aren't next in the queue. I'll be exploring some new categories. But eventually, I'd like to touch on PPO more directly, just because of its usefulness. That probably won't happen this year, though.

  • @5_inchc594
    @5_inchc594 1 year ago

    Thanks for the clear explanation. Could you please make a video on the PPO algorithm?

  • @Mutual_Information
    @Mutual_Information 1 year ago

    I don't currently have plans for it, but it would be my next follow-up in the RL series.

  • @5_inchc594
    @5_inchc594 1 year ago

    @@Mutual_Information Thanks! that would be great

  • @add-mt5xc
    @add-mt5xc 7 months ago

    How does one see that the objective (average reward) used in the policy gradient theorem is independent of the initial state? I think this is true as on the right-hand side, you are summing over states s. Is it the Markov assumption that lets you write the average reward in that manner such that it is independent of the initial state?

  • @Mutual_Information
    @Mutual_Information 7 months ago

    It's not that it's independent of the starting state; it's that it's true for any starting state. You'll still be able to improve the expected return even if the starting state is randomly set at the start of each episode.
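    For reference, the policy gradient theorem as stated in Sutton & Barto [1], where mu is the on-policy state distribution under pi - the sum over states weighted by mu is what makes the right-hand side not single out any particular starting state:

```latex
\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)
```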

  • @DRich222
    @DRich222 1 year ago

    Hah- Nice candid clip at the end.

  • @actualBIAS
    @actualBIAS 1 year ago

    Would love to add this to my playlists. Is there any chance to do it? Why is it even disabled? Great vids btw

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Wait.. what's disabled?? I don't think I've disabled any such thing on my end.

  • @GusTheWolfgang
    @GusTheWolfgang 11 months ago

    Why didn't you upload before my dissertation last year D:

  • @akhilezai
    @akhilezai 1 year ago

    Yaayyyy you're back!

  • @Mutual_Information
    @Mutual_Information 1 year ago

    Indeed. This one was 30 minutes long, hence it took forever to create. Next videos will be shorter/uploaded more frequently.

  • @moisesbessalle
    @moisesbessalle 1 month ago

    I think at @6:20 you meant "since we are creating 3 values out of 2 constraints" right?

  • @wilhem7206
    @wilhem7206 1 month ago

    The one constraint is theta1 + theta2 + theta3 = 0, so if you know two of the thetas the 3rd one is determined

  • @moisesbessalle
    @moisesbessalle 1 month ago

    @@wilhem7206 but there is another constraint which is that each p>=0
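    A quick check of the point being discussed: softmax maps 3 logits to 3 probabilities that satisfy p >= 0 automatically, and it is invariant to shifting all logits by a constant, so only 2 effective degrees of freedom remain. This is a hypothetical illustration, not code from the video.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())  # shift for numerical stability
    return e / e.sum()

theta = np.array([0.3, -1.2, 0.9])  # three arbitrary logits
p = softmax(theta)
# Softmax automatically yields a valid distribution: every p_i >= 0 and sum(p) == 1,
# so the nonnegativity constraint doesn't remove a degree of freedom.
p_shifted = softmax(theta + 5.0)
# Shifting every logit by the same constant leaves the probabilities unchanged,
# which is why 3 logits carry only 2 effective degrees of freedom.
```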

  • @wrjog23
    @wrjog23 1 year ago

    too much informatiooooon!!!

  • @zerotwo7319
    @zerotwo7319 16 days ago

    Man, I hate that this has nothing to do with neurons or anything biologically inspired. Great explanation for seeing what's really going on, but this has nothing to do with intelligence.

  • @TimL_
    @TimL_ 1 year ago

    Thank you.

  • @gravkint8376
    @gravkint8376 5 months ago

    Damn this video is helpful. So far I was only able to get a vague understanding of the topic with lots of time and effort. But this gives a whole new level of intuition. Thank you so much!

  • @Mutual_Information
    @Mutual_Information 5 months ago

    Exactly what I'm going for ;)