
EfficientZero: Mastering Atari Games with Limited Data (Machine Learning Research Paper Explained)

#efficientzero #muzero #atari
Reinforcement Learning methods are notoriously data-hungry. Notably, MuZero learns a latent world model purely from scalar feedback on its reward, value, and policy predictions, and therefore relies on scale to perform well; like most RL algorithms, it fails when presented with very little data. EfficientZero makes several improvements over MuZero that allow it to learn from astonishingly small amounts of data and outperform other methods by a large margin in the low-sample setting. This could become a staple algorithm for future RL research.
OUTLINE:
0:00 - Intro & Outline
2:30 - MuZero Recap
10:50 - EfficientZero improvements
14:15 - Self-Supervised consistency loss
17:50 - End-to-end prediction of the value prefix
20:40 - Model-based off-policy correction
25:45 - Experimental Results & Conclusion
Paper: arxiv.org/abs/...
Code: github.com/YeW...
Note: code not there yet as of release of this video
Abstract:
Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience and outperforms the state SAC in some tasks on the DMControl 100k benchmark. This is the first time an algorithm achieves super-human performance on Atari games with such little data. EfficientZero's performance is also close to DQN's performance at 200 million frames while we consume 500 times less data. EfficientZero's low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at this https URL. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community.
Authors: Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-...
KZread: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.c...
Minds: www.minds.com/...
Parler: parler.com/pro...
LinkedIn: / ykilcher
BiliBili: space.bilibili...
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribes...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 63

  • @YannicKilcher · 2 years ago

    OUTLINE:
    0:00 - Intro & Outline
    2:30 - MuZero Recap
    10:50 - EfficientZero improvements
    14:15 - Self-Supervised consistency loss
    17:50 - End-to-end prediction of the value prefix
    20:40 - Model-based off-policy correction
    25:45 - Experimental Results & Conclusion
    Paper: arxiv.org/abs/2111.00210
    Code: github.com/YeWR/EfficientZero
    Note: code not there yet as of release of this video

  • @mastercheater08 · 2 years ago

    Hi Yannic, the 100k training steps are actually 2h of playtime per game, not 2 days. More importantly though, you referred to the code being available. If you actually follow the given link, you will see that it is in fact not.

  • @joedalton77 · 2 years ago

    Sharp as always

  • @dylanloader3869 · 2 years ago

    The classic "Thank you for your attention, we will open-source the codebase later. Please leave your email address here. We will send you the email once we open-source it." - Posted 5 years ago. Papers should be retracted if the codebase isn't published within a reasonable amount of time post-publication.

  • @YannicKilcher · 2 years ago

    Thanks for correcting and noticing

  • @Frankthegravelrider · 2 years ago

    @@dylanloader3869 agree, honestly I wonder how many deep learning papers have massaged results.

  • @joshuasmith5782 · 2 years ago

    @@dylanloader3869 Isn't this a preprint?

  • @kev9220 · 2 years ago

    Trajectory Prediction would be also an amazing topic to cover! Thanks for this awesome video.

  • @howuhh8960 · 2 years ago

    He already did a video about the Decision Transformer (which is very similar to the Trajectory Transformer).

  • @Bvic3 · 2 years ago

    What I'd like to know is how reliable it is compared to DQN. DQN is monstrously hard to train; it depends on how many frames we take together and how many frames there are between time steps. Also, to get to real-world problems, we probably need to skip latent states. When we remember the world, we remember the memorable landmarks and how to transition between landmarks. Just like when we do math, we remember the important steps and between steps we use business-as-usual flow. This is a multi-speed/multi-scale way of thinking. This is how we manage to navigate problems with long-term goals. If I understand your video well, it's still trying to predict time step by time step.

  • @AcesseAcessoVip · 2 years ago

    Your channel is criminally underrated. Very good explanations.

  • @raphaels2103 · 2 years ago

    Underrated? How so?

  • @ssssssstssssssss · 2 years ago

    Engineering-oriented papers like this are a good thing, even though machine learning purists don't like them. But not emphasizing the engineering aspect in papers is unfortunate. Real-world practitioners need to know what works for different problems.

  • @MikkoRantalainen · 2 years ago

    I think machine learning should be more engineering-oriented anyway. All those algorithms require an obscene amount of computation, and an algorithm that better matches the actual hardware can be 100x faster with nothing else changed. This is because, e.g., data in the L1 cache can be fetched 50-100x faster than data in RAM. If the algorithm does an indirect memory reference using two data values in L1 vs. two data values in RAM, the speed difference will be at least 100x for that reason alone. And big-O analysis will usually still claim both algorithms are identical.

  • @ivanmochalov3102 · 2 years ago

    Usually I don't watch such videos, but this one is superb (due to the clear explanation).

  • @StefanausBC · 2 years ago

    Got it! The whole thing is yet another invention by Jürgen Schmidhuber, as it uses an LSTM o_O

  • @marcobiemann8770 · 2 years ago

    Actually, the idea of learning a world model goes back to a paper by Ha and Schmidhuber :)

  • @serta5727 · 2 years ago

    I can’t wait to implement Efficient Zero and try it for software testing

  • 2 years ago

    Reminds me a bit of a paper called curiosity-driven exploration, except there it was used only for exploration (the part where you compare the hidden state for the expected next step to the actual next step).

  • @herp_derpingson · 2 years ago

    17:30 These kinds of losses tend to be unstable. The neural network might simply learn to output a fixed vector, or vectors in a very small cluster, for all states to minimize this loss. So maybe this won't work for very complex games like Go.

  • @tresuvesdobles · 2 years ago

    As he mentions, Go does not require encoding a model of the world in any way, because you already know it perfectly! I don't know about the stability though; usually they have to resort to things like contrastive learning, and I am not sure if they are using any of that here. However, since there are other losses pushing the optimization, maybe this is not a problem in this particular case.

  • @YannicKilcher · 2 years ago

    True, I guess that's why MuZero left it out, because I would totally put that in if I were the MuZero author. But the loss does need to be balanced with the other losses, so maybe the consequence is just a very slippery hyperparameter to tune :)

  • @eelcohoogendoorn8044 · 2 years ago

    Did they repeat those ablations multiple times to check for repeatability, or was that already stretching their compute budget? It was not quite clear to me whether the fact that the improvements sometimes were not improvements was simply due to noise, and perhaps you'd see a positive expected benefit if you repeated that ablation experiment many times.

  • @MikkoRantalainen · 2 years ago

    Great explanation of EfficientZero! I also really liked that you explicitly compared the differences to AlphaZero.

  • @MikkoRantalainen · 2 years ago

    I would have expected that they would have tried to create some kind of estimated world rules (basically a conversion from observation to latent space) and then optimize that conversion with the reward, and at least at the start of training focus on testing actions that have the highest probability of a high positive or negative reward, because that would train the world rules faster. I think this would better match the way humans work. When you encountered an old computer game as a child, you tried every possible action first (what if I run towards the wall, will the movement wrap to the other side, or what happens?) before trying to find smaller actions. Without a known set of rules for the world, you could just assume any random set of rules and then try actions that rule out the maximum number of wrong states in your model at every timestep.
    I agree with the MuZero design that the reward is the true target you should optimize, but to avoid local maxima you have to survey a huge variance of possible actions and world states to figure out the edge cases and loopholes in the rules. Especially in the older games, max performance required exploiting all kinds of loopholes. Look at any modern speedrun of an older game and it's obvious that the players exploit design mistakes in the games.
    I think that EfficientZero should have the objective of first finding the exploitable rules and then maximizing the reward using those exploits. As a side effect, that would probably produce a very effective software-exploitation algorithm, too. As a tool for the programmer, it would highlight all the security issues in your code. As a tool for an attacker (and not publicly available) it would be a superior tool for attacking any system.

  • @evanwalters2117 · 2 years ago

    17:00 Thank you for explaining why the MuZero authors purposely left out supervising the hidden state. There are definitely trade-offs here, and it is clear their ideas are geared toward learning Atari fast. I can't see the supervised hidden state, and maybe the value prefix, helping with board games, and I also wonder if rerunning MCTS in reanalyze would be too slow without their C++ batched MCTS implementation, so these are things to keep in mind. These ideas are great additions as options to try, though, depending on your needs!

  • @jchen5803 · 2 years ago

    1. Reusing samples is the key, and supervision on hidden states is done through the SSL module. 2. It depends on the implementation. First, DeepMind did not release their implementation, and second, the current open-source efforts by the community (e.g. muzero-pytorch) are clearly not efficient enough.
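
For readers wondering what the value prefix mentioned above looks like in practice, here is a minimal PyTorch sketch of the idea as discussed in the video, not the authors' implementation; module names, dimensions, and the MSE loss are made up for illustration (the paper uses a categorical head). Instead of predicting each single-step reward, an LSTM runs over the unrolled latent states and is trained end-to-end to output the cumulative discounted reward up to each step.

```python
# Hypothetical sketch of end-to-end value-prefix prediction (illustrative only).
import torch
import torch.nn as nn

class ValuePrefixHead(nn.Module):
    def __init__(self, state_dim=256, lstm_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, 1)  # scalar prefix; the paper uses a categorical head

    def forward(self, hidden_states):
        # hidden_states: (batch, k_unroll, state_dim) from the k unrolled latent states
        h, _ = self.lstm(hidden_states)
        return self.out(h).squeeze(-1)     # (batch, k_unroll) predicted value prefixes

def value_prefix_targets(rewards, gamma=0.997):
    # rewards: (batch, k_unroll) observed rewards; target_t = sum_{i<=t} gamma^i * r_i
    discounts = gamma ** torch.arange(rewards.shape[1], dtype=rewards.dtype)
    return torch.cumsum(rewards * discounts, dim=1)

# usage sketch
head = ValuePrefixHead()
states = torch.randn(32, 5, 256)
rewards = torch.randn(32, 5)
loss = nn.functional.mse_loss(head(states), value_prefix_targets(rewards))
```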

  • @NoNTr1v1aL · 2 years ago

    Great video! May I please know where to get code for the experimental results or if there is code for model-based off-policy iteration?
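
The official code was not up at the time of the video, but the model-based off-policy correction can be sketched roughly as follows. This is one reading of the paper's idea, not the released implementation; the horizon schedule and all names are invented for illustration. The gist: for older (more off-policy) trajectories, the value target sums fewer real environment rewards and bootstraps earlier from a value re-estimated with the current model (e.g. the MCTS root value).

```python
# Hypothetical sketch of an off-policy-corrected value target (illustrative only).
import torch

def corrected_value_target(rewards, root_values, age, k=5, gamma=0.997, max_age=10_000):
    """
    rewards:     (T,) rewards stored with the trajectory; rewards[i] follows state i
    root_values: (T,) per-state values re-estimated with the *current* model
    age:         how many training steps ago the trajectory was collected
    k:           maximum TD horizon
    """
    # shrink the horizon l from k toward 1 as the data gets older (one possible schedule)
    l = max(1, int(round(k * (1.0 - min(age, max_age) / max_age))))
    targets = torch.zeros_like(rewards)
    T = rewards.shape[0]
    for t in range(T):
        h = min(l, T - 1 - t)                     # clip the horizon at the trajectory end
        disc = gamma ** torch.arange(h, dtype=rewards.dtype)
        targets[t] = (disc * rewards[t:t + h]).sum() + gamma ** h * root_values[t + h]
    return targets
```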

  • @Rhannmah · 2 years ago

    28:10 Well, is this per-game optimization tailored by a second AI? Because if it is, I don't see any problem with this approach. Also, did they test on Montezuma's Revenge? Not all Atari games are created equal; some are waaaay more complex than others.

  • @MikkoRantalainen · 2 years ago

    What would the input for such a second AI be? The name of the game followed by the best known latent state? Of course, you could always run two full AI systems in parallel and use a third system to select the next action based on the short-term success of either subsystem. Doing that would require 2-3x the computing power, and the interesting question is: could either subsystem give a more accurate answer with that extra 2-3x computing power applied to it alone? Given infinite processing power, solving these issues would be much easier. The computations needed are so expensive that this is more of an engineering optimization problem than a purely theoretical computational one.

  • @nobody_8_1 · 1 year ago

    EfficientZero is a beast.

  • @lilhabibi3783 · 2 years ago

    Section 4.1 describes their "Self-Supervised Consistency Loss", which is close to the WorldModel architecture [1], but where the V and M components are trained end-to-end with the addition of a projector. I find it weird to formulate this in a SimSiam framework given the existing WorldModel architecture. Also, can anyone explain the point of the shared projector? It does not seem to be described well in the paper.
    [1] Ha and Schmidhuber, "Recurrent World Models Facilitate Policy Evolution", 2018
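
On the shared projector: below is a rough sketch of how such a SimSiam-style consistency loss is typically wired up (illustrative dimensions and layer choices, not the paper's code). Both the dynamics-predicted next state and the encoded next observation pass through the same projector so the loss lives in a common projection space, and only the predicted branch gets an extra predictor head; together with the stop-gradient on the observation branch, this asymmetry is what discourages the collapse to a constant vector raised earlier in the thread.

```python
# Hypothetical sketch of a SimSiam-style consistency loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyLoss(nn.Module):
    def __init__(self, state_dim=256, proj_dim=1024, pred_dim=1024):
        super().__init__()
        # shared projector: applied to BOTH the predicted and the observed next state
        self.projector = nn.Sequential(
            nn.Linear(state_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # predictor: only on the branch coming from the dynamics model (the asymmetry)
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, pred_dim), nn.ReLU(), nn.Linear(pred_dim, proj_dim),
        )

    def forward(self, predicted_next_state, encoded_next_obs):
        p = self.predictor(self.projector(predicted_next_state))
        with torch.no_grad():                       # stop-gradient on the target branch
            z = self.projector(encoded_next_obs)
        return -F.cosine_similarity(p, z, dim=-1).mean()
```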

  • @G12GilbertProduction · 2 years ago

    Was it epistemic uncertainty, according to the structural textual semantic models in this language paradigm? Implementing that in neural network language models like CiT is strange and burns up all the resources and time.

  • @DistortedV12 · 2 years ago

    Can you do a video on Nando de Freitas' model delusions in sequence transformers? An interesting blend of causality and sequence models.

  • @freemind.d2714 · 2 years ago

    Doesn't Dreamer already do most of this???

  • @YannicKilcher · 2 years ago

    Not sure I recall correctly, but I don't think Dreamer has MCTS, which is one of the main components here. But yes, there are a lot of similarities.

  • @marcobiemann8770 · 2 years ago

    Another difference is that this paper uses a contrastive loss, whereas Dreamer minimises the KL divergence between the distributions

  • @freemind.d2714 · 2 years ago

    But man, I really wish we had some way to do this kind of research more analytically, like in the computer graphics or physics simulation fields... Must these things really feel like artistic design instead of scientific research? Sometimes we can only guess whether something is good or not, even after we test it.

  • @TheHerbert4321 · 2 years ago

    @@YannicKilcher Isn't Dreamer even more data efficient than this architecture?

  • @JTMoustache · 2 years ago

    Really confused by having the environment drawn on the left - that must be how the English do it.

  • @sanagnos · 2 years ago

    🙏 🙏

  • @billykotsos4642 · 2 years ago

    27:50 lol Freeway doesn't really care

  • @user-kf9tp2qv9j · 2 years ago

    100k data is only 2 hours, not 2 days.

  • @Keirp1 · 2 years ago

    This paper should be withdrawn from NeurIPS for lying about the number of seeds they ran (turns out they just ran one). There is literally a paper on how high the variance is for Atari100k. Also those reconstructions seem very broken.

  • @billykotsos4642 · 2 years ago

    Are you sure about this? It has Abbeel's name on it

  • @Keirp1 · 2 years ago

    Look at the tweet by agarwl_

  • @billykotsos4642 · 2 years ago

    @@Keirp1 liiiink ???? please???

  • @Keirp1 · 2 years ago

    @@billykotsos4642 I think YouTube won't let me post a link to Twitter.

  • @billykotsos4642 · 2 years ago

    @@Keirp1 Any other sources that confirm this? Anything else you can point to?

  • @guidoansem · 2 years ago

    algo

  • @billykotsos4642 · 2 years ago

    It's 2021. Atari games are old news. Researchers should up their game!

  • @doppelrutsch9540 · 2 years ago

    They're still a useful benchmark. Being able to hit superhuman performance in a human-comparable timeframe is kind of big news...

  • @Rhannmah · 2 years ago

    It's more about Atari games being in a grey legal area where they're kind of unofficially public domain. No one is going to go ballistic over using the games to benchmark AIs without their consent.

  • @binjianxin7830 · 2 years ago

    Vapnik still believes he can find the most important ML theory by experimenting with MNIST datasets!

  • @go00o87 · 2 years ago

    kzread.info/dash/bejne/gH53rrezm9GTo6Q.html got me somewhat angry. The whole point of 100k should be sample efficiency, so that one moves closer to real-world application. However, to me it seems like the benchmark is poorly constructed, as one can trade off sample efficiency against hyperparameter optimisation to some degree. In the business world, the total cost (total number of training steps) is what matters, together with performance, robustness, explainability, ... . But let's "just" plug in yet another loss (with hyperparameters), another network (with hyperparameters), etc., and it will all be good. Anyway, thanks for the video, I appreciate your explanations a lot ;)
