Hindsight Experience Replay

Science & Technology

This is a quick presentation of the Hindsight Experience Replay idea and concepts, with a few references to recent papers.
The corresponding slides are available here:
pages.isir.upmc.fr/~sigaud/tea...

Comments: 10

  • @shivang27
    4 years ago

    Can we use the PPO algorithm with HER, given that HER only works with off-policy methods?

  • @OlivierSigaud
    5 months ago

    Sorry, I missed that question. I'm not aware of any paper trying this. You are right: in principle it should not work well, as the replayed data would be off-policy. But the on-policy vs. off-policy distinction is not clear-cut; some "off-policy" methods can struggle with data that is too far off-policy, and some on-policy methods may cope reasonably well with data that is only mildly off-policy. In any case, this requires introducing into PPO a replay memory containing previous trajectories, and that would not be exactly PPO anymore...
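As a minimal illustration of why HER pairs naturally with a replay buffer, here is a sketch of the "final" goal-relabelling strategy from the HER paper. The dict-based transition format, the function names, and the tolerance are assumptions made for the example, not code from the video or from the papers cited below; the point is that the relabelled transitions are off-policy by construction, so they are normally pushed into the replay buffer of an off-policy learner such as DDPG, TD3, or SAC.

    import numpy as np

    def sparse_reward(achieved_goal, desired_goal, tol=0.05):
        # 0 when the achieved goal is within tol of the desired one, -1 otherwise
        dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
        return 0.0 if dist < tol else -1.0

    def her_relabel_final(episode, compute_reward=sparse_reward):
        # episode: list of dicts with keys "obs", "action", "next_obs",
        # "achieved_goal", "desired_goal", "reward" (an assumed format).
        # The "final" strategy replaces the desired goal of every transition
        # with the goal actually achieved at the end of the episode, then
        # recomputes the reward accordingly.
        final_goal = episode[-1]["achieved_goal"]
        relabelled = []
        for t in episode:
            new_t = dict(t)
            new_t["desired_goal"] = final_goal
            new_t["reward"] = compute_reward(t["achieved_goal"], final_goal)
            relabelled.append(new_t)
        return relabelled

Both the original and the relabelled transitions would then be added to the replay buffer; making PPO consume such a buffer is precisely the modification discussed above.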

  • @OlivierSigaud
    5 months ago

    Actually, there are at least a few attempts: arxiv.org/pdf/2112.03798.pdf and ojs.aaai.org/index.php/AAAI/article/download/26099/25871 (I searched for "PPO off-policy replay")...

  • @cagedgandalf3472
    5 months ago

    Would it be unnecessary to use HER to solve the inverted pendulum problem?

  • @OlivierSigaud
    5 months ago

    To use HER, you need to specify goals: the goals you want to achieve, and the goals you achieve in practice. So you have to add this notion of goal to the inverted pendulum environment.
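To make this concrete, here is a rough sketch of what adding a goal to the pendulum could look like: a Gymnasium-style wrapper around Pendulum-v1 exposing the usual observation / achieved_goal / desired_goal dictionary. The class name, the goal-sampling range, the tolerance, and the sparse reward are illustrative assumptions, not code from the video (a real implementation would also redefine the observation space and handle angle wrap-around).

    import numpy as np
    import gymnasium as gym

    class GoalPendulum(gym.Wrapper):
        # Wraps Pendulum-v1 so that each episode targets a desired angle,
        # exposing achieved and desired goals as HER expects (a sketch only).

        def __init__(self, goal_range=np.pi, tol=0.1):
            super().__init__(gym.make("Pendulum-v1"))
            self.goal_range = goal_range   # goals sampled in [-goal_range, goal_range]
            self.tol = tol                 # angular tolerance for success
            self.goal = 0.0

        def _angle(self, obs):
            # Pendulum-v1 observations are [cos(theta), sin(theta), theta_dot]
            return float(np.arctan2(obs[1], obs[0]))

        def _make_obs(self, obs):
            return {"observation": obs,
                    "achieved_goal": np.array([self._angle(obs)]),
                    "desired_goal": np.array([self.goal])}

        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            self.goal = float(np.random.uniform(-self.goal_range, self.goal_range))
            return self._make_obs(obs), info

        def step(self, action):
            obs, _, terminated, truncated, info = self.env.step(action)
            # Sparse, goal-based reward instead of the native dense reward
            reward = 0.0 if abs(self._angle(obs) - self.goal) < self.tol else -1.0
            return self._make_obs(obs), reward, terminated, truncated, info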

  • @cagedgandalf3472
    5 months ago

    @@OlivierSigaud I am currently working on my thesis; I am an electronics engineering undergrad. I don't have much knowledge of machine learning, so I apologize if I have the wrong concepts. I understand what you are saying, but in the inverted pendulum environment the reward function is a cosine function. I was wondering whether that cosine function would constitute the "goal" in HER. From what I understand, HER sets a goal that changes; in this case, depending on the angle of the pendulum, the goal will move towards 0 degrees. Is this not the same as the cosine function, which increases the reward the closer the pendulum is to 0 degrees?

  • @OlivierSigaud
    5 months ago

    @@cagedgandalf3472 You could define different goals as different target angles, and the agent would have to reach a particular angle. But this is quite artificial. You should first try your favorite "standard RL" algorithm without trying to introduce a goal. There is already a lot to learn about standard RL before moving to goal-conditioned RL...

  • @cagedgandalf3472
    5 months ago

    @@OlivierSigaud Yes, I understand. I have tried TD3 and SAC for this problem and found that TD3 works better and learns faster, since it is a deterministic environment. Then I tried a paper about FORK (forward-looking actor; I don't know if I can send links on YouTube, but it is by Wei et al.), which trained faster than vanilla TD3. I then stumbled upon HER, and this is where I am now. So far, TD3-FORK has worked in simulation and I am at the real-world stage. I was just wondering if HER would work better than FORK, and I am planning to test it. Edit: I guess I am trying to find which algorithm is the most sample efficient.

  • @OlivierSigaud
    5 months ago

    @@cagedgandalf3472 OK, please keep me posted; I'd be glad to know the results of your attempts. You may have to design a curriculum with more and more difficult goals. You may have a look at these slides (after slide 22) for basic notions about GCRL (goal-conditioned RL): master-dac.isir.upmc.fr/rl/advanced.pdf
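One possible reading of the curriculum suggestion, as a hypothetical sketch (the class name, window size, and thresholds are made up for illustration, not taken from the slides): sample goal angles from a small window around the easy goals and widen it whenever the recent success rate is high enough.

    import numpy as np

    class GoalCurriculum:
        # Widens the goal-sampling window as the agent succeeds
        # (a hypothetical sketch of the curriculum idea).

        def __init__(self, start_range=0.2, max_range=np.pi, grow=1.2,
                     target_success=0.8, window=50):
            self.range = start_range
            self.max_range = max_range
            self.grow = grow
            self.target_success = target_success
            self.window = window
            self.recent = []   # rolling record of episode successes

        def sample_goal(self):
            return float(np.random.uniform(-self.range, self.range))

        def update(self, success):
            self.recent = (self.recent + [float(success)])[-self.window:]
            if len(self.recent) == self.window and np.mean(self.recent) > self.target_success:
                self.range = min(self.range * self.grow, self.max_range)
                self.recent = []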
