Accelerated Policy Learning with Parallel Differentiable Simulation


Jie Xu1,2     Viktor Makoviychuk1     Yashraj Narang1     Fabio Ramos1,3     Wojciech Matusik2     Animesh Garg1,4     Miles Macklin1
1 NVIDIA      2 Massachusetts Institute of Technology      3 University of Sydney      4 University of Toronto



Abstract


Deep reinforcement learning can generate complex control policies, but requires large amounts of training data to work effectively. Recent work has attempted to address this issue by leveraging differentiable simulators. However, inherent problems such as local minima and exploding/vanishing numerical gradients prevent these methods from being generally applied to control tasks with complex contact-rich dynamics, such as humanoid locomotion in classical RL benchmarks. In this work we present a high-performance differentiable simulator and a new policy learning algorithm (SHAC) that can effectively leverage simulation gradients, even in the presence of non-smoothness. Our learning algorithm alleviates problems with local minima through a smooth critic function, avoids vanishing/exploding gradients through a truncated learning window, and allows many physical environments to be run in parallel. We evaluate our method on classical RL control tasks, and show substantial improvements in sample efficiency and wall-clock time over state-of-the-art RL and differentiable simulation-based algorithms. In addition, we demonstrate the scalability of our method by applying it to the challenging high-dimensional problem of muscle-actuated locomotion with a large action space, achieving a greater than 17x reduction in training time over the best-performing established RL algorithm.
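To make the mechanism described in the abstract concrete, here is a minimal, self-contained sketch of a SHAC-style short-horizon actor update. This is not the authors' released implementation: the toy differentiable point-mass environment, the network sizes, and all names (ToyDiffEnv, actor_loss, mlp) are illustrative assumptions. The sketch only shows the actor objective, in which discounted rewards over a truncated window are backpropagated through the simulator and bootstrapped with a learned critic at the end of the window; in the actual method the critic is additionally fitted on value targets estimated from the same rollouts, and many GPU-simulated environments run in parallel.

```python
# Minimal SHAC-style actor update sketch (illustrative, not the authors' code).
# A toy differentiable point-mass "simulator" stands in for the differentiable
# physics engine so the script runs end to end.

import torch
import torch.nn as nn


class ToyDiffEnv:
    """Batched point mass on a line; reward = -squared distance to origin.
    All operations are plain tensor ops, so gradients flow through step()."""

    def __init__(self, num_envs=64, dt=0.05):
        self.dt = dt
        self.state = torch.randn(num_envs, 2)  # (position, velocity)

    def observe(self):
        return self.state

    def step(self, action):
        pos, vel = self.state[:, :1], self.state[:, 1:]
        vel = vel + self.dt * action            # differentiable dynamics
        pos = pos + self.dt * vel
        self.state = torch.cat([pos, vel], dim=1)
        reward = -(pos.squeeze(-1) ** 2)        # differentiable reward
        return self.state, reward


def actor_loss(env, actor, critic, horizon=16, gamma=0.99):
    """Short-horizon objective: discounted rewards over a truncated window,
    bootstrapped by a smooth critic at the window's end. Backpropagating
    through only `horizon` simulation steps avoids exploding/vanishing
    gradients, while the critic supplies the long-horizon signal."""
    obs = env.observe()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        obs, reward = env.step(actor(obs))
        ret = ret + discount * reward.mean()
        discount *= gamma
    ret = ret + discount * critic(obs).mean()   # terminal value closes the window
    return -ret                                  # maximize return


def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ELU()]
    return nn.Sequential(*layers[:-1])           # no activation on the output


if __name__ == "__main__":
    actor = mlp([2, 64, 64, 1])
    critic = mlp([2, 64, 64, 1])
    opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

    for it in range(200):
        env = ToyDiffEnv()                       # fresh batch of parallel rollouts
        loss = actor_loss(env, actor, critic)
        opt.zero_grad()
        loss.backward()                          # analytic gradients through the simulator
        opt.step()
        # The critic would be fitted separately on detached value targets
        # computed from the same rollouts; omitted here for brevity.
```

The key design point the sketch illustrates is that the policy gradient is computed analytically through the simulator itself rather than estimated from sampled returns, while the truncated window plus critic bootstrap keeps that gradient well-conditioned over long tasks.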


Paper


Accelerated Policy Learning with Parallel Differentiable Simulation
Jie Xu, Viktor Makoviychuk, Yashraj Narang, Fabio Ramos, Wojciech Matusik, Animesh Garg, Miles Macklin
International Conference on Learning Representations (ICLR) 2022
[Paper]  [Arxiv]  [Video]  [Code]  [Talk]  [Poster]  [BibTeX]


Video Demo





More Results



Training Speed Comparisons with Baseline Algorithms


Evaluation Problem 1: CartPole Swing Up + Balance

Episode 0
Episode 50 (36 seconds of training)
Episode 100 (72 seconds of training)
Episode 200 (2.5 minutes of training)
Episode 500 (6 minutes of training)

Evaluation Problem 2: Ant

Episode 0
Episode 200 (4 minutes of training)
Episode 400 (8 minutes of training)
Episode 800 (16 minutes of training)
Episode 2000 (40 minutes of training)

Evaluation Problem 3: Humanoid

Episode 0
Episode 200 (10.5 minutes of training)
Episode 400 (21 minutes of training)
Episode 800 (42 minutes of training)
Episode 2000 (105 minutes of training)

Evaluation Problem 4: Humanoid MTU

Episode 0
Episode 200 (8.5 minutes of training)
Episode 400 (17 minutes of training)
Episode 800 (34 minutes of training)
Episode 2000 (85 minutes of training)



Related Papers


An End-to-End Differentiable Framework for Contact-Aware Robot Design
Jie Xu, Tao Chen, Lara Zlokapa, Michael Foshey, Wojciech Matusik, Shinjiro Sueda, Pulkit Agrawal
Robotics: Science and Systems (RSS) 2021
[Project Page]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, Gavriel State
Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021
[Project Page]