Exploration is a key challenge in Reinforcement Learning (RL), especially in long-horizon, deceptive and sparse-reward environments. For such applications, population-based approaches have proven effective. Methods such as Quality-Diversity (QD) address this challenge by encouraging novel solutions and producing a diversity of behaviours. However, these methods are driven either by undirected sampling (i.e. mutations) or by approximated gradients (i.e. Evolution Strategies) in the parameter space, which makes them highly sample-inefficient. In this paper, we propose Dynamics-Aware QD-Ext (DA-QD-ext) and Gradient and Dynamics Aware QD (GDA-QD), two model-based Quality-Diversity approaches. They extend existing QD methods to use gradients for efficient exploitation and to leverage perturbations in imagination for efficient exploration. Our approach takes advantage of the effectiveness of QD algorithms as data generators to train deep models, and uses these models to learn diverse and high-performing populations. We demonstrate that the proposed methods outperform baseline RL approaches on tasks with deceptive rewards, and that they maintain the divergent search capabilities of QD approaches while exceeding their performance by roughly 1.5 times and reaching the same results with 5 times fewer samples.
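
To make concrete the kind of mutation-driven search that the abstract describes as sample-inefficient, here is a minimal sketch of a generic MAP-Elites-style QD loop. It is not the DA-QD-ext or GDA-QD method; the toy `evaluate` function, the two-dimensional behaviour descriptor, the grid resolution and all parameter names are assumptions made purely for illustration.

```python
import random


def evaluate(params):
    """Toy task (assumption): fitness is the negative squared norm,
    and the behaviour descriptor is the first two parameters."""
    fitness = -sum(p * p for p in params)
    descriptor = (params[0], params[1])
    return fitness, descriptor


def cell_of(descriptor, bins=10, low=-1.0, high=1.0):
    """Map a descriptor to a discrete grid cell, clipping to [low, high]."""
    index = []
    for d in descriptor:
        clipped = min(max(d, low), high)
        i = int((clipped - low) / (high - low) * bins)
        index.append(min(i, bins - 1))
    return tuple(index)


def map_elites(iterations=2000, dim=4, sigma=0.1, seed=0):
    """Generic MAP-Elites-style loop with undirected Gaussian mutations."""
    rng = random.Random(seed)
    archive = {}  # grid cell -> (fitness, params) of the best solution seen
    for _ in range(iterations):
        if archive:
            # Undirected sampling: perturb a randomly chosen elite.
            _, parent = rng.choice(list(archive.values()))
            child = [p + rng.gauss(0.0, sigma) for p in parent]
        else:
            # Bootstrap the archive with a random solution.
            child = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Every candidate costs one full evaluation in the environment.
        fitness, descriptor = evaluate(child)
        cell = cell_of(descriptor)
        # Keep the child only if its cell is empty or it beats the elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive


if __name__ == "__main__":
    archive = map_elites()
    print(f"filled cells: {len(archive)}")
```

In this sketch each candidate consumes one real evaluation and the mutation ignores any information about the task, which is the inefficiency that gradient-based and model-based variants aim to reduce.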

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.