Figure 4:
TPG training curves, each normalized relative to DQN's score in the same game (100%) and random play (0%): (a) shows curves for the 27 games in which TPG ultimately exceeded the level of DQN under test conditions, and (b) shows curves for the 21 games in which TPG did not reach DQN's level during test. In several games, TPG began with random policies (generation 1) that already exceeded the level of DQN. Note that these are training scores averaged over 5 episodes of the game in question, and are thus not as robust as the DQN test scores used for normalization. Moreover, these initial policies were often degenerate. For example, in Centipede it is possible to score 12,890 by selecting the “up-right-fire” action in every frame. While completely uninteresting, this strategy exceeds the test scores reported for DQN (8,390) and for a professional human video game tester (11,963) (Mnih et al., 2015). Regardless of their starting point, TPG policies improve throughout evolution to become more responsive and interesting. Note also that in Video Pinball, TPG exceeded DQN's score during training but not under test. The curve for Montezuma's Revenge, a game in which neither algorithm scores any points, is not pictured.
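For concreteness, the normalization behind these curves can be expressed in a few lines of code. The following is a minimal sketch, not the authors' implementation; the function name normalize_score and the random-play baseline in the example are illustrative assumptions, while the Centipede scores come from the caption above.

```python
def normalize_score(agent_score, random_score, dqn_score):
    """Map a raw game score onto the caption's scale:
    random play -> 0%, DQN's test score -> 100%.
    """
    return 100.0 * (agent_score - random_score) / (dqn_score - random_score)

# Illustrative example for Centipede, using the scores quoted in the
# caption: the degenerate "up-right-fire" policy (12,890) and DQN's
# reported test score (8,390). The random-play baseline below is a
# placeholder; the actual value is reported in Mnih et al. (2015).
tpg_degenerate = 12_890
dqn_test = 8_390
random_play = 2_091  # assumed baseline, for illustration only

print(f"{normalize_score(tpg_degenerate, random_play, dqn_test):.1f}%")
# Prints a value above 100%: even this trivial policy exceeds DQN's
# test score on the caption's normalized scale.
```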
