
A scalable species-based genetic algorithm for reinforcement learning problems

Published online by Cambridge University Press:  19 September 2022

Anirudh Seth
Affiliation:
KTH, Brinellvägen 8, Stockholm 114 28, Sweden; Ericsson, Torshamnsgatan 23, Stockholm 164 83, Sweden; E-mail: aniset@kth.se
Alexandros Nikou
Affiliation:
Ericsson, Torshamnsgatan 23, Stockholm 164 83, Sweden; E-mails: alexandros.nikou@ericsson.com, marios.daoutis@ericsson.com
Marios Daoutis
Affiliation:
Ericsson, Torshamnsgatan 23, Stockholm 164 83, Sweden; E-mails: alexandros.nikou@ericsson.com, marios.daoutis@ericsson.com

Abstract

Reinforcement Learning (RL) methods often rely on gradient estimates to learn an optimal policy for control problems. These expensive computations result in long training times, poor convergence, and sample inefficiency when applied to real-world problems with large state and action spaces. Evolutionary Computation (EC)-based techniques offer a gradient-free apparatus for training deep neural networks on RL problems. In this work, we leverage the benefits of EC and propose Sp-GA, a novel variant of the genetic algorithm that uses a species-inspired weight initialization strategy and trains a population of deep neural networks, each estimating the Q-function of the RL problem. We also propose a memory-efficient encoding of a neural network that provides an intuitive mechanism for applying Gaussian mutations and single-point crossover. Results on Atari 2600 games show performance comparable to gradient-based algorithms such as Deep Q-Network (DQN) and Asynchronous Advantage Actor Critic (A3C), and to gradient-free algorithms such as Evolution Strategy (ES) and a simple Genetic Algorithm (GA), while requiring far fewer hyperparameters to train. The algorithm also improved certain Key Performance Indicators (KPIs) when applied to a Remote Electrical Tilt (RET) optimization task in the telecommunication domain.
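Based on the abstract and the captions of Figures 15–17, the proposed encoding appears to represent each network compactly by the random seeds that generated it, with Gaussian mutation ($\sigma_g$ as mutation power, $\tau_g$ as random seed) and single-point crossover applied directly in the encoded space. The following is a minimal, hedged sketch of that idea in Python/NumPy; the function names, the fixed parameter count, and the exact genome layout are illustrative assumptions, not the paper's reference implementation.

import numpy as np

NUM_WEIGHTS = 10_000  # assumed flattened size of the Q-network's parameter vector


def decode(genome, sigma=0.02):
    # Rebuild the flat weight vector theta from its compact encoding.
    # The genome stores only an initialization seed and the mutation seeds
    # applied so far, so memory grows with the number of generations rather
    # than with the number of weights.
    init_seed, mutation_seeds = genome
    theta = np.random.RandomState(init_seed).randn(NUM_WEIGHTS)
    for tau in mutation_seeds:  # tau_g: random seed used at generation g
        theta += sigma * np.random.RandomState(tau).randn(NUM_WEIGHTS)  # sigma: mutation power
    return theta


def mutate(genome, rng):
    # Gaussian mutation in encoded space: append a fresh random seed tau_g.
    init_seed, mutation_seeds = genome
    return (init_seed, mutation_seeds + [int(rng.integers(2**31 - 1))])


def single_point_crossover(parent_a, parent_b, rng):
    # Single-point crossover in encoded space: splice the two seed lists at a
    # random cut point (one plausible interpretation of Figure 17).
    seeds_a, seeds_b = parent_a[1], parent_b[1]
    cut = int(rng.integers(0, min(len(seeds_a), len(seeds_b)) + 1))
    return (parent_a[0], seeds_a[:cut] + seeds_b[cut:])


# Example usage
rng = np.random.default_rng(0)
parent = (42, [])            # species-specific initialization seed, no mutations yet
child = mutate(parent, rng)  # one generation of Gaussian mutation
weights = decode(child)      # flat parameter vector for the Q-network

In such a representation, mutation and crossover never materialize the full weight vector, which is presumably what keeps the per-individual memory footprint small.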

Information

Type
Research Article
Creative Commons
CC BY-NC-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the reused or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Figure 1. Interaction of an agent with the environment at a time step t. Image credits (Sutton & Barto, 2018)

Figure 2. Workflow for model-based learning techniques

Figure 3. GPI iterative process for optimal policy

Figure 4. Architecture for actor-critic methods (Sutton & Barto, 2018)

Figure 5. Activation of a single neuron in the neural network

Figure 6. Forward propagation in an artificial neural network with a single hidden layer

Figure 7. Backpropagation of gradients by application of the chain rule

Figure 8. Commonly used activation functions, $f(x)$, and their derivatives, $f'(x)$, for $x \in [-5, 5]$

Figure 9. Evolutionary computation has its roots in computer science, artificial intelligence, and evolutionary biology

Figure 10. Workflow of a simple genetic algorithm

Figure 11. Single-point crossover (top) and two-point crossover (bottom) between two parents to generate offspring for the next generation

Algorithm 1: Evolution strategy for RL by OpenAI (Salimans et al., 2017)

Figure 12. A simplified version of the direct encoding proposed in NEAT (Stanley & Miikkulainen, 2002). Mapping of a genotype, a neural network, to a phenotype, its encoding

Figure 13. Distribution of species in a population of size 200. Each species is a unique weight initialization strategy for neural networks

Algorithm 2: Species-based genetic algorithm (Sp-GA)

Figure 14. Distributed framework for Sp-GA

Figure 15. Proposed encoding of a neural network

Figure 16. Mutation of a neural network in parameter and encoded space. $\theta_{0:w}^{g-1}$ represent the parameters of the network at generation $g-1$. $\theta_{0:w}^{g}$ are obtained by applying a Gaussian operator with $\sigma_g$ as the mutation power and $\tau_g$ as the random seed

Table 1. Some key libraries used for the experiments

Figure 17. Crossover of two neural networks in encoded space

Figure 18. Neural network used for Atari 2600 games

Figure 19. Neural network used for the RET environment

Figure 20. The classical RL training loop

Figure 21. Game frames from 4 Atari 2600 games: beamrider, frostbite, qbert, spaceinvader

Figure 22. Original game frame (left) and preprocessed frame (right) using the first three preprocessing steps

Figure 23. Antenna downtilt, $\theta_{t,c}$, for cell c at time t. Source: Vannella et al. (2021)

Table 2. Hyperparameters used to train Sp-GA on Atari 2600 games

Table 3. Hyperparameters used to train Sp-GA on the RET environment

Table 4. Cumulative rewards achieved by DQN, A3C and Sp-GA on 6 Atari 2600 games

Figure 24. Elite model's score and population average achieved by Sp-GA on Atari 2600 games for two milestones (5 million and 25 million training frames)

Figure 25. Maximum episodic reward for DQN and A3C on Atari 2600 games for 5 million training frames. The transparent lines represent the actual values, and the solid ones represent the smoothed values

Table 5. Comparing the performance of Sp-GA with a simple genetic algorithm (GA) and evolution strategies (ES)

Table 6. The number of generations, the top elite's species, and the average size of the encoding after training Sp-GA for 25M frames

Figure 26. Total training time (including time for communication between workers) and average time per generation as a function of the number of CPUs

Figure 27. Parallel speedup as a function of the number of processors

Figure 28. Efficiency as a function of the number of processors

Figure 29. Comparison of sample efficiency on Atari 2600 games

Table 7. Average improvement of some KPIs provided by the environment with respect to a set baseline. A positive value is an indicator of a good policy

Figure 30. Reward KPI from the RET environment at each episode of training for DQN and Sp-GA. The red line highlights zero improvement

Table A.1. Specification of Atari 2600 games used for the experiments

Table A.2. Hyperparameters used to train DQN on the RET environment

Table A.3. Total training time (including communication between workers) and average time taken per generation as a function of the number of CPUs

Figure A.1. Metrics returned by the RET environment at each episode during training for Sp-GA

Figure A.2. Metrics returned by the RET environment at each episode during training for DQN

Figure A.3. Elite model's score and population average achieved by GA on Atari 2600 games

Figure A.4. Elite model's score and population average achieved by Sp-GA on Atari 2600 games

Figure A.5. Average episodic reward achieved by ES on Atari 2600 games

Figure A.6. Maximum and average episodic reward achieved by DQN on Atari 2600 games

Figure A.7. Maximum and average episodic reward achieved by A3C on Atari 2600 games