Adversarial agent-learning for cybersecurity: a comparison of algorithms

Published online by Cambridge University Press: 06 March 2023

Alexander Shashkov
Affiliation:
Williams College, Williamstown, MA 01267, USA; e-mail: aes7@williams.edu
Erik Hemberg
Affiliation:
MIT CSAIL, Cambridge, MA 02139, USA; e-mail: hembergerik@csail.mit.edu
Miguel Tulla
Affiliation:
MIT CSAIL, Cambridge, MA 02139, USA; e-mail: mtulla@mit.edu
Una-May O’Reilly
Affiliation:
MIT CSAIL, Cambridge, MA 02139, USA; e-mail: unamay@csail.mit.edu

Abstract

We investigate artificial intelligence and machine learning methods for optimizing the adversarial behavior of agents in cybersecurity simulations. Our cybersecurity simulations integrate the modeling of agents launching Advanced Persistent Threats (APTs) with the modeling of agents using detection and mitigation mechanisms against APTs, simulating how attacks and defenses coevolve. The simulations and machine learning are used to search for optimal agent behaviors. The central question is: under what circumstances is one training method more advantageous than another? We adapt and compare a variety of deep reinforcement learning (DRL), evolutionary strategies (ES), and Monte Carlo Tree Search (MCTS) methods within Connect 4, a baseline game environment; on SNAPT, a simulation supporting a simple APT threat model; and on CyberBattleSim, an open-source cybersecurity simulation. Our results show that attackers trained by DRL and ES algorithms, as well as by the two in alternation, are able to effectively choose complex exploits that thwart a defense. The algorithm that combines DRL and ES achieves the best comparative performance when attackers and defenders are trained simultaneously, rather than when each is trained against a non-learning counterpart.

Information

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. The SNAPT setup used for training with three nodes. Each node contains the data triple (s, v, p): the security state (s), the value (v), and the exploit probability (p).
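
For concreteness, the per-node data can be sketched as a small structure. The following Python sketch is illustrative only: the field names and placeholder numbers are ours, not the paper's; only the triple (s, v, p) is specified by Figure 1.

```python
from dataclasses import dataclass

# A minimal sketch of the per-node data in SNAPT as described in Figure 1.
@dataclass
class SnaptNode:
    security_state: int   # s: current security state of the node
    value: float          # v: reward value for compromising the node
    exploit_prob: float   # p: probability that an exploit on this node succeeds

# Example: a three-node setup like Figure 1 could be encoded as a list of
# SnaptNode instances (the numbers here are placeholders, not the paper's).
network = [
    SnaptNode(security_state=0, value=10.0, exploit_prob=0.5),
    SnaptNode(security_state=0, value=20.0, exploit_prob=0.3),
    SnaptNode(security_state=0, value=50.0, exploit_prob=0.1),
]
```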

Figure 2. The network topology of a CyberBattleSim environment. Nodes represent devices and edges represent potential exploits sourced at the tail of the arrow and targeted at the head. The label of each node is its type. At the start of the simulation, the attacker has control of only the ‘Start’ node and needs to control the ‘Goal’ node in order to receive the full reward.
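
The topology can be viewed as a directed graph over devices. The sketch below, using networkx, is a hypothetical three-edge example of the structure Figure 2 describes; the intermediate node names are placeholders.

```python
import networkx as nx

# A sketch of a Figure 2-style topology as a directed graph: nodes are devices,
# and a directed edge (u, v) is a potential exploit sourced at u targeting v.
G = nx.DiGraph()
G.add_edges_from([
    ("Start", "Workstation"),
    ("Workstation", "Server"),
    ("Server", "Goal"),
])

# The attacker initially controls only 'Start' and receives the full reward
# once it controls 'Goal'.
owned = {"Start"}
reachable_exploits = [(u, v) for u, v in G.edges if u in owned]
print(reachable_exploits)  # [('Start', 'Workstation')]
```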

Figure 3. The neural network (NN) architecture used for the defender. When choosing an action, the actor/policy NN picks a node T to reimage or to leave the network unchanged (this can be seen as picking a ‘null’ node). The value is chosen separately by the critic/value NN.
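
A minimal PyTorch sketch of this actor-critic layout is given below, assuming a flat observation vector of size obs_dim and a network of num_nodes devices; the hidden-layer sizes are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn

# A sketch of the defender architecture in Figure 3: a policy head over
# nodes plus a 'null' action, and a separate value head.
class DefenderActorCritic(nn.Module):
    def __init__(self, obs_dim: int, num_nodes: int, hidden: int = 64):
        super().__init__()
        # Actor/policy head: one logit per node to reimage, plus one
        # 'null' action that leaves the network unchanged.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_nodes + 1),
        )
        # Critic/value head: a separate network producing a scalar state value.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor):
        logits = self.actor(obs)               # distribution over nodes + null
        value = self.critic(obs).squeeze(-1)   # state-value estimate
        return logits, value
```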

Figure 4. The multi-stage neural network (NN) architecture used for the attacker. When choosing an action, each element of the tuple is selected individually and passed on to the next stage of the actor/policy neural network. The ordering of the Type NN and Target NN may be swapped, and the Source NN and Credential NN are only used for certain action types. The value is chosen separately by the critic/value NN.
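
The staged selection can be sketched as follows for the ‘Type First’ ordering, where the sampled action type is fed into the target stage. This is a simplified illustration: the single-linear stages, layer shapes, and one-hot conditioning are our assumptions.

```python
import torch
import torch.nn as nn

# A sketch of a staged attacker policy in the spirit of Figure 4.
class AttackerPolicy(nn.Module):
    def __init__(self, obs_dim: int, num_types: int, num_nodes: int):
        super().__init__()
        self.type_net = nn.Linear(obs_dim, num_types)
        # The target stage also sees the chosen action type (one-hot).
        self.target_net = nn.Linear(obs_dim + num_types, num_nodes)
        self.num_types = num_types

    def forward(self, obs: torch.Tensor):
        type_logits = self.type_net(obs)
        action_type = torch.distributions.Categorical(logits=type_logits).sample()
        type_onehot = nn.functional.one_hot(action_type, self.num_types).float()
        # Pass the sampled type on to the next stage of the policy.
        target_logits = self.target_net(torch.cat([obs, type_onehot], dim=-1))
        target = torch.distributions.Categorical(logits=target_logits).sample()
        # Source and credential stages (omitted here) would be applied only
        # for the action types that require them.
        return action_type, target
```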

Table 1. Overview of training methods. The methods are chosen based on how they update the neural network. The training choices are gradient descent and/or weighted average, the number of samples (#Samples) used, and whether tree search is used

Figure 5. The iterative process used for evolutionary strategies (ES). At iteration t, a set of samples is taken from the distribution $(\mu_t, \sigma_t)$. These samples are then evaluated by executing episodes of CyberBattleSim. Sample performance is determined by average reward from these episodes. The distribution is then updated towards the better performing samples. The process then begins again with the new distribution $(\mu_{t+1}, \sigma_{t+1})$.
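
A minimal sketch of one such iteration, using a CEM-style elite update, is shown below. Here evaluate is a hypothetical stand-in for executing CyberBattleSim episodes and averaging their rewards, and the sample and elite counts are placeholders.

```python
import numpy as np

# One ES iteration in the spirit of Figure 5: sample, evaluate, update.
def es_step(mu, sigma, evaluate, n_samples=20, n_elite=5):
    # Sample candidate parameter vectors from the current distribution.
    samples = np.random.normal(mu, sigma, size=(n_samples, mu.shape[0]))
    # Score each sample by its average episode reward.
    rewards = np.array([evaluate(z) for z in samples])
    # Update the distribution toward the better-performing samples.
    elite = samples[np.argsort(rewards)[-n_elite:]]
    return elite.mean(axis=0), elite.std(axis=0)

# Usage: iterate to obtain (mu_{t+1}, sigma_{t+1}) from (mu_t, sigma_t):
# mu, sigma = es_step(mu, sigma, evaluate)
```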

Table 2. Example calculations (using mock numbers) for the round-robin system introduced in Section 4.2.2 with $N=3$. Each cell gives the average reward from executing $G_{CEM}$ episodes between an attacker (given by the row) and defender (given by the column). The average for each attacker and defender is given at the end of each row and column. The best-performing attacker is $z^A_2$ and the best-performing defender is $z^D_3$, so these samples will be relabeled as $x^A_1$ and $x^D_1$ and have the largest weight when calculating $\mu^A_{t+1}$ and $\mu^D_{t+1}$, respectively
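
The round-robin scoring itself reduces to averaging a reward matrix over its rows and columns. In the sketch below, play_episodes is a hypothetical stand-in for executing $G_{CEM}$ episodes between one attacker sample and one defender sample.

```python
import numpy as np

# Round-robin scoring in the spirit of Table 2.
def round_robin(attackers, defenders, play_episodes):
    rewards = np.array([[play_episodes(a, d) for d in defenders]
                        for a in attackers])
    # Attackers are ranked by their row averages (higher reward is better);
    # defenders by their column averages (lower attacker reward is better).
    attacker_order = np.argsort(-rewards.mean(axis=1))
    defender_order = np.argsort(rewards.mean(axis=0))
    # The top-ranked sample in each population receives the largest weight
    # when computing the next means mu^A_{t+1} and mu^D_{t+1}.
    return attacker_order, defender_order
```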

Table 3. Overview of training method evaluation. Columns give the evaluation setup and the section reporting its results; rows give the training method. CBS is CyberBattleSim

Table 4. Environment and algorithm parameters used in Connect 4 and SNAPT

Table 5. Environment and algorithm parameters used with CyberBattleSim

Table 6. Comparison of Connect 4 training algorithms without players using tree search (top) and with all players using tree search (bottom). Rows are the first player, and columns the second. Each cell shows how many games out of 100 the first player won against the second. The final column gives the average of each row and the final row gives the average of each column. The largest value in each column is bolded to highlight the best-performing first player. The smallest value in each row is underlined to highlight the best-performing second player

Table 7. Comparison of SNAPT Attacker vs. Defender competitions without agents using tree search (top) and with both adversaries using tree search (bottom). Rows are the Attackers and columns the Defenders. Each cell shows how many competitions out of 100 the Attacker won against the Defender. The final column gives the average of each row and the final row gives the average of each column. The largest value in each column is bolded to highlight the best-performing attacker. The smallest value in each row is underlined to highlight the best-performing defender

Figure 6. A heatmap showing the move (output) probabilities for the trained SNAPT agents at the initial state. The row is the Attacker or Defender and the column is the move (output), with darker shades indicating a higher probability of making that move. The Attacker has 3 possible moves, and the Defender has 6 possible moves.

Table 8. The average rewards (rounded to the nearest integer) from 100 episodes in each attacker and defender pairing. The column gives the defender and the row gives the attacker. As controls, we include an untrained attacker and defender that select each action at random, as well as an environment with no defender. The averages of each row and column are given at the end. The largest value in each column is bolded to show which attacker performed best against a given defender. The smallest value in each row is underlined to show which defender performed best against a given attacker

Figure 7. (Left) The attacker’s reward over time versus no defender for each algorithm used in training. (Right) The difference in reward between the attacker versus the trained defender and the attacker versus no defender during coevolution. The rewards are smoothed by averaging consecutive terms.

Figure 8. (Left) The Attacker’s reward at various checkpoints during the 24 hours of CEM training with and without the round-robin method. (Right) The Attacker’s reward at various checkpoints during the 24 hours of A2C+CEM training with the ‘Type First’, ‘Target First’, and ‘Simple’ neural network approaches described in Section 4.1. The rewards are smoothed by averaging consecutive terms.
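
The smoothing referred to in Figures 7 and 8 is a moving average over consecutive raw rewards; a minimal sketch (the window size is our assumption) is:

```python
import numpy as np

# Smooth a reward curve by averaging consecutive terms (moving average).
def smooth(rewards, window=10):
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```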

Table 9. The average rewards (rounded to the nearest integer) from 100 episodes for each pairing of trained attackers and defenders from the three attacker neural network architectures trained with A2C+CEM. The row gives the attacker and the column gives the defender. The average reward for each attacker is given in the Mean column. The largest value in each column is bolded to show which attacker performed best against a given defender

Table 10. CyberBattleSim settings: the feature vector observations given to the attacker and defender. The Agent column gives whether the observation is given to the attacker or defender, the Type column gives whether the observation is global or device-specific, the Observation column gives the name of the observation (see Section C), and the Dimension column gives the dimension of the observation. For device-specific observations, the dimension is multiplied by 6 as there are 6 nodes in the network. The total size of the attacker and defender observations is given in the Total row of the Observation column
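
Assembling such a feature vector amounts to concatenating the global observations with the flattened device-specific observations. The sketch below is schematic; the helper name and input layout are our assumptions, and the actual observation names are listed in Section C.

```python
import numpy as np

NUM_NODES = 6  # the CyberBattleSim network in this paper has 6 nodes

# Build a flat observation vector as in Table 10: each device-specific
# feature contributes (its dimension x NUM_NODES) entries.
def build_observation(global_feats, per_node_feats):
    # global_feats: list of 1-D arrays, one per global observation
    # per_node_feats: list of (NUM_NODES, dim) arrays, one per device observation
    parts = list(global_feats)
    parts += [f.reshape(-1) for f in per_node_feats]  # flatten each to dim * 6
    return np.concatenate(parts)
```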

Figure A.1. An overview of our experimental setup. We start with an untrained network and apply one of the training algorithms described in Section 4.2. For Connect 4 and SNAPT, we compare ES, DRL, and MCTS. For CyberBattleSim, we compare the CEM algorithms with and without round-robin, A2C+CEM with the three different attacker network architectures, and all of the algorithms using a Type First neural network.

Table B.1. The number of training iterations executed during the 24-hour training period for each algorithm for CyberBattleSim. For the combined methods (A2C+FB-ES and A2C+CEM), one iteration is counted as one step of both A2C and the ES method

Table C.1. The standard deviations (rounded to the nearest integer) from 100 episodes in each attacker and defender pairing. The column gives the defender and the row gives the attacker. As controls, we include an untrained attacker and defender that select each action at random, as well as an environment with no defender. The averages of each row and column are given at the end.