
Reliability assessment of off-policy deep reinforcement learning: A benchmark for aerodynamics

Published online by Cambridge University Press:  25 January 2024

Sandrine Berger*
Affiliation:
Department of Aerodynamics and Propulsion, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Andrea Arroyo Ramo
Affiliation:
Department of Aerodynamics and Propulsion, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Valentin Guillet
Affiliation:
Department of Complex Systems Engineering, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Thibault Lahire
Affiliation:
Department of Complex Systems Engineering, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Brice Martin
Affiliation:
Department of Complex Systems Engineering, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Thierry Jardin
Affiliation:
Department of Aerodynamics and Propulsion, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Emmanuel Rachelson
Affiliation:
Department of Complex Systems Engineering, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
Michaël Bauerheim
Affiliation:
Department of Aerodynamics and Propulsion, ISAE-SUPAERO, Université de Toulouse, Toulouse, France
*
Corresponding author: Sandrine Berger; Email: sand.qva@gmail.com

Abstract

Deep reinforcement learning (DRL) is promising for solving control problems in fluid mechanics, but it is a new field with many open questions. Possibilities are numerous and guidelines are scarce concerning the choice of algorithm or the best formulation for a given problem. Moreover, DRL algorithms learn a control policy by collecting samples from an environment, which may be very costly when coupled with Computational Fluid Dynamics (CFD) solvers. Algorithms must therefore minimize the number of samples required for learning (sample efficiency) and generate a usable policy from each training run (reliability). This paper aims to (a) evaluate three existing algorithms, namely deep deterministic policy gradient (DDPG), twin-delayed deep deterministic policy gradient (TD3), and soft actor-critic (SAC), on a fluid mechanics problem with respect to reliability and sample efficiency across a range of training configurations, (b) establish a fluid mechanics benchmark of increasing data collection cost, and (c) provide practical guidelines and insights for the fluid dynamics practitioner. The benchmark consists of controlling an airfoil to reach a target. The problem is solved either with a low-cost low-order model or with a high-fidelity CFD approach. The study found that DDPG and TD3 have learning stability issues that depend strongly on the DRL hyperparameters and the reward formulation, and therefore require significant tuning. In contrast, SAC is shown to be both reliable and sample efficient across a wide range of parameter setups, making it well suited to solving fluid mechanics problems and setting up new cases without tremendous effort. In particular, SAC is robust to small replay buffers, which could be critical if full flow fields were to be stored.
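
As a rough illustration of the multi-seed, multi-algorithm comparison described above, the sketch below trains DDPG, TD3, and SAC over several seeds with Stable-Baselines3 and a stand-in Gym environment. The library, the placeholder environment, and all hyperparameter values are assumptions made for illustration; they are not the authors' actual setup.

```python
# Illustrative sketch only: the library (Stable-Baselines3), the stand-in environment
# (Pendulum-v1), and the hyperparameter values below are assumptions, not the setup
# used in the article.
import gymnasium as gym
from stable_baselines3 import DDPG, TD3, SAC

ALGOS = {"DDPG": DDPG, "TD3": TD3, "SAC": SAC}
N_SEEDS = 20          # the article reports statistics over 20 identically parametrized trials
BUFFER_SIZE = 10_000  # deliberately small replay buffer (hypothetical value)

for name, Algo in ALGOS.items():
    for seed in range(N_SEEDS):
        env = gym.make("Pendulum-v1")  # stand-in for the airfoil trajectory environment
        model = Algo("MlpPolicy", env, buffer_size=BUFFER_SIZE, seed=seed, verbose=0)
        model.learn(total_timesteps=50_000)
        model.save(f"{name}_seed{seed}")  # one policy per trial, evaluated separately afterwards
```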

Information

Type
Research Article
Creative Commons
CC BY-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-ShareAlike licence (http://creativecommons.org/licenses/by-sa/4.0), which permits re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figures and Tables

Figure 1. Illustration of the airfoil trajectory control problem.

Figure 2. Lift $ \overrightarrow{L} $, drag $ \overrightarrow{D} $, and gravity $ \overrightarrow{g} $ forces applied to the center of mass of the flat plate. The flat plate velocity is denoted $ \overrightarrow{V} $, the pitch angle $ \beta $ and the angle of attack $ \alpha $. All angles are measured as displayed in this figure and signed counter-clockwise: $ \alpha $ is the angle going from the flat plate to the velocity vector so that $ \alpha >0 $ in the figure and $ \beta $ is the angle going from the horizontal ($ -\overrightarrow{x} $) to the flat plate so that $ \beta <0 $ in the figure.
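
The article's equations are not reproduced on this page, but the forces listed in the caption are consistent with a point-mass formulation of the plate dynamics along the lines of the sketch below, with the pitch rate $ \dot{\beta} $ acting as the control input. This is an illustrative reconstruction, not necessarily the authors' exact model.

```latex
% Illustrative point-mass reconstruction (assumption): m is the plate mass and
% \overrightarrow{g} the gravitational acceleration; the pitch rate is the control input.
m\,\frac{\mathrm{d}\overrightarrow{V}}{\mathrm{d}t}
  = \overrightarrow{L} + \overrightarrow{D} + m\,\overrightarrow{g},
\qquad
\frac{\mathrm{d}\beta}{\mathrm{d}t} = \dot{\beta}.
```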

Table 1. Mesh characteristics

Figure 3. Evolution of the lift $ {c}_L $ and drag $ {c}_D $ coefficients with the angle of attack $ \alpha $. Results from Taira and Colonius (2009) (■), present study using STAR-CCM+ (○), and fitting of the present study data using Eq. (3) (- - -).

Figure 4. Airfoil polar coordinates $ \left(\rho, \theta \right) $ defined relative to point B.
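
One plausible definition of these coordinates, stated here as an assumption since the article's corresponding equation is not reproduced on this page, is the distance and bearing of the airfoil position P relative to the target point B:

```latex
% Plausible definition (assumption): P is the airfoil position, B the target point.
\rho = \left\| \overrightarrow{BP} \right\|,
\qquad
\theta = \operatorname{atan2}\!\left( y_P - y_B,\; x_P - x_B \right).
```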

Figure 5. Generic architecture of the policy $ \pi $ neural network.

Figure 6. Generic architecture of the state-action value function $ {Q}^{\pi}\left(s,a\right) $ neural network.
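
The two captions above refer to the standard actor and critic networks used by off-policy algorithms such as DDPG, TD3, and SAC. A minimal PyTorch sketch of such generic architectures is given below; the layer widths and activations are illustrative assumptions, not the values reported in the article.

```python
# Generic actor (policy) and critic (state-action value) networks, sketched with
# PyTorch. Sizes and activations are assumptions for illustration only.
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a state s to an action a (deterministic here for simplicity)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded actions in [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class QFunction(nn.Module):
    """Maps a (state, action) pair to the scalar value Q(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```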

Figure 7. Illustration of the first task: A and B are fixed.

Figure 8. (a) Evolution of the evaluation return during the training phase. This return is obtained by evaluating the policy at the end of each episode during training. To give a sense of the learning “history” and convergence, the return is plotted at each episode and no smoothing or averaging is performed. (b) Visualization of the 1,000 different trajectories explored during training. To link the explored trajectories to the policy learning process, curves in both plots are colored by episode number.

Figure 9. (a) Trajectory, (b) normalized action $ \dot{\beta}/{\dot{\beta}}_0 $, and (c) pitch angle $ \beta $ for three selected episodes: the first and the best (i.e., highest-return) episodes of training, and the test episode. The corresponding returns are 80.99, 109.78, and 109.69, respectively.

Figure 10. (a) Trajectory, (b) normalized action $ \dot{\beta}/{\dot{\beta}}_0 $, and (c) pitch angle $ \beta $ obtained by testing 20 identically parametrized agents. Returns from all the tests are in the range $ \left[109.12,\;109.58\right] $.

Table 2. Influence of RL hyperparameters: case characteristics and number of won trials out of 20 realizations

Figure 11. Ensemble average and standard deviation of the evaluation return, computed using only the trials that led to a winning test trajectory (the average may therefore be taken over a different number of runs for each algorithm). The curves were obtained by evaluating the policy at the end of each episode during training.
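
The statistic plotted in Figures 11–13 and 18 (ensemble average and standard deviation over the winning trials only) amounts to a masked mean and standard deviation across trials; a short NumPy sketch is given below, with hypothetical array names and shapes.

```python
# Hypothetical sketch of the statistic shown in Figures 11-13 and 18: mean and standard
# deviation of the evaluation return computed only over the trials whose test trajectory
# reached the target. Array names and shapes are assumptions.
import numpy as np

def winning_trial_stats(returns: np.ndarray, won: np.ndarray):
    """returns: (n_trials, n_episodes) evaluation returns; won: (n_trials,) boolean mask."""
    winning = returns[won]  # keep only the winning trials
    return winning.mean(axis=0), winning.std(axis=0)
```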

Table 3. Influence of the reward model (Eq. 8): case characteristics and number of won trials out of 20 realizations

Figure 12. Ensemble average and standard deviation of the evaluation return, computed using only the trials that led to a winning test trajectory (the average may therefore be taken over a different number of runs for each algorithm). The curves were obtained by evaluating the policies at the end of each episode during training.

Table 4. Influence of the transition model: case characteristics and number of won trials out of 20 realizations

Figure 13. Ensemble average and standard deviation of the evaluation return, computed using only the trials that led to a winning test trajectory (the average may therefore be taken over a different number of runs for each algorithm). The curves were obtained by evaluating the policy at the end of each episode during training.

Figure 14. (a) Trajectory, (b) normalized action $ \dot{\beta}/{\dot{\beta}}_0 $, and (c) normalized action probability density function (PDF) for selected cases.

Figure 15. For all sets of parameters reported in Tables 2–4, the percentage of won trials is shown in green as a positive number and the percentage of lost trials in red as a negative number.

Figure 16. Illustration (at scale) of the second task: B can take any value in the domain delimited by dashed lines.

Figure 17. Policy evaluation is performed on a given set of B positions in the domain. (a) Set of 11 evaluation positions and (b) set of 4 evaluation positions.

Table 5. Second task, variable target point: case characteristics and number of trials, out of 20, that won on all tested target positions

Figure 18. Ensemble average and standard deviation of the evaluation return, computed using only the trials that led to a winning test trajectory (the average may therefore be taken over a different number of runs for each algorithm). The curves were obtained by evaluating the policy at the end of each episode during training.

Figure 19. (a) Trajectory, (b) normalized action $ \dot{\beta}/{\dot{\beta}}_0 $, and (c) normalized action probability density function (PDF) for selected cases.

Figure A1. Instantaneous vorticity field at various time steps of the CFD-DRL framework applied to the trajectory control of an airfoil going from point A to point B. The action sequence for this maneuver is obtained by testing one of the 20 policies trained with SAC for the CFD case discussed in Section 4.4.

Table B1. Generalization capabilities on the second task, variable target point: number of won trials out of 20 realizations
