A nonzero-sum game with reinforcement learning under mean-variance framework

Junyi Guo; Xia Han; Hao Wang; Kam C. Yuen

doi:10.1017/asb.2025.10080

Published online by Cambridge University Press: 04 December 2025

Affiliations

Junyi Guo: School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
Xia Han: School of Mathematical Sciences, LPMC and AAIS, Nankai University, Tianjin, 300071, China
Hao Wang*: School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
Kam C. Yuen: Department of Statistics and Actuarial Science, The University of Hong Kong, Pok Fu Lam, Hong Kong
*Corresponding author: Hao Wang; Email: hao.wang@mail.nankai.edu.cn

Abstract

In this paper, we investigate a competitive market involving two agents who consider both their own wealth and the wealth gap with their opponent. Both agents can invest in a financial market consisting of a risk-free asset and a risky asset, under conditions where model parameters are partially or completely unknown. This setup gives rise to a nonzero-sum differential game within the framework of reinforcement learning (RL). Each agent aims to maximize his own Choquet-regularized, time-inconsistent mean-variance objective. Adopting the dynamic programming approach, we derive a time-consistent Nash equilibrium strategy in a general incomplete market setting. Under the additional assumption of a Gaussian mean return model, we obtain an explicit analytical solution, which facilitates the development of a practical RL algorithm. Notably, the proposed algorithm achieves uniform convergence, even though the conventional policy improvement theorem does not apply to the equilibrium policy. Numerical experiments demonstrate the robustness and effectiveness of the algorithm, underscoring its potential for practical implementation.
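The relative-performance idea in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's model or algorithm: the wealth dynamics (a Black-Scholes market with a constant dollar amount `u` in the risky asset), the gap weight `k`, the risk-aversion parameter `gamma`, and all function names and default values are assumptions introduced here for illustration. The Choquet regularizer and the randomized (exploratory) policies are deliberately left out.

```python
import math
import random
import statistics


def terminal_wealth(x0, u, mu, sigma, r, horizon, shocks):
    """Euler scheme for dX = (r*X + u*(mu - r)) dt + u*sigma dW.

    u is a constant dollar amount held in the risky asset (an assumption).
    """
    dt = horizon / len(shocks)
    x = x0
    for z in shocks:
        x += (r * x + u * (mu - r)) * dt + u * sigma * math.sqrt(dt) * z
    return x


def relative_mv_objective(u_own, u_opp, k, gamma, x0=1.0, mu=0.08, sigma=0.2,
                          r=0.02, horizon=1.0, n_steps=50, n_paths=2000, seed=0):
    """Estimate E[G] - (gamma/2) * Var[G] with G = X_own(T) - k * X_opp(T).

    The weight k on the opponent's wealth is a hypothetical way to encode
    the "wealth gap" concern; k = 0 gives an ordinary mean-variance agent.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_paths):
        # Both agents trade the same risky asset, so they share the shocks.
        shocks = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
        own = terminal_wealth(x0, u_own, mu, sigma, r, horizon, shocks)
        opp = terminal_wealth(x0, u_opp, mu, sigma, r, horizon, shocks)
        gaps.append(own - k * opp)
    return statistics.fmean(gaps) - 0.5 * gamma * statistics.pvariance(gaps)


if __name__ == "__main__":
    # Scan the agent's own position against a fixed opponent position.
    for u in (0.0, 0.5, 1.0):
        print(u, relative_mv_objective(u_own=u, u_opp=0.5, k=0.5, gamma=2.0))
```

Because each agent's objective depends on the other's position through the gap term, one agent's best choice of `u_own` shifts with `u_opp`, which is what turns the two portfolio problems into a nonzero-sum game.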

Keywords

Stochastic processes; reinforcement learning; Choquet regularizers; mean-variance framework; time-consistent Nash equilibrium

Information

Type: Research Article
Information: ASTIN Bulletin: The Journal of the IAA, Volume 56, Issue 1, January 2026, pp. 154–180

DOI: https://doi.org/10.1017/asb.2025.10080
Copyright: © The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association
