A nonzero-sum game with reinforcement learning under mean-variance framework

Published online by Cambridge University Press: 04 December 2025

Junyi Guo
Affiliation:
School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
Xia Han
Affiliation:
School of Mathematical Sciences, LPMC and AAIS, Nankai University, Tianjin, 300071, China
Hao Wang*
Affiliation:
School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
Kam C. Yuen
Affiliation:
Department of Statistics and Actuarial Science, The University of Hong Kong, Pok Fu Lam, Hong Kong
*Corresponding author: Hao Wang; Email: hao.wang@mail.nankai.edu.cn

Abstract

In this paper, we investigate a competitive market involving two agents who consider both their own wealth and the wealth gap with their opponent. Both agents can invest in a financial market consisting of a risk-free asset and a risky asset, under conditions where model parameters are partially or completely unknown. This setup gives rise to a nonzero-sum differential game within the framework of reinforcement learning (RL). Each agent aims to maximize his own Choquet-regularized, time-inconsistent mean-variance objective. Adopting the dynamic programming approach, we derive a time-consistent Nash equilibrium strategy in a general incomplete market setting. Under the additional assumption of a Gaussian mean return model, we obtain an explicit analytical solution, which facilitates the development of a practical RL algorithm. Notably, the proposed algorithm achieves uniform convergence, even though the conventional policy improvement theorem does not apply to the equilibrium policy. Numerical experiments demonstrate the robustness and effectiveness of the algorithm, underscoring its potential for practical implementation.

Information

Type: Research Article
Copyright: © The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association
