
Learning self-play agents for combinatorial optimization problems

Published online by Cambridge University Press:  23 March 2020

Ruiyang Xu
Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
Karl Lieberherr
Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA


Recent progress in reinforcement learning (RL) using self-play has shown remarkable performance on several board games (e.g., Chess and Go) and video games (e.g., Atari games and Dota 2). It is plausible to hypothesize that RL, starting from zero knowledge, might gradually approach a winning strategy after a sufficient amount of training. In this paper, we explore neural Monte Carlo Tree Search (neural MCTS), an RL algorithm that has been applied successfully by DeepMind to play Go and Chess at a superhuman level. We try to leverage the computational power of neural MCTS to solve a class of combinatorial optimization problems. Following the idea of Hintikka’s Game-Theoretical Semantics, we propose the Zermelo Gamification to transform specific combinatorial optimization problems into Zermelo games whose winning strategies correspond to the solutions of the original optimization problems. A specially designed neural MCTS algorithm is then introduced to train Zermelo game agents. We use a prototype problem for which the ground-truth policy is efficiently computable to demonstrate that neural MCTS is promising.
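The neural MCTS the abstract describes builds on the plain UCT algorithm (Kocsis & Szepesvári 2006; Browne et al. 2012), replacing random playouts with a learned policy/value network. As an illustrative sketch only (the toy game and all names below are ours, not the paper's), plain UCT already finds the winning strategy of a tiny Zermelo game, single-heap Nim, for which the ground truth is known exactly:

```python
import math
import random

# Toy Zermelo game: single-heap Nim. Players alternate removing 1 or 2 stones;
# whoever takes the last stone wins. A position is lost for the player to move
# iff stones % 3 == 0, so the learned move can be checked against ground truth.

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones        # stones left after `move` was played
        self.parent = parent
        self.move = move            # 1 or 2 (None at the root)
        self.children = []
        self.visits = 0
        self.wins = 0.0             # from the viewpoint of the player who just moved

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2) if m <= self.stones and m not in tried]

def uct_child(node, c=1.4):
    # UCB1 applied to the game tree: exploit mean value, explore rare children.
    return max(node.children, key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def random_playout(stones, rng):
    # Returns True iff the player to move at `stones` takes the last stone.
    mover_won = False               # with no stones left, the mover has already lost
    mover_turn = True
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2) if m <= stones])
        mover_won = mover_turn      # whoever just took may have taken the last stone
        mover_turn = not mover_turn
    return mover_won

def mcts_move(stones, iterations, rng):
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried_moves() and node.children:
            node = uct_child(node)
        # 2. Expansion: add one untried child, if any.
        moves = node.untried_moves()
        if moves:
            m = rng.choice(moves)
            node.children.append(Node(node.stones - m, parent=node, move=m))
            node = node.children[-1]
        # 3. Simulation: random playout from the leaf.
        win = 0.0 if random_playout(node.stones, rng) else 1.0
        # 4. Backpropagation: flip the result at every level.
        while node is not None:
            node.visits += 1
            node.wins += win
            win = 1.0 - win
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts_move(4, 3000, random.Random(0)))  # the winning move leaves a multiple of 3
```

From 4 stones the only winning move is to take 1, leaving the opponent a multiple of 3, and the search's visit counts concentrate on that child. A neural variant would replace `random_playout` with a value-network evaluation and bias `uct_child` with policy priors (the PUCT rule used by AlphaZero), which is the direction the paper pursues.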

Research Article
© Cambridge University Press, 2020



Anthony, T., Tian, Z. & Barber, D. 2017. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, 5366–5376.
Auer, P., Cesa-Bianchi, N. & Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), 235–256.
Auger, D., Couëtoux, A. & Teytaud, O. 2013. Continuous upper confidence trees with polynomial exploration – consistency. In ECML/PKDD (1), Lecture Notes in Computer Science 8188, 194–209. Springer.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y. & Pascanu, R. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint.
Bello, I., Pham, H., Le, Q. V., Norouzi, M. & Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint.
Björnsson, Y. & Finnsson, H. 2009. CadiaPlayer: a simulation-based general game player. IEEE Transactions on Computational Intelligence and AI in Games 1(1), 4–15.
Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Liebana, D. P., Samothrakis, S. & Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1), 1–43.
Fujikawa, Y. & Min, M. 2013. A new environment for algorithm research using gamification. In IEEE International Conference on Electro-Information Technology, EIT 2013, Rapid City, SD, 1–6.
Genesereth, M., Love, N. & Pell, B. 2005. General game playing: overview of the AAAI competition. AI Magazine 26(2), 62.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning 70, 1263–1272. JMLR.org.
Hintikka, J. 1982. Game-theoretical semantics: insights and prospects. Notre Dame Journal of Formal Logic 23(2), 219–241.
Khalil, E., Dai, H., Zhang, Y., Dilkina, B. & Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6348–6358.
Kocsis, L. & Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML ’06, 282–293. Springer-Verlag.
Laterre, A., Fu, Y., Jabri, M. K., Cohen, A.-S., Kas, D., Hajjar, K., Dahl, T. S., Kerkeni, A. & Beguir, K. 2018. Ranked reward: enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540), 529–533.
Novikov, F. & Katsman, V. 2018. Gamification of problem solving process based on logical rules. In Informatics in Schools. Fundamentals of Computer Science and Software Engineering, Pozdniakov, S. N. & Dagienė, V. (eds). Springer International Publishing, 369–380.
Racanière, S., Weber, T., Reichert, D. P., Buesing, L., Guez, A., Rezende, D., Badia, A. P., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D. & Wierstra, D. 2017. Imagination-augmented agents for deep reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, 5694–5705. Curran Associates Inc.
Rezende, M. & Chaimowicz, L. 2017. A methodology for creating generic game playing agents for board games. In 2017 16th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 19–28. IEEE.
Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L. & Dill, D. L. 2018. Learning a SAT solver from single-bit supervision. arXiv preprint.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T. P., Simonyan, K. & Hassabis, D. 2017a. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. CoRR.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. & Hassabis, D. 2018. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science 362(6419), 1140–1144.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T. & Hassabis, D. 2017b. Mastering the game of Go without human knowledge. Nature 550, 354–359.
Sniedovich, M. 2003. OR/MS games: 4. The joy of egg-dropping in Braunschweig and Hong Kong. INFORMS Transactions on Education 4(1), 48–64.
Vinyals, O., Fortunato, M. & Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds). Curran Associates, Inc., 2692–2700.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.