Campbell, M., Hoane, A. J. Jr & Hsu, F.-H. Deep Blue. Artif. Intell. 134, 57–83 (2002).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Machado, M. et al. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).
Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
Schaeffer, J. et al. A world championship caliber checkers program. Artif. Intell. 53, 273–289 (1992).
Brown, N. & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 418–424 (2018).
Moravčík, M. et al. DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017).
Vlahavas, I. & Refanidis, I. Planning and Scheduling Technical Report (EETN, 2013).
Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
Deisenroth, M. & Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In Proc. 28th International Conference on Machine Learning, ICML 2011 465–472 (Omnipress, 2011).
Heess, N. et al. Learning continuous control policies by stochastic value gradients. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2944–2952 (MIT Press, 2015).
Levine, S. & Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. Adv. Neural Inf. Process. Syst. 27, 1071–1079 (2014).
Hafner, D. et al. Learning latent dynamics for planning from pixels. Preprint at https://arxiv.org/abs/1811.04551 (2018).
Kaiser, L. et al. Model-based reinforcement learning for Atari. Preprint at https://arxiv.org/abs/1903.00374 (2019).
Buesing, L. et al. Learning and querying fast generative models for reinforcement learning. Preprint at https://arxiv.org/abs/1802.03006 (2018).
Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. International Conference on Machine Learning, ICML Vol. 80 (eds Dy, J. & Krause, A.) 1407–1416 (2018).
Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J. & Munos, R. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (2019).
Horgan, D. et al. Distributed prioritized experience replay. In International Conference on Learning Representations (2018).
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming 1st edn (John Wiley & Sons, 1994).
Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games 72–83 (Springer, 2006).
Wahlström, N., Schön, T. B. & Deisenroth, M. P. From pixels to torques: policy learning with deep dynamical models. Preprint at http://arxiv.org/abs/1502.02251 (2015).
Watter, M., Springenberg, J. T., Boedecker, J. & Riedmiller, M. Embed to control: a locally linear latent dynamics model for control from raw images. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2746–2754 (MIT Press, 2015).
Ha, D. & Schmidhuber, J. Recurrent world models facilitate policy evolution. In NIPS’18: Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 2455–2467 (Curran Associates, 2018).
Gelada, C., Kumar, S., Buckman, J., Nachum, O. & Bellemare, M. G. DeepMDP: learning continuous latent space models for representation learning. Proc. 36th International Conference on Machine Learning: Volume 97 of Proc. Machine Learning Research (eds Chaudhuri, K. & Salakhutdinov, R.) 2170–2179 (PMLR, 2019).
van Hasselt, H., Hessel, M. & Aslanides, J. When to use parametric models in reinforcement learning? Preprint at https://arxiv.org/abs/1906.05243 (2019).
Tamar, A., Wu, Y., Thomas, G., Levine, S. & Abbeel, P. Value iteration networks. Adv. Neural Inf. Process. Syst. 29, 2154–2162 (2016).
Silver, D. et al. The predictron: end-to-end learning and planning. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 3191–3199 (JMLR, 2017).
Farahmand, A. M., Barreto, A. & Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proc. 20th International Conference on Artificial Intelligence and Statistics: Volume 54 of Proc. Machine Learning Research (eds Singh, A. & Zhu, J.) 1486–1494 (PMLR, 2017).
Farahmand, A. Iterative value-aware model learning. Adv. Neural Inf. Process. Syst. 31, 9090–9101 (2018).
Farquhar, G., Rocktäschel, T., Igl, M. & Whiteson, S. TreeQN and ATreeC: differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (2018).
Oh, J., Singh, S. & Lee, H. Value prediction network. Adv. Neural Inf. Process. Syst. 30, 6118–6128 (2017).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision 630–645 (2016).
Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
Schmitt, S., Hessel, M. & Simonyan, K. Off-policy actor-critic with shared experience replay. Preprint at https://arxiv.org/abs/1909.11583 (2019).
Azizzadenesheli, K. et al. Surprising negative results for generative adversarial tree search. Preprint at http://arxiv.org/abs/1806.05780 (2018).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
OpenAI. OpenAI Five. OpenAI https://blog.openai.com/openai-five/ (2018).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. Preprint at https://arxiv.org/abs/1611.05397 (2016).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning 282–293 (Springer, 2006).
Rosin, C. D. Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61, 203–230 (2011).
Schadd, M. P., Winands, M. H., van den Herik, H. J., Chaslot, G. M.-B. & Uiterwijk, J. W. Single-player Monte-Carlo tree search. In International Conference on Computers and Games 1–12 (Springer, 2008).
Pohlen, T. et al. Observe and look further: achieving consistent performance on Atari. Preprint at https://arxiv.org/abs/1805.11593 (2018).
Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. In International Conference on Learning Representations (2016).
Cloud TPU. Google Cloud https://cloud.google.com/tpu/ (2019).
Coulom, R. Whole-history rating: a Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games 113–124 (2008).
Nair, A. et al. Massively parallel methods for deep reinforcement learning. Preprint at https://arxiv.org/abs/1507.04296 (2015).
Lanctot, M. et al. OpenSpiel: a framework for reinforcement learning in games. Preprint at http://arxiv.org/abs/1908.09453 (2019).