
Automatic landmark discovery for learning agents under partial observability

Published online by Cambridge University Press: 02 August 2019

Alper Demir
Affiliation:
Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey e-mail: ademir@ceng.metu.edu.tr
Erkin Çilden
Affiliation:
RF and Simulation Systems Directorate, STM Defense Technologies Engineering and Trade Inc., 06530 Ankara, Turkey e-mail: erkin.cilden@stm.com.tr
Faruk Polat
Affiliation:
Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey e-mail: polat@ceng.metu.edu.tr

Abstract

In the reinforcement learning context, a landmark is a compact piece of information that uniquely identifies a state in problems with hidden states. Landmarks have been shown to support finding good memoryless policies for Partially Observable Markov Decision Processes (POMDPs) that contain at least one landmark. SarsaLandmark, an adaptation of Sarsa(λ), is known to deliver better learning performance under the assumption that all landmarks of the problem are known in advance.
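For intuition only, the sketch below (in Python, with illustrative names not taken from the paper) makes this definition concrete for a tabular POMDP whose observation model is assumed to be known: an observation is a landmark exactly when a single hidden state can emit it. The framework proposed here makes no such assumption; it must discover landmarks from experience.

```python
# A minimal sketch, assuming a tabular POMDP with a known observation model
# obs_prob[state][observation] = probability (names are illustrative, not from
# the paper). An observation qualifies as a landmark when exactly one hidden
# state can emit it, so observing it removes all ambiguity about the current state.

def find_landmarks(obs_prob):
    """Return the observations emitted by exactly one hidden state."""
    emitters = {}  # observation -> set of states that can produce it
    for state, dist in obs_prob.items():
        for obs, prob in dist.items():
            if prob > 0.0:
                emitters.setdefault(obs, set()).add(state)
    return {obs for obs, states in emitters.items() if len(states) == 1}

# Toy model: 'door' is seen only in state 2, so it is a landmark;
# 'wall' is perceptually aliased between states 0 and 1.
obs_prob = {
    0: {'wall': 1.0},
    1: {'wall': 0.7, 'corner': 0.3},
    2: {'door': 1.0},
}
print(find_landmarks(obs_prob))  # {'door'}
```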

In this paper, we propose a framework built upon SarsaLandmark that automatically identifies landmarks during learning, without sacrificing solution quality and without requiring any prior information about the problem structure. For this purpose, the framework fuses SarsaLandmark with a well-known multiple-instance learning algorithm, namely Diverse Density (DD). Through further experimentation, we also provide deeper insight into our concept filtering heuristic for accelerating DD, abbreviated DDCF (Diverse Density with Concept Filtering), which proves well suited to POMDPs with landmarks. DDCF outperforms its antecedent in both computation speed and solution quality without loss of generality.

The methods are empirically shown to be effective via extensive experimentation on a number of known and newly introduced problems with hidden state, and the results are discussed.
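To illustrate how Diverse Density can score candidate landmarks from experience, the hedged sketch below follows the noisy-or formulation of Maron & Lozano-Pérez (1998) in the trajectory-based setting of McGovern & Barto (2001): observations collected on successful episodes form positive bags, observations from unsuccessful episodes form negative bags, and a candidate scores high if it occurs in every positive bag and in no negative bag. The matching kernel and the bag construction here are illustrative assumptions, not the exact formulation used in the proposed framework.

```python
import math

def instance_prob(candidate, instance):
    """Pr(candidate is the target concept | one instance): 1.0 on an exact match,
    exp(-1) otherwise (a crude stand-in for the usual Gaussian distance kernel)."""
    return math.exp(-float(candidate != instance))

def bag_prob(candidate, bag, positive):
    """Noisy-or over a bag: a positive bag should contain the concept at least once,
    a negative bag should not contain it at all."""
    p_none = 1.0
    for inst in bag:
        p_none *= 1.0 - instance_prob(candidate, inst)
    return (1.0 - p_none) if positive else p_none

def diverse_density(candidate, positive_bags, negative_bags):
    """Product of per-bag probabilities; maximized by concepts close to every
    positive bag and far from every negative bag."""
    dd = 1.0
    for bag in positive_bags:
        dd *= bag_prob(candidate, bag, positive=True)
    for bag in negative_bags:
        dd *= bag_prob(candidate, bag, positive=False)
    return dd

# Toy usage: 'door' occurs on every successful trajectory and on no failed one,
# so it receives the highest diverse density score.
pos = [['wall', 'door', 'goal'], ['corner', 'door', 'goal']]
neg = [['wall', 'corner'], ['wall', 'wall']]
for cand in ['wall', 'corner', 'door']:
    print(cand, diverse_density(cand, pos, neg))
```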

Type: Research Article
Copyright: © Cambridge University Press, 2019

References

Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: the perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI ’92, 183–188. AAAI Press. https://www.aaai.org/Papers/AAAI/1992/AAAI92-029.pdf.
Daniel, C., van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357. doi: 10.1007/s10994-016-5580-x.
Demir, A., Çilden, E. & Polat, F. 2017. A concept filtering approach for diverse density to discover subgoals in reinforcement learning. In Proceedings of the 29th IEEE International Conference on Tools with Artificial Intelligence, ICTAI ’17, 1–5, short paper. doi: 10.1109/ICTAI.2017.00012.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303. doi: 10.1613/jair.639.
Digney, B. L. 1998. Learning hierarchical control structures for multiple tasks and changing environments. In Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior: From Animals to Animats 5, SAB ’98, 321–330. MIT Press. ISBN 0-262-66144-6.
Dung, L. T., Komeda, T. & Takagi, M. 2007. Reinforcement learning in non-Markovian environments using automatic discovery of subgoals. In SICE 2007 Annual Conference, 2601–2605. doi: 10.1109/SICE.2007.4421430.
Elkawkagy, M., Bercher, P., Schattenberg, B. & Biundo, S. 2012. Improving hierarchical planning performance by the use of landmarks. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 1763–1769. https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5070.
Frommberger, L. 2008. Representing and selecting landmarks in autonomous learning of robot navigation. In ICIRA 2008, LNAI 5314, 488–497. Springer-Verlag, Berlin, Heidelberg. doi: 10.1007/978-3-540-88513-9_53.
Goel, S. & Huber, M. 2003. Subgoal discovery for hierarchical reinforcement learning using learned policies. In Proceedings of the 16th International FLAIRS Conference, FLAIRS ’03, 346–350. AAAI Press. ISBN 1-57735-177-0.
Hengst, B. 2012. Hierarchical approaches. In Reinforcement Learning: State-of-the-Art, Adaptation, Learning, and Optimization 12, 293–323. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-27645-3_9.
Hoffmann, J., Porteous, J. & Sebastia, L. 2004. Ordered landmarks in planning. Journal of Artificial Intelligence Research 22, 215–278. doi: 10.1613/jair.1492.
Howard, A. & Kitchen, L. 1999. Navigation using natural landmarks. Robotics and Autonomous Systems 26(2–3), 99–115. doi: 10.1016/S0921-8890(98)00063-3.
Hwang, W., Kim, T., Ramanathan, M. & Zhang, A. 2008. Bridging centrality: graph mining from element level to group level. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 336–344. ACM. doi: 10.1145/1401890.1401934.
James, M. R. & Singh, S. P. 2009. SarsaLandmark: an algorithm for learning in POMDPs with landmarks. In 8th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’09, 585–591. http://www.ifaamas.org/Proceedings/aamas09/pdf/01_Full%20Papers/09_50_FP_0850.pdf.
Jiang, B. & Claramunt, C. 2004. Topological analysis of urban street networks. Environment and Planning B: Planning and Design 31(1), 151–162. doi: 10.1068/b306.
Jonsson, A. & Barto, A. 2006. Causal graph based decomposition of factored MDPs. Journal of Machine Learning Research 7, 2259–2301. http://dl.acm.org/citation.cfm?id=1248547.1248628.
Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1–2), 99–134. doi: 10.1016/S0004-3702(98)00023-X.
Karpas, E., Wang, D., Williams, B. C. & Haslum, P. 2015. Temporal landmarks: what must happen, and when. In Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, ICAPS ’15, 138–146. https://www.aaai.org/ocs/index.php/ICAPS/ICAPS15/paper/view/10605.
Koenig, S. & Simmons, R. G. 1998. Xavier: a robot navigation architecture based on partially observable Markov decision process models. In Artificial Intelligence and Mobile Robots, 91–122. MIT Press. http://idm-lab.org/bib/abstracts/papers/book98.pdf.
Lazanas, A. & Latombe, J.-C. 1995. Motion planning with uncertainty: a landmark approach. Artificial Intelligence 76(1–2), 287–317. doi: 10.1016/0004-3702(94)00079-G.
Loch, J. & Singh, S. P. 1998. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, 323–331. https://dl.acm.org/citation.cfm?id=657452.
Mannor, S., Menache, I., Hoze, A. & Klein, U. 2004. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 71–78. ACM. doi: 10.1145/1015330.1015355.
Maron, O. & Lozano-Pérez, T. 1998. A framework for multiple-instance learning. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS ’97, 570–576. MIT Press. http://papers.nips.cc/paper/1346-a-framework-for-multiple-instance-learning.pdf.
McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, 361–368. Morgan Kaufmann Publishers Inc. https://scholarworks.umass.edu/cs_faculty_pubs/8/.
Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In Machine Learning: ECML ’02, Proceedings of the 13th European Conference on Machine Learning, 295–306. Springer-Verlag. doi: 10.1007/3-540-36755-1_25.
Mugan, J. & Kuipers, B. 2009. Autonomously learning an action hierarchy using a learned qualitative state representation. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI ’09, 1175–1180. Morgan Kaufmann Publishers Inc. https://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewPaper/617.
Pickett, M. & Barto, A. G. 2002. PolicyBlocks: an algorithm for creating useful macro-actions in reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, 506–513. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645531.655988.
Simsek, O. 2008. Behavioral Building Blocks for Autonomous Agents: Description, Identification, and Learning. PhD thesis, University of Massachusetts Amherst.
Simsek, O., Wolfe, A. P. & Barto, A. G. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, 816–823. ACM. doi: 10.1145/1102351.1102454.
Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In Proceedings of the 5th International Symposium on Abstraction, Reformulation, and Approximation, Koenig, S. & Holte, R. C. (eds), LNCS 2371, 212–223. Springer, Berlin, Heidelberg. doi: 10.1007/3-540-45622-8_16.
Sutton, R. S. & Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0-262-19398-6.
Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211. doi: 10.1016/S0004-3702(99)00052-1.
Uther, W. & Veloso, M. 2003. TTree: tree-based state generalization with temporally abstract actions. In AAMAS 2002, Lecture Notes in Computer Science 2636, 260–290. Springer, Berlin, Heidelberg. doi: 10.1007/3-540-44826-8_16.
Välimäki, T. & Ritala, R. 2016. Optimizing gaze direction in a visual navigation task. In IEEE International Conference on Robotics and Automation, ICRA ’16, 1427–1432. IEEE. doi: 10.1109/ICRA.2016.7487276.
Watts, D. J. & Strogatz, S. H. 1998. Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442. doi: 10.1038/30918.
Whitehead, S. D. & Ballard, D. H. 1991. Learning to perceive and act by trial and error. Machine Learning 7(1), 45–83. doi: 10.1023/A:1022619109594.
Wikipedia 2018. Landmark. https://en.wikipedia.org/wiki/Landmark (visited on 22 January 2018).
Xiao, D., Li, Y. & Shi, C. 2014. Autonomic discovery of subgoals in hierarchical reinforcement learning. The Journal of China Universities of Posts and Telecommunications 21(5), 94–104. doi: 10.1016/S1005-8885(14)60337-X.
Yang, B. & Liu, J. 2008. Discovering global network communities based on local centralities. ACM Transactions on the Web 2(1), 1–32. doi: 10.1145/1326561.1326570.
Yoshikawa, T. & Kurihara, M. 2006. An acquiring method of macro-actions in reinforcement learning. In IEEE International Conference on Systems, Man, and Cybernetics, SMC ’06, vol. 6, 4813–4817. doi: 10.1109/ICSMC.2006.385067.