
Automatic landmark discovery for learning agents under partial observability

  • Alper Demir (a1), Erkin Çilden (a2) and Faruk Polat (a3)


In the reinforcement learning context, a landmark is a compact piece of information that uniquely identifies a state, in problems with hidden state. Landmarks have been shown to support finding good memoryless policies for Partially Observable Markov Decision Processes (POMDPs) that contain at least one landmark. SarsaLandmark, an adaptation of Sarsa(λ), is known to promise better learning performance under the assumption that all landmarks of the problem are known in advance.
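
To make the definition concrete (an illustration added here, not part of the paper): under a deterministic observation function, an observation is a landmark exactly when a single hidden state can emit it. A minimal Python sketch, with hypothetical names such as find_landmarks and observation_of, checks this property:

    from collections import defaultdict

    def find_landmarks(observation_of):
        """Return the observations that uniquely identify a hidden state.

        observation_of (hypothetical name) maps each hidden state to the
        observation it deterministically emits; an observation emitted by
        exactly one state is a landmark in the sense defined above.
        """
        states_emitting = defaultdict(set)
        for state, obs in observation_of.items():
            states_emitting[obs].add(state)
        return {o for o, states in states_emitting.items() if len(states) == 1}

    # Toy aliased domain: 'B' is seen in two distinct states (perceptual
    # aliasing), while 'A' and 'C' each identify their state uniquely.
    obs_map = {0: 'A', 1: 'B', 2: 'B', 3: 'C'}
    print(find_landmarks(obs_map))  # {'A', 'C'}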

In this paper, we propose a framework built upon SarsaLandmark that automatically identifies the landmarks of a problem during learning, without sacrificing solution quality and without requiring any prior information about the problem structure. For this purpose, the framework fuses SarsaLandmark with a well-known multiple-instance learning algorithm, namely Diverse Density (DD). Through further experimentation, we also provide deeper insight into our concept filtering heuristic for accelerating DD, abbreviated DDCF (Diverse Density with Concept Filtering), which proves to be well suited to POMDPs with landmarks. DDCF outperforms its predecessor in terms of computation speed and solution quality without loss of generality.
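
For the flavor of the DD computation (a simplified discrete sketch, not the paper's exact noisy-or formulation): successful episodes contribute positive bags of observations and failures contribute negative bags, and a candidate observation scores highly when it occurs in many positive bags and few negative ones. The concept filter shown here, which keeps only candidates present in every positive bag, is an illustrative stand-in for the DDCF heuristic defined in Demir et al. (2017):

    def dd_score(candidate, positive_bags, negative_bags):
        """Simplified discrete Diverse Density: the fraction of positive
        bags containing the candidate times the fraction of negative bags
        avoiding it (the real DD uses a probabilistic noisy-or model)."""
        pos = sum(candidate in bag for bag in positive_bags) / len(positive_bags)
        neg = sum(candidate not in bag for bag in negative_bags) / len(negative_bags)
        return pos * neg

    def filtered_candidates(positive_bags):
        """Illustrative concept filter: a landmark on the solution path
        should appear in every successful trajectory, so candidates absent
        from some positive bag are discarded before scoring."""
        return set.intersection(*map(set, positive_bags))

    # Two successful episodes and one failure, as bags of observations.
    positive_bags = [{'A', 'B', 'D'}, {'C', 'B', 'D'}]
    negative_bags = [{'A', 'C'}]
    for c in sorted(filtered_candidates(positive_bags)):
        print(c, dd_score(c, positive_bags, negative_bags))  # B 1.0, D 1.0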

The methods are empirically shown to be effective via extensive experimentation on a number of known and newly introduced problems with hidden state, and the results are discussed.



References
Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: the perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, 183–188. AAAI Press.
Daniel, C., van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357. doi: 10.1007/s10994-016-5580-x.
Demir, A., Çilden, E. & Polat, F. 2017. A concept filtering approach for diverse density to discover subgoals in reinforcement learning. In Proceedings of the 29th IEEE International Conference on Tools with Artificial Intelligence, ICTAI'17, 1–5, Short Paper. doi: 10.1109/ICTAI.2017.00012.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303. doi: 10.1613/jair.639.
Digney, B. L. 1998. Learning hierarchical control structures for multiple tasks and changing environments. In Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior: From Animals to Animats 5, SAB'98, 321–330. MIT Press. ISBN 0-262-66144-6.
Dung, L. T., Komeda, T. & Takagi, M. 2007. Reinforcement learning in non-Markovian environments using automatic discovery of subgoals. In SICE, 2007 Annual Conference, 2601–2605. doi: 10.1109/SICE.2007.4421430.
Elkawkagy, M., Bercher, P., Schattenberg, B. & Biundo, S. 2012. Improving hierarchical planning performance by the use of landmarks. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 1763–1769.
Frommberger, L. 2008. Representing and selecting landmarks in autonomous learning of robot navigation. In ICIRA 2008, LNAI 5314, 488–497. Springer-Verlag, Berlin, Heidelberg. doi: 10.1007/978-3-540-88513-9_53.
Goel, S. & Huber, M. 2003. Subgoal discovery for hierarchical reinforcement learning using learned policies. In Proceedings of the 16th International FLAIRS Conference, FLAIRS'03, 346–350. AAAI Press. ISBN 1-57735-177-0.
Hengst, B. 2012. Hierarchical approaches. In Reinforcement Learning: State-of-the-Art, Adaptation, Learning, and Optimization 12, 293–323. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-27645-3_9.
Hoffmann, J., Porteous, J. & Sebastia, L. 2004. Ordered landmarks in planning. Journal of Artificial Intelligence Research 22, 215–278. doi: 10.1613/jair.1492.
Howard, A. & Kitchen, L. 1999. Navigation using natural landmarks. Robotics and Autonomous Systems 26(2–3), 99–115. doi: 10.1016/S0921-8890(98)00063-3.
Hwang, W., Kim, T., Ramanathan, M. & Zhang, A. 2008. Bridging centrality: graph mining from element level to group level. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 336–344. ACM. doi: 10.1145/1401890.1401934.
James, M. R. & Singh, S. P. 2009. SarsaLandmark: an algorithm for learning in POMDPs with landmarks. In 8th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS'09, 585–591.
Jiang, B. & Claramunt, C. 2004. Topological analysis of urban street networks. Environment and Planning B: Planning and Design 31(1), 151–162. doi: 10.1068/b306.
Jonsson, A. & Barto, A. 2006. Causal graph based decomposition of factored MDPs. Journal of Machine Learning Research 7, 2259–2301.
Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1–2), 99–134. doi: 10.1016/S0004-3702(98)00023-X.
Karpas, E., Wang, D., Williams, B. C. & Haslum, P. 2015. Temporal landmarks: what must happen, and when. In Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, ICAPS'15, 138–146.
Koenig, S. & Simmons, R. G. 1998. Xavier: a robot navigation architecture based on partially observable Markov decision process models. In Artificial Intelligence and Mobile Robots, 91–122. MIT Press.
Lazanas, A. & Latombe, J.-C. 1995. Motion planning with uncertainty: a landmark approach. Artificial Intelligence 76(1–2), 287–317. doi: 10.1016/0004-3702(94)00079-G.
Loch, J. & Singh, S. P. 1998. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML'98, 323–331.
Mannor, S., Menache, I., Hoze, A. & Klein, U. 2004. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML'04, 71–78. ACM. doi: 10.1145/1015330.1015355.
Maron, O. & Lozano-Pérez, T. 1998. A framework for multiple-instance learning. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS'97, 570–576. MIT Press.
McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML'01, 361–368. Morgan Kaufmann Publishers Inc.
Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the 13th European Conference on Machine Learning, ECML'02, 295–306. Springer-Verlag. doi: 10.1007/3-540-36755-1_25.
Mugan, J. & Kuipers, B. 2009. Autonomously learning an action hierarchy using a learned qualitative state representation. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, 1175–1180. Morgan Kaufmann Publishers Inc.
Pickett, M. & Barto, A. G. 2002. PolicyBlocks: an algorithm for creating useful macro-actions in reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML'02, 506–513. Morgan Kaufmann Publishers Inc.
Simsek, O. 2008. Behavioral Building Blocks for Autonomous Agents: Description, Identification, and Learning. PhD thesis, University of Massachusetts Amherst.
Simsek, O., Wolfe, A. P. & Barto, A. G. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, ICML'05, 816–823. ACM. doi: 10.1145/1102351.1102454.
Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In Proceedings of the 5th International Symposium on Abstraction, Reformulation, and Approximation, Koenig, S. & Holte, R. C. (eds), LNCS 2371, 212–223. Springer, Berlin, Heidelberg. doi: 10.1007/3-540-45622-8_16.
Sutton, R. S. & Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0-262-19398-6.
Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211. doi: 10.1016/S0004-3702(99)00052-1.
Uther, W. & Veloso, M. 2003. TTree: tree-based state generalization with temporally abstract actions. In AAMAS 2002, Lecture Notes in Computer Science 2636, 260–290. Springer, Berlin, Heidelberg. doi: 10.1007/3-540-44826-8_16.
Välimäki, T. & Ritala, R. 2016. Optimizing gaze direction in a visual navigation task. In IEEE International Conference on Robotics and Automation, ICRA'16, 1427–1432. IEEE. doi: 10.1109/ICRA.2016.7487276.
Watts, D. J. & Strogatz, S. H. 1998. Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442. doi: 10.1038/30918.
Whitehead, S. D. & Ballard, D. H. 1991. Learning to perceive and act by trial and error. Machine Learning 7(1), 45–83. doi: 10.1023/A:1022619109594.
Wikipedia 2018. Landmark. (visited on 22 January 2018).
Xiao, D., Li, Y. & Shi, C. 2014. Autonomic discovery of subgoals in hierarchical reinforcement learning. The Journal of China Universities of Posts and Telecommunications 21(5), 94–104. doi: 10.1016/S1005-8885(14)60337-X.
Yang, B. & Liu, J. 2008. Discovering global network communities based on local centralities. ACM Transactions on the Web 2(1), 1–32. doi: 10.1145/1326561.1326570.
Yoshikawa, T. & Kurihara, M. 2006. An acquiring method of macro-actions in reinforcement learning. In IEEE International Conference on Systems, Man, and Cybernetics, SMC'06, 6, 4813–4817. doi: 10.1109/ICSMC.2006.385067.
