Mining Tree-Structured Data on Multicore Systems

doi:10.1017/CBO9781139042918.021

20 - Mining Tree-Structured Data on Multicore Systems

from Part Four - Applications

Published online by Cambridge University Press: 05 February 2012

Shirish Tatikonda and Srinivasan Parthasarathy

Edited by Ron Bekkerman, Mikhail Bilenko and John Langford

Shirish Tatikonda: IBM Research, San Jose, CA, USA
Srinivasan Parthasarathy: Ohio State University
Ron Bekkerman: LinkedIn Corporation, Mountain View, California
Mikhail Bilenko: Microsoft Research, Redmond, Washington
John Langford: Yahoo! Research, New York

Summary

Mining frequent subtrees in a database of rooted and labeled trees is an important problem in many domains, ranging from phylogenetic analysis to biochemistry and from linguistic parsing to XML data analysis. In this work, we revisit this problem and develop an architecture-conscious solution targeting emerging multicore systems. Specifically, we identify a sequence of memory-related optimizations that significantly improve the spatial and temporal locality of a state-of-the-art sequential algorithm – alleviating the effects of memory latency. Additionally, these optimizations are shown to reduce the pressure on the front-side bus, an important consideration in the context of large-scale multicore architectures. We then demonstrate that these optimizations, although necessary, are not sufficient for efficient parallelization on multicores, primarily because of parametric and data-driven factors that make load balancing a significant challenge. To address this challenge, we present a methodology that adaptively and automatically modulates the type and granularity of the work being shared among different cores. The resulting algorithm achieves near-perfect parallel efficiency on up to 16 processors on challenging real-world applications. The optimizations we present have general-purpose utility, and a key outcome is the development of a general-purpose scheduling service for moldable task scheduling on emerging multicore systems.

The field of knowledge discovery is concerned with extracting actionable knowledge from data efficiently. Although most of the early work in this field focused on mining simple transactional datasets, recently there has been a significant shift toward analyzing data with complex structure such as trees and graphs.

Information

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 420–445

DOI: https://doi.org/10.1017/CBO9781139042918.021

Publisher: Cambridge University Press

Print publication year: 2011
