20 - Mining Tree-Structured Data on Multicore Systems

from Part Four - Applications

Published online by Cambridge University Press: 05 February 2012

Shirish Tatikonda
Affiliation:
IBM Research, San Jose, CA, USA
Srinivasan Parthasarathy
Affiliation:
Ohio State University
Ron Bekkerman
Affiliation:
LinkedIn Corporation, Mountain View, California
Mikhail Bilenko
Affiliation:
Microsoft Research, Redmond, Washington
John Langford
Affiliation:
Yahoo! Research, New York

Summary

Mining frequent subtrees in a database of rooted and labeled trees is an important problem in many domains, ranging from phylogenetic analysis to biochemistry and from linguistic parsing to XML data analysis. In this work, we revisit this problem and develop an architecture-conscious solution targeting emerging multicore systems. Specifically, we identify a sequence of memory-related optimizations that significantly improve the spatial and temporal locality of a state-of-the-art sequential algorithm, alleviating the effects of memory latency. Additionally, these optimizations are shown to reduce the pressure on the front-side bus, an important consideration in the context of large-scale multicore architectures. We then demonstrate that these optimizations, although necessary, are not sufficient for efficient parallelization on multicores, primarily because of parametric and data-driven factors that make load balancing a significant challenge. To address this challenge, we present a methodology that adaptively and automatically modulates the type and granularity of the work being shared among different cores. The resulting algorithm achieves near-perfect parallel efficiency on up to 16 processors on challenging real-world applications. The optimizations we present have general-purpose utility, and a key outcome is the development of a general-purpose scheduling service for moldable task scheduling on emerging multicore systems.

The field of knowledge discovery is concerned with extracting actionable knowledge from data efficiently. Although most of the early work in this field focused on mining simple transactional datasets, recently there has been a significant shift toward analyzing data with complex structure such as trees and graphs.

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 420–445
Publisher: Cambridge University Press
Print publication year: 2011
