3 - Large-Scale Machine Learning Using DryadLINQ

from Part One - Frameworks for Scaling Up Machine Learning

Published online by Cambridge University Press: 05 February 2012

Mihai Budiu (Microsoft Research, Mountain View, CA, USA)
Dennis Fetterly (Microsoft Research, Mountain View, CA, USA)
Michael Isard (Microsoft Research, Mountain View, CA, USA)
Frank McSherry (Microsoft Research, Mountain View, CA, USA)
Yuan Yu (Microsoft Research, Mountain View, CA, USA)
Ron Bekkerman (LinkedIn Corporation, Mountain View, California)
Mikhail Bilenko (Microsoft Research, Redmond, Washington)
John Langford (Yahoo! Research, New York)

Summary

This chapter describes DryadLINQ, a general-purpose system for large-scale data-parallel computing, and illustrates its use on a number of machine learning problems.

The main motivation behind DryadLINQ was to make it easier for nonspecialists to write general-purpose, scalable programs that operate on very large input datasets. To appeal to nonspecialists, we designed the programming interface at a high level of abstraction that insulates the programmer from most of the detail and complexity of parallel and distributed execution. To support general-purpose computing, we embedded these high-level abstractions in .NET, giving developers access to full-featured programming languages with rich type systems and proven mechanisms (such as classes and libraries) for managing complex, long-lived, and geographically distributed software projects. To scale to very large datasets and compute clusters, the DryadLINQ compiler generates code for the Dryad runtime, a well-tested and highly efficient distributed execution engine.

As machine learning moves into the industrial mainstream and operates over diverse data types, including documents, images, and graphs, it is increasingly appealing to move away from domain-specific languages such as MATLAB and toward general-purpose languages that support rich types and standardized libraries. The examples in this chapter demonstrate two points: a general-purpose language such as C# supports effective, concise implementations of standard machine learning algorithms, and DryadLINQ scales these implementations efficiently to hundreds of computers and to very large datasets whose size is limited primarily by disk capacity.
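
As a rough illustration of the programming style the summary describes, the sketch below computes a per-label mean over a partitioned dataset. It is plain Python, not DryadLINQ or C#, and every name in it (`PARTITIONS`, `partial_sums`, `merge`, `mean_by_label`) is hypothetical. It only assumes the general pattern of data-parallel aggregation: each partition is reduced locally, and the small partial results are then combined, while the caller sees a single high-level function.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (label, feature value) records split into partitions,
# standing in for a dataset spread across many machines.
PARTITIONS = [
    [("spam", 2.0), ("ham", 1.0), ("spam", 4.0)],
    [("ham", 3.0), ("spam", 6.0)],
    [("ham", 5.0)],
]


def partial_sums(partition):
    """Per-partition step: accumulate a local (sum, count) pair per label."""
    acc = defaultdict(lambda: [0.0, 0])
    for label, value in partition:
        acc[label][0] += value
        acc[label][1] += 1
    return dict(acc)


def merge(partials):
    """Combine step: add up the partial (sum, count) pairs, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for label, (s, n) in part.items():
            total[label][0] += s
            total[label][1] += n
    return {label: s / n for label, (s, n) in total.items()}


def mean_by_label(partitions):
    """What the caller asks for: a per-label mean. Partitioning stays hidden."""
    # Threads stand in for cluster workers in this single-machine sketch.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sums, partitions))
    return merge(partials)


means = mean_by_label(PARTITIONS)
print(means)
```

Only the compact (sum, count) pairs cross partition boundaries here, which is the usual reason for splitting an aggregation into a local step and a combine step. A real distributed system would also have to handle scheduling, data movement, and failures, none of which this sketch attempts.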

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 49–68
Publisher: Cambridge University Press
Print publication year: 2011
