
5 - Big data analytics systems

from Part II - Big data over cyber networks

Published online by Cambridge University Press: 18 December 2015

Ganesh Ananthanarayanan, Microsoft Research, USA
Ishai Menache, Microsoft Research, USA
Shuguang Cui, Texas A&M University
Alfred O. Hero, III, University of Michigan, Ann Arbor
Zhi-Quan Luo, University of Minnesota
José M. F. Moura, Carnegie Mellon University, Pennsylvania

Summary

Performing timely analysis on huge datasets is the central promise of big data analytics. To cope with the high volumes of data to be analyzed, computation frameworks have resorted to “scaling out” – parallelizing analytics so that they execute seamlessly across large clusters. These frameworks automatically decompose analytics jobs into a DAG of small tasks, and then aggregate the intermediate results from the tasks to obtain the final result. Their ability to do so relies on an efficient scheduler and a reliable storage layer that distributes the datasets across machines.
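
To make the decomposition concrete, the following is a minimal sketch in Python of executing a job expressed as a DAG of tasks. The run_dag helper, the task names, and the toy ready-task scheduler are all hypothetical illustrations, not the API of any of the frameworks discussed in this chapter.

def run_dag(tasks, deps):
    """Run tasks in dependency order; each task receives its parents' outputs.

    tasks: dict mapping task name -> function(parent_outputs) -> output
    deps:  dict mapping task name -> list of parent task names
    """
    results = {}
    remaining = set(tasks)
    while remaining:
        # A toy scheduler: pick any task whose parents have all finished.
        ready = next(t for t in remaining
                     if all(p in results for p in deps.get(t, [])))
        results[ready] = tasks[ready]([results[p] for p in deps.get(ready, [])])
        remaining.remove(ready)
    return results

# Three leaf tasks compute partial sums over data partitions (conceptually in
# parallel); a final task aggregates their intermediate results.
partitions = [[1, 2, 3], [4, 5], [6]]
tasks = {
    "sum0": lambda _: sum(partitions[0]),
    "sum1": lambda _: sum(partitions[1]),
    "sum2": lambda _: sum(partitions[2]),
    "total": lambda inputs: sum(inputs),
}
deps = {"total": ["sum0", "sum1", "sum2"]}
print(run_dag(tasks, deps)["total"])  # 21

In a real system the leaf tasks would run on different machines, and the scheduler would choose among many ready tasks based on locality and fairness; the toy loop above only captures the dependency-ordering aspect.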

In this chapter, we survey these two aspects, scheduling and storage, which are the foundations of modern big data analytics systems. We describe their key principles, and how these principles are realized in widely deployed systems.

Introduction

Analyzing large volumes of data has become a major source of innovation for large Internet services as well as scientific applications. Examples of such “big data analytics” occur in personalized recommendation systems, online social networks, genomic analyses, and legal investigations for fraud detection. A key property of the algorithms employed for such analyses is that they provide better results as the amount of data processed increases. In fact, in certain domains (like search) there is a trend towards using relatively simple algorithms and instead relying on more data to produce better results.

While the amount of data to be analyzed keeps growing, the acceptable time to produce results keeps shrinking. Timely analyses have significant ramifications for revenue as well as productivity. Low-latency results in online services lead to improved user satisfaction and revenue. The ability to crunch large datasets in short periods enables faster iteration and progress on scientific theories.

To cope with the tension between ever-growing datasets and shrinking times to analyze them, analytics clusters have resorted to scaling out. Data are spread across many different machines, and the computations on them are executed in parallel. Such scaling out is crucial for fast analytics, and allows coping with the trend of datasets growing faster than Moore's-law improvements in processor speed.
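
As a minimal single-machine sketch of this idea, the Python fragment below splits a dataset into partitions and processes them with a pool of workers standing in for a cluster; the process_partition function and the error-counting workload are hypothetical placeholders, not part of any real framework.

from multiprocessing import Pool

def process_partition(lines):
    # Per-partition work: count the records matching a simple predicate.
    return sum(1 for line in lines if "error" in line)

if __name__ == "__main__":
    dataset = ["error: disk", "ok", "error: net", "ok", "ok", "error: cpu"]
    n_workers = 3
    # Spread the data across workers, as a cluster spreads it across machines.
    partitions = [dataset[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial = pool.map(process_partition, partitions)
    # Aggregate the partial results into the final answer.
    print(sum(partial))  # 3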

Many data analytics frameworks have been built for such scale-out parallel execution. Among the most widely used are MapReduce [1], Dryad [2], and Apache YARN [3].
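
To illustrate the programming model that MapReduce [1] popularized, here is a minimal word-count sketch in plain Python: the user supplies only the map and reduce functions, while a real framework (unlike the toy driver below) shards the input, shuffles the intermediate pairs, and runs everything in parallel across a cluster. The function names are illustrative, not the actual MapReduce or Hadoop API.

from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Aggregate all counts emitted for the same word.
    return word, sum(counts)

def mapreduce(documents):
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # The "shuffle" step: group intermediate pairs by key before reducing.
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog", "the fox"]))
# [('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]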

References

[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM EuroSys, 2007.
[3] V. Vavilapalli et al., “Apache Hadoop YARN: yet another resource negotiator,” in ACM SoCC, 2013.
[4] B. Hindman, A. Konwinski, M. Zaharia, et al., “Mesos: a platform for fine-grained resource sharing in the data center,” in USENIX NSDI, 2011.
[5] M. Shreedhar and G. Varghese, “Efficient fair queuing using deficit round-robin,” IEEE/ACM Transactions on Networking, vol. 4, no. 3, pp. 375–385, 1996.
[6] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” in ACM SIGCOMM Computer Communication Review, vol. 19, no. 4, 1989, pp. 1–12.
[7] A. Ghodsi, M. Zaharia, B. Hindman, et al., “Dominant resource fairness: fair allocation of multiple resource types,” in USENIX NSDI, 2011.
[8] M. Zaharia, D. Borthakur, J. S. Sarma, et al., “Job scheduling for multi-user MapReduce clusters,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-55, 2009.
[9] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, “Multi-resource packing for cluster schedulers,” in ACM SIGCOMM, 2014, pp. 455–466. [Online]. Available: http://doi.acm.org/10.1145/2619239.2626334.
[10] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, “Choosy: max-min fair sharing for datacenter jobs with constraints,” in Proceedings of the 8th ACM European Conference on Computer Systems, 2013, pp. 365–378.
[11] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das, “Modeling and synthesizing task placement constraints in Google compute clusters,” in Proceedings of the 2nd ACM Symposium on Cloud Computing, 2011, p. 3.
[12] M. Zaharia, D. Borthakur, J. S. Sarma, et al., “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in Proceedings of the 5th European Conference on Computer Systems, 2010, pp. 265–278.
[13] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: fair scheduling for distributed computing clusters,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009, pp. 261–276.
[14] P. Bodík, I. Menache, M. Chowdhury, et al., “Surviving failures in bandwidth-constrained datacenters,” in Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 2012, pp. 431–442.
[15] A. D. Ferguson, P. Bodík, S. Kandula, E. Boutin, and R. Fonseca, “Jockey: guaranteed job latency in data parallel clusters,” in Proceedings of the 7th ACM European Conference on Computer Systems, 2012, pp. 99–112.
[16] C. Curino, D. E. Difallah, C. Douglas, et al., “Reservation-based scheduling: if you're late don't blame us!” in Proceedings of the ACM Symposium on Cloud Computing, 2014, pp. 1–14.
[17] N. Jain, I. Menache, J. Naor, and J. Yaniv, “Near-optimal scheduling mechanisms for deadline-sensitive jobs in large computing clusters,” in SPAA, 2012, pp. 255–266.
[18] B. Lucier, I. Menache, J. Naor, and J. Yaniv, “Efficient online scheduling for deadline-sensitive jobs: extended abstract,” in SPAA, 2013, pp. 305–314.
[19] P. Bodík, I. Menache, J. S. Naor, and J. Yaniv, “Brief announcement: deadline-aware scheduling of big-data processing jobs,” in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014, pp. 211–213.
[20] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce performance in heterogeneous environments,” in USENIX OSDI, 2008.
[21] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, “Reining in the outliers in map-reduce clusters using Mantri,” in USENIX OSDI, 2010.
[22] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: attack of the clones,” in USENIX NSDI, 2013, pp. 185–198.
[23] “POSIX,” http://pubs.opengroup.org/onlinepubs/9699919799/.
[24] G. Ananthanarayanan, A. Ghodsi, A. Wang, et al., “PACMan: coordinated memory caching for parallel jobs,” in USENIX NSDI, 2012.
[25] G. Ananthanarayanan, S. Agarwal, S. Kandula, et al., “Scarlett: coping with skewed content popularity in MapReduce clusters,” in ACM EuroSys, 2011.
[26] M. Zaharia, M. Chowdhury, T. Das, et al., “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” in USENIX NSDI, 2012.
[27] S. Melnik, A. Gubarev, J. J. Long, et al., “Dremel: interactive analysis of web-scale datasets,” in Proceedings of the 36th International Conference on Very Large Data Bases, 2010, pp. 330–339.
[28] R. Xin, J. Rosen, M. Zaharia, et al., “Shark: SQL and rich analytics at scale,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.
[29] S. Agarwal, B. Mozafari, A. Panda, et al., “BlinkDB: queries with bounded errors and bounded response times on very large data,” in Proceedings of the 8th ACM European Conference on Computer Systems, 2013.
[30] J. Liu, W.-K. Shih, K.-J. Lin, R. Bettati, and J.-Y. Chung, “Imprecise computations,” Proceedings of the IEEE, 1994.
[31] S. Lohr, Sampling: Design and Analysis, Thomson, 2009.
[32] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, “Energy efficiency for large-scale MapReduce workloads with significant interactive analysis,” in Proceedings of the 7th ACM European Conference on Computer Systems, 2012, pp. 43–56.
[33] Z. Liu, Y. Chen, C. Bash, et al., “Renewable and cooling aware workload management for sustainable data centers,” in ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 1, 2012, pp. 175–186.
[34] A. Beloglazov, R. Buyya, Y. C. Lee, et al., “A taxonomy and survey of energy-efficient data centers and cloud computing systems,” Advances in Computers, vol. 82, no. 2, pp. 47–111, 2011.
[35] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, “Optimal power allocation in server farms,” in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1, 2009, pp. 157–168.
[36] A. Gandhi, V. Gupta, M. Harchol-Balter, and M. A. Kozuch, “Optimality analysis of energy-performance trade-off for server farm management,” Performance Evaluation, vol. 67, no. 11, pp. 1155–1171, 2010.
[37] N. Buchbinder, N. Jain, and I. Menache, “Online job-migration for reducing the electricity bill in the cloud,” in NETWORKING 2011, Springer, 2011, pp. 172–185.
[38] “EC2 pricing,” http://aws.amazon.com/ec2/pricing/.
[39] I. Menache, O. Shamir, and N. Jain, “On-demand, spot, or both: dynamic resource allocation for executing batch jobs in the cloud,” in 11th International Conference on Autonomic Computing (ICAC), 2014.
