
Jargon of Hadoop MapReduce scheduling techniques: a scientific categorization

  • Muhammad Hanif (a1) and Choonhwa Lee (a1)

Abstract

The valuable knowledge that can be extracted from huge volumes of data (so-called Big Data) has recently set in motion the development of frameworks that process data using parallel and distributed computing, including Apache Hadoop, Facebook Corona, and Microsoft Dryad. Apache Hadoop is an open source implementation of Google MapReduce that has attracted strong attention from the research community in both academia and industry. Hadoop MapReduce scheduling algorithms play a critical role in the management of large commodity clusters, controlling QoS requirements by supervising user, job, and task execution. Hadoop MapReduce provides three schedulers: FIFO, Fair, and Capacity. However, the research community has developed new optimizations to account for advances and dynamic changes in hardware and operating environments. Numerous efforts in the literature address network congestion, straggling, data locality, heterogeneity, resource under-utilization, and skew mitigation in Hadoop scheduling. The volume of research on Hadoop scheduling published in journals and conferences has increased steadily, which makes it difficult for researchers to grasp the overall state of the research and the areas that require further investigation. This study presents a scientific literature review that assesses prior research contributions to the Apache Hadoop scheduling mechanism. We classify and quantify the main issues addressed in the literature based on their jargon and the areas they address. Moreover, we explain and discuss the various challenges and open issues in Hadoop scheduling optimizations.
