Recent Advances in Computer Science and Communications

Author(s): Gnanendra Kotikam* and Lokesh Selvaraj

DOI: 10.2174/2666255816666220831125012

YARN Schedulers for Hadoop MapReduce Jobs: Design Goals, Issues and Taxonomy

Article ID: e310822208349 Pages: 12


Abstract

Objective: Big data processing is a demanding task, and several big data processing frameworks have emerged in recent decades. The performance of these frameworks depends heavily on their resource management models.

Methods: YARN is one such model. It acts as a resource management layer and, through its schedulers, provides computational resources to execution engines (Spark, MapReduce, Storm, etc.). The most important aspect of resource management is job scheduling.

Results: In this paper, we first present the design goals of YARN's real-world schedulers (FIFO, Capacity, and Fair) for the MapReduce engine. We then discuss the scheduling issues of Hadoop MapReduce clusters.

Conclusion: Much work in the literature addresses the issues of data locality, heterogeneity, stragglers, skew mitigation, and fairness in Hadoop MapReduce scheduling. Lastly, we present a taxonomy of the scheduling algorithms available in the literature, based on factors such as environment, scope, approach, objective, and addressed issues.

Keywords: Hadoop MapReduce, YARN schedulers, scheduling issues, fair scheduling, energy consumption, virtualization
