Research on MapReduce Heuristic Multi Table Join Algorithm Based on
Binary Optimization and Pancake Parallel Strategy

Songhao      Jia; Jizheng      Yang; Cai      Yang; Haiyu      Zhang

Abstract

Background: With the development of technology, the data amount has increased significantly. In data processing, the multi table query is the most frequent operation. Because the join keys cannot correspond one by one, there will be much redundant data transmission, resulting in a waste of network bandwidth.

Objective: In order to solve the problems of network overhead and low efficiency, this paper proposes a heuristic multi table join optimization method. By sharing information, the unconnected tuples are eliminated so as to reduce the amount of data transmitting. This shortens response time and improves execution performance.

Methods: Firstly, the join key information of one table is compressed by the algorithm to make the filtered information for sharing. Then, the concurrent execution is controlled according to the pancake parallel strategy. Finally, the selection strategy of multi table join order is proposed.

Results/Discussion: The experiments show that the proposed algorithm can filter a large amount of useless data and improve query efficiency. At the same time, the proposed algorithm reduces a lot of network overhead, improves the algorithm performance, and better solves the problem of low efficiency of multi table join.

Conclusion: This paper introduces the heuristic strategy to optimize the algorithm, so that it can perform the join tasks in parallel, which further improves the performance of multi table join. The algorithm creatively combines heuristic data filtering, which greatly improves the quality of data processing. The algorithm is worth popularizing and applying.

Keywords: Multi table join optimization, bloom filter algorithm, binary optimization, pancake spreading strategy, pipeline model, heuristic algorithm.

Graphical Abstract

[1]
E. Coppa, I. Finocchi,  and R.L. Garcia, "Counting cliques in parallel without a cluster: Engineering a fork/join algorithm for shared-memory platforms", Inf. Sci., vol. 496, pp. 553-571, 2019.
 [http://dx.doi.org/10.1016/j.ins.2018.07.018]
[2]
P. Koutris, S. Salihoglu,  and D. Suciu, "Algorithmic aspects of parallel data processing", Found. Trends Databases, vol. 8, no. 4, pp. 239-370, 2018.
 [http://dx.doi.org/10.1561/1900000055]
[3]
F.N. Afrati, N. Stasinopoulos, J.D. Ullman,  and A. Vassilakopoulos, "SharesSkew: An algorithm to handle skew for joins in MapReduce", Inf. Syst., vol. 77, pp. 129-150, 2018.
 [http://dx.doi.org/10.1016/j.is.2018.06.005]
[4]
S. Rababa,  and A. Al-Badarneh, "Optimizations for filter-based join algorithms in MapReduce", J. Intell. Fuzzy Syst., vol. 40, no. 5, pp. 8963-8980, 2021.
 [http://dx.doi.org/10.3233/JIFS-201220]
[5]
Á.M. García-Vico, F. Charte, P. González, D. Elizondo,  and C.J. Carmona, "E2PAMEA: A fast evolutionary algorithm for extracting fuzzy emerging patterns in big data environments", Neurocomputing, vol. 415, pp. 60-73, 2020.
 [http://dx.doi.org/10.1016/j.neucom.2020.07.007]
[6]
F. García-García, A. Corral, L. Iribarne,  and M. Vassilakopoulos, "Improving distance-join query processing with voronoi-diagram based partitioning in spatialhadoop", Future Gener. Comput. Syst., vol. 111, pp. 723-740, 2020.
 [http://dx.doi.org/10.1016/j.future.2019.10.037]
[7]
S. Benbernou, X. Huang,  and M. Ouziri, "Semantic-based and entity-resolution fusion to enhance quality of big RDF Data", IEEE Trans. Big Data, vol. 7, no. 2, pp. 436-450, 2021.
 [http://dx.doi.org/10.1109/TBDATA.2017.2710346]
[8]
S. Tamil Selvan, P. Balamurugan,  and M. Vijayakumar, "Prefetched wald adaptive boost classification based Czekanowski similarity MapReduce for user query processing with bigdata", Distrib. Parallel Databases, vol. 39, no. 4, pp. 855-872, 2021.
 [http://dx.doi.org/10.1007/s10619-020-07319-6]
[9]
M. Fidler, B. Walker,  and Y. Jiang, "Non-asymptotic delay bounds for multi-server systems with synchronization constraints", IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 7, pp. 1545-1559, 2018.
 [http://dx.doi.org/10.1109/TPDS.2017.2779872]
[10]
U. Suthakar, L. Magnoni, D.R. Smith,  and A. Khan, "Optimised lambda architecture for monitoring scientific infrastructure", IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 6, pp. 1395-1408, 2021.
 [http://dx.doi.org/10.1109/TPDS.2017.2772241]
[11]
M. Aksa, J. Rashid, M. Wasif Nisar, T. Mahmood, H.Y. Kwon,  and A. Hussain, "Bitmapaligner: Bit-parallelism string matching with-mapreduce and hadoop", Comput. Mater. Continua, vol. 68, no. 3, pp. 3931-3946, 2021.
 [http://dx.doi.org/10.32604/cmc.2021.016081]
[12]
B.R. Prasad,  and S. Agarwal, "Design development and performance analysis of distributed least square twin support vector machine for binary classification", Turk. J. Electr. Eng. Comput. Sci., vol. 29, no. 7, pp. 2934-2949, 2021.
 [http://dx.doi.org/10.3906/elk-2008-155]
[13]
R. Sujitha,  and B. Paramasivan, "Distributed healthcare framework using MMSM-SVM and P-SVM classificationt", Comput. Mater. Continua, vol. 70, no. 1, pp. 1557-1572, 2022.
 [http://dx.doi.org/10.32604/cmc.2022.019323]
[14]
O. Rottenstreich, P. Reviriego, E. Porat,  and S. Muthukrishnan, "Avoiding flow size overestimation in count-min sketch with bloom filter constructions", IEEE Trans. Netw. Serv. Manag., vol. 18, no. 3, pp. 3662-3676, 2021.
 [http://dx.doi.org/10.1109/TNSM.2021.3068604]
[15]
L. Toumi,  and A. Ugur, "Static and incremental dynamic approaches for multi-objective bitmap join indexes selection in data ware-houses", J. Supercomput., vol. 77, no. 4, pp. 3933-3958, 2021.
 [http://dx.doi.org/10.1007/s11227-020-03423-7]
[16]
S.N. Bhattu, A. Potluri, P. Kadari,  and R.B.V. Subramanyam, "Generalized communication cost efficient multi-way spatial join: Revisiting the curse of the last reducer", GeoInformatica, vol. 24, pp. 557-589, 2020.
 [http://dx.doi.org/10.1007/s10707-019-00387-6]
[17]
D. Rafiei,  and F. Deng, "Similarity join and similarity self-join size estimation in a streaming environment", IEEE Trans. Knowl. Data Eng., vol. 32, no. 4, pp. 768-781, 2020.
 [http://dx.doi.org/10.1109/TKDE.2019.2893175]
[18]
M. Gowanlock, "Hybrid KNN-join: Parallel nearest neighbor searches exploiting CPU and GPU architectural features", J. Parallel Distrib. Comput., vol. 149, pp. 119-137, 2021.
 [http://dx.doi.org/10.1016/j.jpdc.2020.11.004]
[19]
M.A. Naeem, "Optimization and extension of stream-relation joins", Int. J. Inf. Technol. Decis. Mak, vol. 18, no. 4, pp. 1289-1315, 2019.
 [http://dx.doi.org/10.1142/S0219622019500214]
[20]
I.M. Al Jawarneh, P. Bellavista, A. Corradi, L. Foschini,  and R. Montanari, "Efficient QoS-aware spatial join processing for scalable NoSQL storage frameworks", IEEE Trans. Netw. Serv. Manag., vol. 18, no. 2, pp. 2437-2449, 2021.
 [http://dx.doi.org/10.1109/TNSM.2020.3034150]
[21]
R. Ebenstein,  and G. Agrawal, "DistriPlan: An optimized join execution framework for geo-distributed scientific data", Distrib. Parallel Databases, vol. 38, no. 1, pp. 127-152, 2020.
 [http://dx.doi.org/10.1007/s10619-019-07264-z]
[22]
S. Dolev, P. Gupta, Y. Li, S. Mehrotra,  and S. Sharma, "Privacy-preserving secret shared computations using mapreduce", IEEE Trans. Depend. Secure Comput., vol. 18, pp. 1645-1666, 2021.
[23]
P. Moutafis, G. Mavrommatis, M. Vassilakopoulos,  and S. Sioutas, "Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework", Data Knowl. Eng., vol. 121, pp. 42-70, 2019.
 [http://dx.doi.org/10.1016/j.datak.2019.04.003]
[24]
S. Scherzinger, "Build your own SQL-on-hadoop query engine", SIGMOD Rec., vol. 48, no. 2, pp. 33-38, 2019.
 [http://dx.doi.org/10.1145/3377330.3377336]
[25]
Y. Khan, A. Zimmermann, A. Jha, V. Gadepally, M. D’Aquin,  and R. Sahay, "One size does not fit all: Querying web polystores", IEEE Access, vol. 7, pp. 9598-9617, 2019.
 [http://dx.doi.org/10.1109/ACCESS.2018.2888601]
[26]
Q. Baert, A.C. Caron, M. Morge, J.C. Routier,  and K. Stathis, "An adaptive multi-agent system for task reallocation in a MapReduce job", J. Parallel Distrib. Comput., vol. 153, pp. 75-88, 2021.
 [http://dx.doi.org/10.1016/j.jpdc.2021.03.008]
[27]
K. Watts,  and C. Thuen, "Lexicographically-aware and capability-aware self-advising modules for temporal data assembly", U.S. Patent 20,200,082,012 A1, 2020.
[28]
D.S. Douches, "Overcoming self-cincompatibility in diploid palants for breeding and production of hybrids", U.S. Patent 20,220,025,394 A1, 2022.
[29]
S. Pal, A. Bhattacharjee, R. Delanoy,  and Y. Wang, "Search time estimate in a data intake and query system", U.S. Patent 20, 200,364,223 A1, 2020.
[30]
U. Ben-david, T. Golub, R. Beroukhim, O. Enache,  and V. Rendo, "Dna damage response signature guided rational design of crispr-based systems and therapies", U.S. Patent 20,210,147,828, A1, 2021.
[31]
R.T. Drmanac, B.A. Peters,  and O. Wang, "Single tube bead-based DNA co-barcoding for accurate and cost-effective dequencing, haplo-typing, and assembly", U.S. Patent 20,210,115, 595 A1, 2021.
[32]
B.L. Hurwitz, G.S. Watts, I. Choi,  and J.H. Hartman, "Methods for comparative metagenomic analysis", U.S. Patent 20, 210,249,102 A1, 2021.
[33]
A. Olgiati, R.R. Huilgol,  and V. Kumar, "GPU code injection to summarize machine learning training data", U.S. Patent 20,210, 097,432 A1, 2021.
[34]
M.D. Fuchs, "Master data management technologies", U.S. Patent 20,210,089,512 A1, 2021.
[35]
D.A. Ghazaleh, "Database server embedded process and code accelerator", U.S. Patent 20,200,327,130 A1, 2020.
[36]
R. Redon, G. Loirand, R. Bourcier,  and H. Desal, "Methods and compositions for predicting and treating intracranial aneurysm", U.S. Patent 20,200,256,879 A1, 2020.
[37]
B.T. Adanve, "Methods for decentralized genome storage, distribution, marketing and analysis", U.S. Patent 20,200,073,560 A1, 2020.
[38]
S. Kang, S. Lee,  and J. Kim, "Distributed graph cube generation using Spark framework", J. Supercomput., vol. 76, no. 10, pp. 8118-8139, 2020.
 [http://dx.doi.org/10.1007/s11227-019-02746-4]
[39]
H. Yuan, K.K.R. Patil,  and G.H. Milby, "Spatial joins in multi-processing computing systems including massively parallel processing data-base systems", U.S. Patent 20,200,401,585 A1, 2020.
[40]
S. Hsaini, S. Azzouzi,  and M.E.H. Charaf, "A temporal based approach for MapReduce distributed testing", Int. J. Parallel Emergent Distrib. Syst., vol. 36, no. 4, pp. 293-311, 2021.
 [http://dx.doi.org/10.1080/17445760.2021.1879068]
[41]
S. Behnezhad, L. Dhulipala, H. Esfandiari, J. Lacki, V. Mirrokni,  and W. Schudy, "Parallel graph algorithms in constant adaptive rounds", Proceedings VLDB Endowment, vol. 13, no. 13, pp. 3588-3602, 2020.
 [http://dx.doi.org/10.14778/3424573.3424579]
[42]
Z. Dafir, Y. Lamari,  and S.C. Slaoui, "A survey on parallel clustering algorithms for Big Data", Artif. Intell. Rev., vol. 54, no. 4, pp. 2411-2443, 2021.
 [http://dx.doi.org/10.1007/s10462-020-09918-2]
[43]
E. Gavagsaz, A. Rezaee,  and H. Haj Seyyed Javadi, "Load balancing in join algorithms for skewed data in MapReduce systems", J. Supercomput., vol. 75, no. 1, pp. 228-254, 2019.
 [http://dx.doi.org/10.1007/s11227-018-2578-0]
[44]
D. Medhat, A.H. Yousef,  and C. Salama, "Cost-aware load balancing for multilingual record linkage using MapReduce", Ain Shams Eng. J., vol. 11, no. 2, pp. 419-433, 2020.
 [http://dx.doi.org/10.1016/j.asej.2019.08.009]
[45]
D. Rajeswari, M. Prakash,  and J. Suresh, "Computational grid scheduling architecture using MapReduce model-based nondominated sorting genetic algorithm", Soft Comput., vol. 23, no. 18, pp. 8335-8347, 2019.
 [http://dx.doi.org/10.1007/s00500-019-03946-z]

Cite As

Recent Patents on Engineering

Research on MapReduce Heuristic Multi Table Join Algorithm Based on Binary Optimization and Pancake Parallel Strategy

Abstract

Graphical Abstract