Accelerating Hash Join Performance by Exploiting Data Distribution
Yang Liu#1, Zhen He#2, Xiang Wu Meng "Accelerating Hash Join Performance by Exploiting Data Distribution ". International Journal of Computer Trends and Technology (IJCTT) V56(1):6-20, February 2018. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.
Abstract -
thejoin operator in relational databases is one of the most IO intensive operations. Thelarge size of input relations makes it hard to fit them entirely in RAM during join processing. Therefore therelations are processed in chucks inside a RAM buffer of limited size.The ideabehindasuccessfuljoin algorithm is to make the most efficient use of the limited sized buffer to minimizethenumberof IOs. The hash join algorithm has been a popular algorithm due to its relativelylowIOcostscompared to other methods. In this paper we make the observation that the performanceofthehash join can be dramatically improved if we take advantage of skewed distributionsandmissingvalues in join attributes. We propose the filtered hash join (FH-join) which filtersouttuplesoftheinput relations during the partitioning phase of the hash join to minimize the workleft forthejoinphase. The results show FH-join can outperform the hybrid hash join by up to a factor 4 in terms of total execution time when the data is much skewed.
References
[1] Hayes, T., Palomar, O., Unsal, O., Cristal, A., and Valero, M. (2012) Vector Extensions for Decision Support DBMS Acceleration. The 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO45), pp. 166–176.
[2] DeWitt, D. J., Katz, R. H., Olken, F., Shapiro, L. D., Stonebraker, M. R., and Wood, D. A. (1984) Implementation techniques for main memory database systems. Proceedings ACM SIGMOD ’84, pp. 1–8.
[3] Do, J. and Patel, J. M. (2009) Join processing for flash ssds: remembering past lessons. DaMoN, pp. 1–8.
[4] Mullin, J. K. (1983) A second look at bloom filters. Communications of the ACM, 26, 570–571.
[5] Mishra, P. and Eich, M. H. (1992). Join processing in relational databases. ACM Computing Survey, 24, 63–113.
[6] Nooshin S, Mirzadeh, Kockerber, O, Falsafi B, and Grot B. (2015) Sort vs. Hash join revisited for near-memory execution, pp. 1-6
[7] Balkesen C, Teubner J, Alonso G, et al. (2015). Main-Memory Hash Joins on Modern Processor Architectures [J]. IEEE Transactions on Knowledge and Data Engineering, , 27(7): 1754-1766.?
[8] Kim, C., Kaldewey, T., Lee, V. W., Sedlar, E., Nguyen, A. D., Satish, N., Chhugani, J., Di Blas, A., and Dubey, P. (2009) Sort vs. hash revisited: fast join implementation on modern multi-core cpus. Proceedings of the VLDB, 2, 1378–1389.
[9] Blanas, S., Li, Y., and Patel, J. M. (2011) Design and evaluation of main memory hash join algorithms for multicore cpus. SIGMOD Conference, pp. 37–48.
[10] Chen, S., Ailamaki, A., Gibbons, P. B., and Mowry, T. C (2004) Improving hash join performance through prefetching. Proceedings of the 20th International Conference on Data Engineering ICDE ’04, pp. 116–127.
[11] Manegold, S., Boncz, P. A., and Kersten, M. L. (2000) What happens during a join? Dissecting CPU and memory optimization effects. Proceedings of the VLDB, pp. 339–350.
[12] Chen, S., Ailamaki, A., Gibbons, P. B., and Mowry, T. C. (2005) Inspector joins. Proceedings of VLDB, pp. 817–828.
[13] Manegold, S., Boncz, P., and Kersten, M. (2002) Optimizing main-memory join on modern hardware. IEEE Transactions on Knowledge and Data Engineering, 14, 709–730.
[14] Zeller, H. and Gray, J. (1990) An adaptive hash join algorithm for multiuser environments. In McLeod, D., Sacks-Davis, R., and Schek, H.-J. (eds.), Proceedings of VLDB, pp. 186–197.
[15] Martin, P., Larson, P.-A°., and Deshpande, V. (1994) Paralle hash-based join algorithms for a shared-everything. IEEE Trans. Knowl. Data Eng., 6, 750–763.
[16] Kitsuregawa, M., Nakayama, M., and Takagi, M. (1989) The effect of bucket size tuning in the dynamic hybrid grace hashjoin method. VLDB, pp. 257–266.
[17] Kitsuregawa M, Ogawa Y. Bucket spreading parallel hash: a new, robust, parallel hash join method for data skew in the super database computer (SDC)[J]. Very large data bases, 1990: 210-221.
[18] Kitsuregawa, M., ichiro Tsudaka, S., and Nakano, M. (1992) Parallel grace hash join on shared-everything multiprocessor:Implementation and performance evaluation on symmetry s81. Proceedings of ICDE.
[19] DeWitt, D. J., Naughton, J. F., Schneider, D. A., and Seshadri, S. (1992) Practical skew handling in parallel joins. Proceedings of VLDB, pp. 27–40.
[20] Haas, L. M., Carey, M. J., Livny, M., and Shukla, A. (1997) Seeking the truth about ad hoc join costs. The VLDB Journal, 6, 241–256
Keywords
hash join, relational databases, query processing