Data Analysis using Mapper and Reducer with Optimal Configuration in Hadoop

Sasiniveda.G; Revathi.N

doi:https://doi.org/10.14445/22312803/IJCTT-V4I3P113

Research Article | Open Access

Volume 4 | Issue 3 | Year 2013 | Article Id. IJCTT-V4I3P113 | DOI : https://doi.org/10.14445/22312803/IJCTT-V4I3P113

Citation :

Sasiniveda.G, Revathi.N, "Data Analysis using Mapper and Reducer with Optimal Configuration in Hadoop," International Journal of Computer Trends and Technology (IJCTT), vol. 4, no. 3, pp. 264-268, 2013. Crossref, https://doi.org/10.14445/22312803/IJCTT-V4I3P113

Abstract

Data analysis is an important function in cloud computing that allows huge amounts of data to be processed over very large clusters. Hadoop is a software framework for large-scale data analysis. It provides the Hadoop Distributed File System (HDFS), and the analysis and transformation of very large data sets are performed using the MapReduce paradigm. MapReduce, a programming model widely used for processing large data sets, is a popular way to handle data in the cloud environment because of its excellent scalability and good fault tolerance. HDFS is designed to stream those data sets. The Hadoop MapReduce system has often been unfair in its allocation, and a dramatic improvement is achieved through the Elastic Mapper Reducer System. The proposed Mapper Reducer function allows us to analyze a data set and execute the job with better performance by using an optimal configuration of mappers and reducers based on the size of the data set. It also lets users view the status of a job and localize errors in scheduled jobs. This makes efficient use of the performance properties of optimized scheduled jobs. As a result, the system substantially lowers cost, energy usage, and management complexity while increasing performance.
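The idea of sizing mappers and reducers from the input can be illustrated with a minimal sketch. It is not the Elastic Mapper Reducer System described above, whose details are not given here. It assumes one map task per input split, a 64 MB split size (an assumed default that varies by Hadoop version and configuration), and the rule of thumb from the Hadoop MapReduce documentation that the number of reduces is about 0.95 times the cluster's total reduce capacity. The function names `estimate_map_tasks` and `estimate_reduce_tasks` are hypothetical.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(input_bytes, split_bytes=64 * MB):
    """Estimate map tasks as one task per input split.

    split_bytes is an assumption; real jobs derive it from the
    HDFS block size and the input format's split settings.
    """
    if input_bytes <= 0:
        return 0
    return math.ceil(input_bytes / split_bytes)


def estimate_reduce_tasks(nodes, reduce_slots_per_node, factor=0.95):
    """Estimate reduce tasks from cluster capacity.

    factor=0.95 follows the Hadoop documentation's rule of thumb,
    which lets all reduces start as soon as the maps finish.
    At least one reducer is always returned.
    """
    capacity = nodes * reduce_slots_per_node
    return max(1, int(factor * capacity))


if __name__ == "__main__":
    # Hypothetical job: 1 GB of input on a 10-node cluster with 2 reduce slots per node.
    print("map tasks:", estimate_map_tasks(1024 * MB))
    print("reduce tasks:", estimate_reduce_tasks(10, 2))
```

In practice the framework computes the number of map tasks from the input splits itself, so a sketch like this is mainly useful for choosing the reducer count and for judging whether the split size suits the data set.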

Keywords

Cloud Computing, Hadoop Distributed File System, Performance Paradigm.
