Data Transfer Between RDBMS and HDFS By Using the Spark Framework in Sqoop for Better Performance

© 2021 by IJCTT Journal
Volume-69 Issue-3
Year of Publication : 2021
Authors : Hariteja Bodepudi
DOI : 10.14445/22312803/IJCTT-V69I3P103

How to Cite?

Hariteja Bodepudi, "Data Transfer Between RDBMS and HDFS by Using the Spark Framework in Sqoop for Better Performance," International Journal of Computer Trends and Technology, vol. 69, no. 3, pp. 10-13, 2021. Crossref, 10.14445/22312803/IJCTT-V69I3P103

Abstract
The usage of the Internet and IoT devices has increased greatly in recent years, and with it the volume of data generated every day. Data has grown from terabytes to petabytes, which traditional database systems cannot store and process; such data is commonly referred to as Big Data. Big Data requires large storage capacity, which is expensive to provide with traditional systems. Companies need low-cost commodity hardware with high reliability, which can be achieved with the Hadoop Framework. Organizations have therefore moved to the Hadoop Ecosystem to store and process large volumes of data and gain more insight from it. Traditionally, data was stored in RDBMS, i.e., Relational Database Management Systems. To move this data into the Hadoop Ecosystem, a tool called Sqoop became prominent for both importing data from RDBMS to Hadoop and exporting it from Hadoop back to RDBMS. This paper addresses the importance of Sqoop and its functionality: how it handles large data sets and is used as an ETL tool to transfer data between RDBMS and the Hadoop platform, i.e., HDFS (Hadoop Distributed File System). The paper also provides recommendations on how to increase performance and reduce the latency of existing Sqoop processing by using the Spark Framework.
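The parallel transfer the abstract alludes to rests on one idea: Sqoop splits the range of a numeric split-by column into contiguous sub-ranges, one per mapper, so each parallel task imports a disjoint slice of the table. The following is a minimal, hypothetical Python sketch of that range-splitting idea; the function name and interface are illustrative, not Sqoop's actual code:

```python
def split_ranges(min_key, max_key, num_mappers):
    """Divide the inclusive key range [min_key, max_key] into contiguous
    sub-ranges, one per mapper, similar to how Sqoop's --split-by option
    partitions an import across parallel tasks."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_key
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than rows; remaining mappers get no work
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Example: keys 1..100 split across 4 mappers
print(split_ranges(1, 100, 4))
# → [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Conceptually, each (lo, hi) pair becomes a WHERE clause (e.g. `key >= lo AND key <= hi`) in one mapper's query, after a boundary query has fetched the table's min and max key values.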

Keywords
Hadoop; HDFS; Sqoop; MapReduce; Spark
