Data Transfer Between RDBMS and HDFS By Using the Spark Framework in Sqoop for Better Performance

Hariteja Bodepudi

doi:10.14445/22312803/IJCTT-V69I3P103

Research Article | Open Access

Volume 69 | Issue 3 | Year 2021 | Article Id. IJCTT-V69I3P103 | DOI : https://doi.org/10.14445/22312803/IJCTT-V69I3P103

Received: 18 Jan 2021 | Revised: 05 Mar 2021 | Accepted: 08 Mar 2021

Citation :

Hariteja Bodepudi, "Data Transfer Between RDBMS and HDFS By Using the Spark Framework in Sqoop for Better Performance," International Journal of Computer Trends and Technology (IJCTT), vol. 69, no. 3, pp. 10-13, 2021. Crossref, https://doi.org/10.14445/22312803/IJCTT-V69I3P103

Abstract

The use of the Internet and IoT devices has grown sharply in recent years, and the volume of data generated grows with it day by day. Data volumes have increased from terabytes to petabytes, which traditional database systems cannot store and process. This data is often referred to as Big Data. Big Data requires large storage capacity, which is expensive to provide. Companies need storage that is inexpensive and highly reliable and that runs on low-cost commodity hardware, requirements that the Hadoop framework can meet. Organizations have therefore started moving to the Hadoop ecosystem to store and process large volumes of data and gain more insight from it. Traditionally, data was stored in an RDBMS (Relational Database Management System). To move this data into the Hadoop ecosystem, a tool called Sqoop has become prominent for both importing data from an RDBMS into Hadoop and exporting it from Hadoop back to an RDBMS. This paper addresses the importance of Sqoop and describes how it handles large data sets and serves as an ETL tool for transferring data from an RDBMS to the Hadoop platform, i.e., HDFS (Hadoop Distributed File System), and vice versa. It also provides recommendations on how to increase the performance and reduce the latency of existing Sqoop processing by using the Spark framework.

Keywords

Hadoop; HDFS; Sqoop; MapReduce; Spark
