Serverless ETL: Leveraging AWS Glue and PySpark for Efficient Data Processing

Dharanidhar Vuppu; Mounica Achanta

doi:10.14445/22312803/IJCTT-V73I7P109

Research Article | Open Access

Volume 73 | Issue 7 | Year 2025 | Article Id. IJCTT-V73I7P109 | DOI: https://doi.org/10.14445/22312803/IJCTT-V73I7P109

Received	Revised	Accepted	Published
03 Jun 2025	26 Jun 2025	18 Jul 2025	29 Jul 2025

Citation:

Dharanidhar Vuppu, Mounica Achanta, "Serverless ETL: Leveraging AWS Glue and PySpark for Efficient Data Processing," International Journal of Computer Trends and Technology (IJCTT), vol. 73, no. 7, pp. 73-80, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I7P109

Abstract

In today's cloud-native data landscape, data engineers are expected to build ETL pipelines that scale with demand, remain easy to maintain, and stay within budget. As data volumes grow and business needs evolve, traditional ETL setups, typically run on provisioned clusters, can become a bottleneck: they tend to bring over-provisioned resources, ongoing infrastructure upkeep, and complicated scaling mechanisms. This paper explores a serverless approach using AWS Glue and PySpark that aims to simplify ETL development while significantly reducing operational complexity. We present a hands-on implementation of a serverless ETL setup that uses AWS Glue's built-in orchestration, Spark-based distributed processing, and tight integration with the AWS Glue Data Catalog for schema management. The approach simplifies ingesting and transforming data from sources such as S3 and RDS, shortens setup time, and scales without manual tuning. Through a real-world case study, we benchmark AWS Glue's performance, scalability, and cost-efficiency against traditional Spark clusters hosted on EC2. The results show tangible benefits in time-to-value, fault tolerance, and operational simplicity, particularly for mid-sized batch processing workloads. The paper concludes with practical considerations, limitations, and lessons learned from adopting serverless ETL, offering guidance for data engineers looking to modernize their pipelines with fully managed, cloud-native solutions.
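
To make the setup described in the abstract concrete, the following is a minimal sketch of what a Glue PySpark job of this kind might look like. It is not the authors' implementation. The database name `sales_db`, the table name `raw_orders`, the S3 path, and the record fields `customer_id` and `amount` are all hypothetical placeholders. The per-record transform is kept as a plain function so that it can be checked locally without the `awsglue` library, which is only available inside the Glue runtime.

```python
import sys


def normalize_order(record):
    """Per-record transform. Field names are hypothetical placeholders."""
    record["customer_id"] = str(record["customer_id"]).strip()
    record["amount"] = round(float(record["amount"]), 2)
    return record


def main():
    # Glue-only imports live inside main() so that normalize_order can be
    # tested on a machine that does not have the awsglue library installed.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Map
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table whose schema is registered in the Glue Data Catalog.
    # Database and table names are assumptions made for this sketch.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )

    # Apply the transform to every record of the DynamicFrame.
    cleaned = Map.apply(frame=orders, f=normalize_order)

    # Write the result to S3 as Parquet. The bucket path is a placeholder.
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet",
    )
    job.commit()


# Glue passes --JOB_NAME on the command line; the guard keeps local imports
# of this file from trying to start a job.
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    main()
```

Separating the transform from the Glue entry point is a design choice for this sketch: the business logic can be unit-tested as ordinary Python, while the catalog read and S3 write only run when the script is deployed as a Glue job.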
