Asynchronous Data Stream Ingestion in Distributed Cloud Infrastructure

© 2024 by IJCTT Journal
Volume-72 Issue-1
Year of Publication : 2024
Authors : Vinay Gupta
DOI : 10.14445/22312803/IJCTT-V72I1P102

How to Cite?

Vinay Gupta, "Asynchronous Data Stream Ingestion in Distributed Cloud Infrastructure," International Journal of Computer Trends and Technology, vol. 72, no. 1, pp. 8-11, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I1P102

Abstract
Businesses use cloud infrastructure to boost performance and cut costs, making it a crucial engine of today’s agile software ecosystem. Given the staggering amount of data generated daily, businesses have started offloading most of their data to the cloud. However, cloud providers are struggling to keep pace with the exponentially growing volume of incoming data, which pushes them to find innovative ways to scale ingestion effectively. This paper discusses one such solution: scaling the ingestion of asynchronous data streams with distributed, Kafka-based ingestion. I highlight the relevant components of a Kafka cluster, identify the bottleneck in naïve data ingestion, and show how Kafka can scale ingested data streams to billions per day. I also discuss how Kafka clusters are used in distributed cloud infrastructure and why asynchronous data is a good candidate for Kafka-based ingestion. Kafka has significantly increased the scale of data streams a public cloud provider can ingest.

Keywords
Apache Kafka, Asynchronous data streams, Cloud infrastructure, Distributed messaging queues.
