Mastering Big Data Formats: ORC, Parquet, Avro, Iceberg, and the Strategy of Selection

Srinivasa Rao Nelluri; Flavia Ann Albert Saldanha

doi:10.14445/22312803/IJCTT-V73I1P105

Research Article | Open Access

Volume 73 | Issue 1 | Year 2025 | Article Id. IJCTT-V73I1P105 | DOI : https://doi.org/10.14445/22312803/IJCTT-V73I1P105

Received 19 Nov 2024 | Revised 26 Dec 2024 | Accepted 14 Jan 2025 | Published 30 Jan 2025

Citation:

Srinivasa Rao Nelluri, Flavia Ann Albert Saldanha, "Mastering Big Data Formats: ORC, Parquet, Avro, Iceberg, and the Strategy of Selection," International Journal of Computer Trends and Technology (IJCTT), vol. 73, no. 1, pp. 44-50, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I1P105

Abstract

As data volumes grow and data arrives continuously, managing and optimizing extremely large and complex datasets is a major challenge for organizations. Controlling storage costs while maintaining performance and efficiency becomes key, especially with Big Data datasets. Data teams therefore need the right data format and framework strategy from the outset in order to design and develop robust, efficient, and sustainable processes around this data. A poor choice of data format can hurt operational and analytical consumption, leading to a low return on investment in the data. This paper explores the characteristics, advantages, and limitations of several prominent file and table formats in big data ecosystems: ORC, Parquet, Avro, Iceberg, and others. Each format is evaluated against key criteria, including storage efficiency, query performance, schema evolution, and compatibility across platforms and analytical engines. By analyzing these formats in practical scenarios, the paper provides a decision matrix to guide data engineers, architects, and analysts in selecting the most suitable format for their workload and infrastructure requirements. This comparative analysis serves as a strategic resource for organizations that need to make informed, efficient, and scalable choices in their big data environments.

Keywords

Apache Kafka, Apache Hadoop Distributed File System (HDFS), Apache Flink, Delta Lake, Snowflake.
