
Pradeep Bhosale / ESP JETA 4(3), 170 -170, 2024
Streaming + Columnar: Emerging solutions handle near-real-time appends in columnar files, bridging the gap
between batch columnar and streaming ingestion.
Encrypted Data: Many data lakes require encryption at rest or in flight, so advanced formats might embed key
management at the file/column level [18].
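The streaming-plus-columnar idea above can be sketched as a simple micro-batching strategy. This is a minimal illustration under stated assumptions, not how any particular format or table layer implements it: incoming row events are buffered and, once a threshold is reached, pivoted into a column-oriented chunk (loosely analogous to a Parquet row group or an ORC stripe). The class name `ColumnarAppender`, the `flush_threshold` parameter, and the event fields are hypothetical, and no real file encoding is performed.

```python
from typing import Any, Dict, List


class ColumnarAppender:
    """Toy micro-batcher: buffers row events, flushes them as column chunks."""

    def __init__(self, columns: List[str], flush_threshold: int = 3) -> None:
        self.columns = columns
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self._buffer: List[Dict[str, Any]] = []  # pending row-oriented events
        self.chunks: List[Dict[str, List[Any]]] = []  # flushed columnar chunks

    def append(self, row: Dict[str, Any]) -> None:
        """Accept one row event; flush once the buffer reaches the threshold."""
        self._buffer.append(row)
        if len(self._buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Pivot the buffered rows into a single column-oriented chunk."""
        if not self._buffer:
            return
        chunk = {c: [row.get(c) for row in self._buffer] for c in self.columns}
        self.chunks.append(chunk)
        self._buffer = []


# Hypothetical event stream with made-up fields.
appender = ColumnarAppender(columns=["user_id", "amount"], flush_threshold=2)
for event in [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 3.0},
    {"user_id": 3, "amount": 7.25},
]:
    appender.append(event)
appender.flush()  # flush the trailing partial batch
```

A smaller threshold lowers the delay before data becomes queryable but produces many small chunks, which tends to hurt compression and scan efficiency. Balancing the two is the core trade-off that near-real-time columnar ingestion has to manage.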
XIV. CONCLUSION
Avro, Parquet, and ORC each represent distinct solutions to the evolving demands of big data storage:
Avro: Row-based, simple schema evolution, well-suited for streaming ingestion or row-level events.
Parquet: Columnar and widely supported, with efficient compression and broad Spark integration; ideal for partial
column queries in data lakes.
ORC: Another columnar format, with advanced indexing and bloom filters, and tight integration with Hive-based ecosystems.
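The row-based versus columnar distinction summarized above can be illustrated with a toy sketch in plain Python. It does not use the real Avro, Parquet, or ORC encodings. The records, field names, and the two helper functions are hypothetical, and the "values touched" count is a simplified stand-in for I/O cost: it only shows how many values a reader visits to answer a single-column query under each layout.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical records with made-up fields.
records: List[Dict[str, Any]] = [
    {"order_id": 1, "country": "IN", "amount": 20.0},
    {"order_id": 2, "country": "US", "amount": 35.5},
    {"order_id": 3, "country": "IN", "amount": 12.0},
]
field_names: List[str] = list(records[0].keys())

# Row layout: all fields of one record are stored together.
row_store: List[Tuple[Any, ...]] = [
    tuple(r[name] for name in field_names) for r in records
]

# Column layout: all values of one field are stored together.
column_store: Dict[str, List[Any]] = {
    name: [r[name] for r in records] for name in field_names
}


def sum_amount_row_layout() -> Tuple[float, int]:
    """Sum 'amount' from the row layout; count every value in each row read."""
    idx = field_names.index("amount")
    total, touched = 0.0, 0
    for row in row_store:
        touched += len(row)  # simplification: the whole row is visited
        total += row[idx]
    return total, touched


def sum_amount_column_layout() -> Tuple[float, int]:
    """Sum 'amount' from the column layout; only that column is visited."""
    column = column_store["amount"]
    return sum(column), len(column)


row_total, row_touched = sum_amount_row_layout()
col_total, col_touched = sum_amount_column_layout()
```

Both layouts return the same total, but the row layout visits nine values and the column layout three. The gap grows with the number of columns, which is why wide analytical tables favor columnar formats while write-heavy, record-at-a-time workloads favor row-based ones.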
Selecting the right format depends on the workload: streaming vs. batch analytics, partial column queries vs. row
writes, or advanced indexing needs vs. simpler ingestion. In practice, many organizations adopt a multi-format approach,
using Avro for ingestion and Parquet/ORC for final analytical storage. By understanding each format’s architecture,
performance trade-offs, and best usage patterns, data engineers and platform architects can shape robust pipelines that
address the scale and complexity of modern data systems. As data ecosystems continue to evolve, new solutions (like
Apache Iceberg, Delta Lake, or advanced merges) may further unify or extend the capabilities of these file formats.
Nonetheless, Avro, Parquet, and ORC remain foundational pillars for high-performance data storage in an era of
distributed analytics and large-scale computing.
XV. REFERENCES
[1] Fowler, M. and Lewis, J., “Microservices Resource Guide,” martinfowler.com, 2016.
[2] Newman, S., Building Microservices, O’Reilly Media, 2015.
[3] Apache Avro Documentation, https://avro.apache.org/, Accessed 2022.
[4] Amazon Whitepaper, “Understanding CAP Theorem in NoSQL,” 2020.
[5] Parquet Documentation, https://parquet.apache.org/, Accessed 2021.
[6] Avro Schema Evolution Guide, Confluent Blog, 2019.
[7] Netflix Tech Blog, “Parquet in a Large-Scale Production Environment,” 2018.
[8] Databricks Blog, “Predicate Pushdown in Parquet for Spark SQL,” 2020.
[9] Brandolini, A., Introducing EventStorming, Leanpub, 2013.
[10] ORC Documentation, https://orc.apache.org/, Accessed 2021.
[11] Blum, A. and Mansfield, G., “Indexing Columnar Data in ORC,” ACM Queue, 2019.
[12] CNCF Whitepaper, “Data Formats in Cloud-Native Analytics,” 2021.
[13] Cockcroft, G., “Comparing Avro vs. Parquet for Analytical Queries,” ACM DevOps Conf, 2019.
[14] Turnbull, M., The Data Lake Book, Independently Published, 2020.
[15] Gilt Tech Blog, “Hive and Columnar Evolution Patterns,” 2018.
[16] Krishnan, S., “E-Commerce Data Lake Architecture: Avro Ingestion, Parquet Analytics,” IEEE Software, vol. 35, no. 2, 2019.
[17] Blum, A. et al., “Logistics and Graph Queries with ORC Backed Aggregates,” ACM SoCC Workshops, 2020.
[18] Netflix Tech Blog, “Future of Columnar Data in 2024,” 2022.