
sources to improve customer understanding and
support strategic decision-making. Integrating
these diverse data sources is essential for
actionable insights, but the task is complex.
Traditional databases and data warehouses, the
primary solutions for business intelligence, offer
structured storage and processing but come with
significant drawbacks, including high costs for on-
premises hardware, proprietary data formats, and
reliance on centralized IT departments for
analysis [1]. They also struggle to scale to meet the
increasing demand for real-time data analytics and
the massive volume, velocity, and variety of big
data [1]. These limitations set the stage for
alternative solutions.
2.1. The Rise of Big Data and Hadoop
Hadoop, introduced in 2006, enabled parallel
processing and large-scale data handling through the
MapReduce model across clusters of commodity
hardware, making it an attractive option for
handling growing data volumes [2]. The Hadoop
ecosystem helped organizations handle big data
more cost-effectively, but its complex architecture,
high IT resource demands, and focus on data
accumulation often resulted in data swamps, where
valuable data becomes difficult to find and manage.
Furthermore, cloud object storage services such as
Amazon S3, Azure Data Lake Storage (ADLS), and
Google Cloud Storage (GCS) offered more flexible,
cost-effective alternatives, highlighting Hadoop’s
operational and economic limitations [3].
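
To make the MapReduce model mentioned above concrete, the following is a minimal single-process sketch in Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are illustrative assumptions. In a real cluster, the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit one (word, 1) pair per word in a single input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key before reduction."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(run_job(logs))
```

Because each record is mapped independently and each key is reduced independently, both phases can be distributed across many machines, which is the property that made the model suitable for clusters of commodity hardware.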
2.2. Transition to Open Data Architecture
Although Hadoop did not meet all expectations, it
laid the foundation for the open data ecosystem
based on openness, modularity, and diversity [4].
Today’s ecosystem benefits from cloud scalability,
query acceleration technologies, and open-source
data formats, making data accessible for business
analytics and AI/ML workloads at reduced costs.
Four main trends have simplified the adoption
of open data solutions. First, cloud data lakes,
like Amazon S3, ADLS, and GCS, enable scal-
able storage of structured and unstructured data
without on-premises hardware, supported by
decreasing cloud costs [3]. Second, open-source
file and table formats like Apache Parquet and
Apache Iceberg improve data compatibility across
platforms and help organizations reduce storage
costs [5].
Third, cloud-native vendors now offer managed
solutions, eliminating the need for costly hardware
investments while enabling pay-as-you-go models
for greater flexibility [3]. Finally, modern open data
tools empower users to operate at varying levels
of abstraction, allowing data analysts, scientists,
and business users to focus on insights rather than
database management.
The open data environment offers advantages
like cost-effective storage, scalability, and vendor
choice, mainly due to the separation of compute
and storage resources. Most solutions are
open-source or SaaS, easing integration and
management [6]. Additionally, data democratization
enables diverse users to access data using their
preferred tools, while processing engines like Spark
and Dremio can work with data stored in various
formats, even for enterprises with legacy systems [6].
2.3. Data Lakehouse Architecture
Traditional data warehouses excel at structured
data analysis but are costly and lack flexibility,
while data lakes provide low-cost, flexible storage
for raw data but often suffer from data quality issues
and risk degrading into data swamps [7]. Data
lakehouses integrate the benefits of both models,
supporting flexible storage of raw data while
ensuring data governance, quality, and ACID
compliance for analytics and machine learning
workloads [7].
Data lakehouses also facilitate efficient data
management by supporting diverse processing
engines and query tools, allowing seamless ac-
cess across various data science, analytics, and
business intelligence applications. This flexibility
empowers organizations to unify data storage
and processing workflows under one architecture,
reducing the need for complex data movement
and enabling real-time analytics on vast datasets.
Moreover, lakehouses often integrate with cloud-
based and open-source ecosystems, making them
adaptable to evolving business requirements and
technology landscapes.
In addition to unifying data storage and process-
ing, data lakehouse architectures organize data as
it flows through distinct stages, each designed to
enhance data structure, quality, and accessibility.
This layered approach, known as the medallion architecture,
ensures that data ingestion, processing, and
consumption are efficiently managed from raw
input to enriched, ready-for-analysis datasets [8].
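
The staged flow described above can be illustrated with a small in-memory sketch. It assumes the bronze, silver, and gold layer names commonly used with the medallion pattern, a hypothetical comma-separated record layout (`customer,region,amount`), and plain Python lists in place of lakehouse tables; a production pipeline would instead read from and write to governed tables on object storage.

```python
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze: keep every record exactly as received, adding only a line number."""
    return [{"line_no": i, "raw": line} for i, line in enumerate(raw_lines, start=1)]


def to_silver(bronze_rows):
    """Silver: parse, validate, and type records; rows that fail checks are dropped."""
    silver = []
    for row in bronze_rows:
        parts = [p.strip() for p in row["raw"].split(",")]
        if len(parts) != 3:
            continue  # wrong number of fields
        customer, region, amount = parts
        try:
            amount = float(amount)
        except ValueError:
            continue  # amount is not numeric
        if not customer:
            continue  # missing customer identifier
        silver.append(
            {"customer": customer, "region": region.upper(), "amount": amount}
        )
    return silver


def to_gold(silver_rows):
    """Gold: aggregate cleaned records into an analysis-ready summary per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    raw = ["alice,eu,10.5", "bob,us,abc", "carol,EU,4.5", "broken line", ",us,3"]
    bronze = to_bronze(raw)
    silver = to_silver(bronze)
    print(len(bronze), len(silver), to_gold(silver))
```

Keeping the bronze layer untouched means the silver and gold layers can be rebuilt whenever validation rules or aggregations change, without re-ingesting data from the source systems.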
Data ingestion frameworks leverage distributed
processing and open-source tools to scale data
pipelines efficiently, drawing data from diverse
sources and formats into the lakehouse. This
process typically begins by storing data in its
native format within a foundational layer, then
progressively transforming it through structured