Academic Editors: Haridimos Kondylakis, Yuan Tian, Le Sun, Vidyasagar Potdar and Biao Song
Received: 27 August 2025
Revised: 28 September 2025
Accepted: 22 October 2025
Published: 26 October 2025
Citation: Koukaras, P. Data Integration and Storage Strategies in Heterogeneous Analytical Systems: Architectures, Methods, and Interoperability Challenges. Information 2025, 16, 932. https://doi.org/10.3390/info16110932
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Review
Data Integration and Storage Strategies in Heterogeneous Analytical Systems: Architectures, Methods, and Interoperability Challenges
Paraskevas Koukaras
School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece; p.koukaras@ihu.edu.gr
Abstract
In an era of near-universal data accessibility, organisations face highly complex challenges in integrating and processing diverse datasets to meet their analytical needs. This review analyses traditional and innovative methods for data storage and integration, with particular focus on their implications for scalability, consistency, and interoperability within an analytical ecosystem. In particular, it contributes a cross-layer taxonomy linking integration mechanisms (schema matching, entity resolution, and semantic enrichment) to storage/query substrates (row/column stores, NoSQL, lakehouse, and federation), together with comparative tables and figures that synthesise trade-offs and performance/governance levers. The review covers schema mapping solutions that address structural heterogeneity, storage architectures ranging from traditional systems to cloud storage, and ETL pipeline integration using federated query processors. It pays specific attention to metadata management, with a focus on semantic enrichment using ontologies and on lineage management to enable end-to-end traceability and governance. It also covers performance hotspots and caching techniques, along with the consistency trade-offs that arise in distributed systems. Empirical case studies from real applications in enterprise lakehouses, scientific exploration, and public governance ground the review. The paper closes with future directions: convergent analytical platforms that support multiple workloads, and metadata-centric orchestration with provisions for AI-based integration. By combining technological advances with practical considerations, the review aims to be an enabling resource for researchers and practitioners who build fault-tolerant, reliable, and future-ready data infrastructure. It is primarily aimed at researchers, system architects, and advanced practitioners who design and evaluate heterogeneous analytical platforms. It also offers value to graduate students as a structured overview of contemporary methods, bridging academic knowledge with industrial practice.
Keywords: data integration; data storage; metadata management; heterogeneous
systems; schema mapping; lakehouse architecture; federated querying; data lineage; data
interoperability; pipeline orchestration
1. Introduction
Modern analytics operates over a heterogeneous collection of formats, systems, and semantics, introducing integration challenges that are poorly served by purely schema-centric or storage-centric approaches. This section introduces the background to the research topic. It highlights the importance of defining consistent methodologies that bridge heterogeneity-aware integration and scalable storage, and it clarifies the relevant parameters and research contributions.
1.1. The Growing Heterogeneity in Analytical Ecosystems
As the value of data has become more salient, organisations increasingly rely on a heterogeneous set of data sources to support decision-making, knowledge gathering, and machine learning. These sources include traditional relational databases, key-value stores, document-oriented stores, semi-structured formats like JavaScript Object Notation (JSON) and Extensible Markup Language (XML), and public cloud storage solutions like Google Cloud Storage and Amazon Simple Storage Service (S3), along with structured flat files in columnar formats like Apache Parquet and ORC [1–4]. The diversity of these sources stems not only from storage structure and supported query methods but also from differences in schemata, data representation, access protocols, and update characteristics. This diversity is a significant obstacle to data management and integration, mainly because end users expect uniform, high-quality outputs from environments that are both highly heterogeneous and highly distributed [4–6].
Researchers, engineers, and data scientists regularly deal with heterogeneous data pipelines, and specifically with the problem of integrating data from multiple storage systems, models, and schemas. Despite significant advances in scalable storage and high-throughput query technologies, many systems still suffer from schema mismatches, redundancy, misleading metadata, and data semantics that vary across sources. These issues are prominent in enterprise data lakes, federated information infrastructures, scientific research environments, and analytical platforms that span multiple operational areas. Their growing occurrence in hybrid cloud infrastructure highlights the need for more advanced integration methods, along with adaptive storage solutions that offer flexibility with regard to changing schemas, reduced redundancy, and semi-structured data [5,7,8].
1.2. Motivation for Unified Integration and Storage Strategies
A holistic understanding of integration approaches, comprising schema matching, instance alignment, and data fusion, is needed to meet the challenges posed by diverse data types. The need is especially pronounced when such approaches are combined with modern storage platforms like distributed file systems, Not Only SQL (NoSQL) stores, and lakehouse implementations [3,9–11]. Traditional integration approaches focused on datasets with static relational schemas. That focus is no longer sufficient for workloads that require flexible processing of dynamic, semi-structured, and rapidly changing datasets.
Storage innovation has brought new trade-offs between performance, flexibility, and consistency. Schema-on-read querying, late-binding policies, and federated access protocols allow data to be queried without upfront data modelling but generally weaken semantic expressiveness or operational efficiency [5,6]. Data integration solutions therefore need to be aligned with the storage infrastructure to achieve scalability, interoperability, and reliable analytics [7,8,12].
This study addresses a question of value to researchers and practitioners alike: it supports effective decision-making in system development, helps identify integration challenges that arise from discipline-specific differences in infrastructure, and promotes best practices for managing analytical processes across varied infrastructural environments.
1.3. Scope and Contributions of the Review
This research broadly reviews the integration and storage methods used within various analytical architectures, giving balanced attention to structured and semi-structured data. It covers schema-based approaches such as schema matching, Entity Resolution (ER), and semantic enrichment, as well as architecture-centric strategies involving columnar database systems, key-value stores, cloud storage solutions, and integrated query processing systems [1–3,7,9].
Modern methodologies and frameworks are categorised according to their roles in
the data management continuum, including features that range from data acquisition to
storage setup, metadata management, and cross-source query optimisation. This analysis
is centred on the solutions offered by these methodologies with respect to heterogeneity,
scalability, and interoperability concerns, as well as regarding the trade-offs that arise in
real-world deployments [5,6].
This work addresses three main aspects: (i) a synthesis of data integration and storage paradigms in heterogeneous analytical contexts, drawing connections between schema-level and infrastructure-level heterogeneity; (ii) a comparative analysis of representative systems and tools, highlighting their functionalities, limitations, and integration strategies; and (iii) identification of outstanding challenges and potential paths, including opportunities for metadata-driven integration, schema evolution management, and AI-enabled interoperability. Beyond these three aspects, this work offers four substantial contributions.
• It introduces an end-to-end cross-layer taxonomy that distinguishes integration mechanisms such as schema matching, entity resolution, and semantic enrichment across a heterogeneous collection of storage methods and access models, including row/column stores, the main families of NoSQL stores, cloud-native columnar stores, and lakehouse systems. This taxonomy illustrates the interplay of workloads and architectures and their effects on governance and performance. By reading Sections 2 and 3 together, practitioners can map a data unification task to its corresponding storage layer and query format.
• Operational planning of ETL, ELT, and data virtualisation/federation is laid out by considering popular process models and tools, followed by an analysis of the impact of optimisers and pushdown methods in integrated SQL engines. This yields a pragmatic architecture for balancing latency, cost, and data freshness across varying backends.
• Metadata, lineage, and schema evolution are highlighted as central constructs for ensuring reproducibility and dependability, and performance enhancements such as caching/materialisation and freshness/consistency control are formalised in a checklist to support architectural decision-making.
• The findings are then consolidated with cross-domain examples from the enterprise lake/lakehouse, scientific integration, and public analytics domains, which leverage pragmatic patterns for managed, heterogeneous schema-on-read/write pipelines.
These developments integrate embedded methodologies and infrastructure into a
unified, decision-focused view for data teams and architects.
Positioning vs. Prior Surveys
Relative to recent surveys on data lakes [5], data federation [6], lakehouses [13], unified analytics [14], and data integration [9], Table 1 contrasts scope and adds a side-by-side view of what this synthesis contributes beyond these works.
This review contributes (i) a cross-layer taxonomy that explicitly couples integration mechanisms (schema matching, entity resolution, and semantic enrichment) with storage and access models (row/column stores, NoSQL, lakehouses, and federation), operationalised via an application cross-walk (Section 7); (ii) working, reproducible resolution workflows for two canonical heterogeneity problems (schema name mismatch and instance-level date ambiguity), with concrete enforcement/lineage policies (Section 2.4); (iii) a governance-aware treatment that binds performance levers (pushdown and materialisation/caching) to data-quality SLAs, freshness, and consistency (Section 6); and (iv) to the best of the author's knowledge, the first synthesis that integrates peer-reviewed 2024–2025 evidence on AI-assisted integration (LLM-based schema/ontology matching and cost-aware/active ER) into an architectural decision frame [15–20].
This positioning clarifies how this survey complements system-focused reviews by
providing design-level guidance that traverses schema, semantics, storage, and operations
by translation into domain patterns (Section 7).
Table 1. Positioning against prior surveys: scope, coverage, and added value of this review.
Survey | Primary Focus | Coverage | What This Review Adds
[5] | Data lake functions/systems | Ingestion, metadata, governance, and lake infrastructure | Cross-layer linkage of schema/ER/semantics with storage + performance/governance levers
[6] | Federation architectures and capabilities | Connectors, query translation, and execution models | Integration of federation with storage models, ETL/ELT, and performance tuning
[13] | Lakehouse concepts/technologies | Transactional tables, open formats, and hybrid lake/warehouse design | Comparative role of lakehouses among integration strategies; lineage/governance integration
[14] | Unified analytics vision | Fabrics, abstractions, and open challenges | Operational taxonomy, workflows, domain case studies, and incorporation of 2024–2025 AI instruments
[9] | Big data integration foundations | Schema matching, mapping, and fusion | Updated synthesis with recent metadata catalogues, lakehouses, and federated engines
The goal of this work is to act as a reference point for data engineers, system architects, and researchers working at the intersection of data analytics and data management, connecting theoretical integration models with the technical frameworks used for storing and retrieving data. The intended audience is threefold: (i) researchers benefit from a synthesis of taxonomies, comparative evaluations, and identification of open challenges; (ii) practitioners and system architects gain guidance on trade-offs, architectural patterns, and integration strategies applicable to real deployments; and (iii) students obtain an accessible yet rigorous entry point into heterogeneous data integration and storage, which complements more specialised literature. This multi-tiered orientation ensures that the review delivers concrete utility across scholarly, professional, and pedagogical contexts.
Finally, to make the taxonomy directly actionable, this work explicitly connects its layers to the application domains analysed in Section 7. Concretely, (i) integration mechanisms (schema matching, entity resolution, and semantic enrichment; Sections 2.2 and 2.3) are mapped to enterprise lakehouse harmonisation of customer and product identifiers; (ii) metadata, lineage, and ontology scaffolding (Section 5) underpin scientific workflows for reproducibility and semantic alignment; and (iii) hybrid schema strategies and federated access (Sections 4.1–4.3) address governance and inter-agency interoperability in public-sector pipelines.
1.4. Review Methodology
This work is conceived as a structured survey rather than a systematic review. Accordingly, it did not aim at exhaustive coverage of all publications but, instead, sought to capture representative methods, architectures, and trends that define the state of practice in heterogeneous analytical systems.
The literature selection followed three guiding principles. First, it focused on peer-reviewed venues where novel data integration and storage strategies are usually published, including ACM SIGMOD, VLDB, and IEEE ICDE, and journals such as TKDE, VLDBJ, and Information Fusion. Second, the time span of 2015–2025 was targeted to reflect the transition from early data-lake and NoSQL adoption to recent lakehouse and AI-assisted integration methods while retaining seminal pre-2015 contributions for context (e.g., schema matching [21], data fusion [22], and provenance [23]). Third, an inclusion criterion was applied that required explicit linkage between integration mechanisms (schema matching, ER, and semantics) and storage/query substrates (row/column, NoSQL, lakehouse, and federation), as well as relevance to metadata, lineage, governance, or performance trade-offs.
Searches were performed in ACM Digital Library, IEEE Xplore, DBLP, and Google
Scholar using combinations of terms such as “data lake”, “lakehouse”, “federated query”,
“schema matching”, “entity resolution”, and “metadata catalog”. Reference snowballing
from key surveys (e.g., [5,6,13,14]) ensured coverage of influential works.
Finally, to structure the review, the selected literature was organised into a cross-layer taxonomy spanning schema/ER/semantic integration (Section 2), storage models and architectures (Section 3), integration strategies (Section 4), metadata/lineage/semantics (Section 5), performance levers (Section 6), and application domains (Section 7). This taxonomy provides a balanced view that prioritises clarity, representativeness, and analytical depth over completeness.
2. Foundations of Data Integration
This section identifies the schema, semantic, and instance levels as sources of heterogeneity and then presents the methodologies proposed to overcome them, such as schema matching and mapping, entity resolution, and data fusion. These are the basic terms and central notions used throughout the review.
2.1. Types of Heterogeneity: Schema, Semantic, Instance Levels
Data integration is the process of combining information from independent sources into a unified scheme. It must contend with several kinds of heterogeneity [5,21]. Schema heterogeneity occurs when different datasets use differing structures or schemas to represent the same or similar data. For instance, two databases can refer to customer details using different attribute names (like customer_id and client_no) or different types (like string and integer for identifiers). Harmonising these structural variations often forms a considerable part of the first phase of the integration process [21].
Besides schema mismatches, semantic heterogeneity arises when data elements differ in meaning even though their structure or name is similar. A classic example is one application using the word salary for gross annual earnings while another uses it for net monthly take-home earnings. Addressing semantic heterogeneity calls for an understanding of context, often aided by metadata, external ontologies, or domain-specific vocabularies [23,24].
Instance-level heterogeneity refers to differences in values across sources. These can take the form of format differences (1/12/2025 vs. 2025-12-01), unit mismatches (kilometres vs. miles), or identifier disagreements (customer IDs varying across systems). Such differences hinder dataset integration and necessitate instance-level reconciliation methods such as normalisation, transformation, or record linkage [25,26]. The combination of multiple types of heterogeneity poses significant challenges, particularly when semi-structured and even unstructured data are integrated with traditional structured databases [5]. Next-generation data integration platforms need to handle the dynamic alignment of schemas, enable semantic mediation, and record the reconciliation of records, while maintaining data integrity and provenance [23,27].
For pedagogical clarity, the resolution steps are also made explicit. Typical schema name mismatches (e.g., customer_id vs. client_no) are first detected via semi-automated schema matching (lexical/structural/instance evidence), then compiled into executable mappings (SQL/XQuery) and validated against gold standards (ground truth curated by domain experts) or ontology-backed correspondences. Instance-level ambiguities (e.g., the date “01/12/2025”) are eliminated through schema-on-write or hybrid enforcement, which standardises canonical formats (e.g., ISO 8601 [28]) at ingestion and validates them via metadata-driven checks. Where legacy sources cannot be conformed at write time, a compensating read-time normaliser is applied, and lineage is recorded.
Table 2 summarises the main forms of heterogeneity encountered in data integration, illustrating schema, semantic, and instance-level challenges with examples.
Table 2. Types of heterogeneity in data integration, with examples and resolution approaches.
Heterogeneity Type | Description and Example | Resolution Approach
Schema heterogeneity | Differing data schemas (e.g., customer_id vs. client_no) | Schema mapping and translation [21]
Semantic heterogeneity | Same term has different meanings (e.g., salary = gross vs. net) | Ontology mapping [24]
Instance-level heterogeneity | Different formats/values (e.g., dates 01/12/2025 vs. 2025-12-01) | Data cleaning and normalisation [25]
2.2. Schema Matching and Map Generation
Schema matching is the activity of identifying correspondences between the elements of different data schemas. It is a basic activity for data integration, allowing heterogeneous datasets to be combined into an integrated frame of reference [21,29]. Schema matching algorithms can broadly be classified into three main categories: name-based, structure-based, and instance-based methods (summarised in Table 3). Recent AI-driven instruments demonstrate that LLM prompting over schema documentation can bootstrap high-quality correspondences in instance-restricted domains and reduce verification effort, exceeding purely lexical baselines [30]. Beyond instance-free discovery, fine-tuned LLMs have been used to materialise executable mappings against standard information models in industrial settings, thereby strengthening semantic interoperability [31].
Table 3. Categories of schema matching algorithms, with matching strategies and examples.
Matcher Type | Matching Strategy | Representative Work
Name-based (lexical) | Compares schema element names (string similarity and synonyms) | [32]
Structure-based | Exploits schema structure (hierarchies and parent–child structures) | [29]
Instance-based | Uses actual data values (overlapping ranges and distributions) | [21]
Lexical matchers rely on lexical relationships between names at both the attribute and table levels. String distance measures, including edit distance computed by dynamic programming and set-overlap measures, may be combined with synonyms drawn from linguistic resources to determine potential matches [32].
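As an illustration of the lexical evidence described above, the following minimal sketch combines a dynamic-programming edit distance with a synonym-aware token overlap. The synonym table, the equal weighting of the two signals, and the function names are hypothetical choices made for this example and are not taken from the cited matchers.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical synonym table; a real matcher would consult a linguistic resource.
SYNONYMS = {"client": "customer", "no": "id", "num": "id", "number": "id"}


def tokens(name: str) -> set[str]:
    """Split an attribute name into lower-case tokens and canonicalise synonyms."""
    parts = re.split(r"[_\W]+", name.lower())
    return {SYNONYMS.get(p, p) for p in parts if p}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set-overlap measure: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def lexical_score(x: str, y: str) -> float:
    """Average of normalised edit similarity and synonym-aware token overlap."""
    edit_sim = 1 - edit_distance(x.lower(), y.lower()) / max(len(x), len(y), 1)
    return 0.5 * edit_sim + 0.5 * jaccard(tokens(x), tokens(y))


if __name__ == "__main__":
    for candidate in ("client_no", "order_date"):
        print(candidate, round(lexical_score("customer_id", candidate), 3))
```

In this sketch the pair customer_id/client_no scores higher than customer_id/order_date only because the assumed synonym table maps client to customer and no to id; the raw edit distance alone would separate the two candidates far less clearly.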
Structure-based matchers exploit the inherent relationships between the elements of a
schema, such as type constraints, hierarchies, and foreign keys, to boost the effectiveness
of a match. For example, tables that share similar parent–child relationships or similar
referential patterns could be considered structurally equivalent [29].
Instance-based matchers evaluate the compatibility of the actual data values stored under each schema. Evidence of semantic similarity between schema elements can come from measures like the intersection of value ranges, shared value distributions, or statistical interdependencies. Once correspondences are identified, mapping generation produces transformation rules that specify how source schemas are converted into the target schema. These mappings can be specified declaratively using languages like the Structured Query Language (SQL), XQuery, or Datalog, or built visually using ETL software [29]. In more demanding scenarios, mapping generation is performed with semi-automated or interactive software, which still requires human intervention to validate or fine-tune the proposed recommendations, especially in environments where precision is of high value.
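The instance-level evidence mentioned above can be sketched with two simple measures: the overlap of distinct values and the overlap of numeric value ranges. The column samples and function names below are invented for illustration, and the sketch assumes non-empty numeric columns; production matchers would typically add distributional tests and sampling.

```python
def value_overlap(col_a, col_b) -> float:
    """Jaccard overlap of the distinct values found in two columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def range_overlap(col_a, col_b) -> float:
    """Share of the combined numeric range covered by both columns.

    Assumes both columns are non-empty and numeric.
    """
    lo = max(min(col_a), min(col_b))
    hi = min(max(col_a), max(col_b))
    span = max(max(col_a), max(col_b)) - min(min(col_a), min(col_b))
    if span == 0:
        return 1.0  # both columns hold a single identical value
    return max(0.0, hi - lo) / span


# Hypothetical column samples from two source systems.
customer_id = [101, 102, 103, 104]
client_no = [103, 104, 105, 106]
order_total = [19.9, 250.0, 1200.5]

if __name__ == "__main__":
    print("customer_id vs client_no:",
          value_overlap(customer_id, client_no),
          range_overlap(customer_id, client_no))
    print("customer_id vs order_total:",
          value_overlap(customer_id, order_total),
          range_overlap(customer_id, order_total))
```

Scores of this kind are evidence rather than decisions: a high overlap suggests a candidate correspondence that a human reviewer or a downstream validation step still has to confirm.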
Schema matching solutions that use AI methods like embeddings and neural models have been reported in recent literature [33]. These studies suggest that high-quality mappings can be produced with limited manual intervention, which helps integration efforts that must accommodate heterogeneous sources, inconsistent naming conventions, and limited documentation. Foundational mapping systems such as Clio demonstrate mapping creation and data exchange at scale [34].
2.3. Entity Resolution and Data Fusion
Even when schemas are aligned, a major challenge remains in ER, also known as record linkage, duplicate detection, or reference reconciliation. ER identifies and links records that refer to the same real-world entity across multiple platforms. For example, customer data gathered from more than one Customer Relationship Management system can use varying identifiers, spellings, or address patterns for a single individual [25]. Concurrently, cost-aware LLM prompting has been systematised. The authors of [15] designed batch prompting for ER that preserves accuracy while substantially reducing API cost. In engineering contexts, fine-tuned open-source LLMs used as ER classifiers surpass the prior state of the art (SOTA), and GPT-4 with in-context learning, on benchmark datasets, indicating practical headroom for LLM-driven ER beyond zero/few-shot architectures [31]. Moreover, active in-context learning improves cross-domain ER without task-specific fine-tuning, mitigating domain shift [16].
Most ER algorithms are built on a framework of similarity functions, blocking methods, and classification models [26]. Similarity functions assess particular attributes like names, dates, and locations. Blocking reduces the search space by partitioning records into subsets so that only records within the same subset are compared, which enables fast processing. Most classification approaches use supervised learning over previously extracted features to infer whether two records refer to the same entity. Recent surveys also emphasise probabilistic and Bayesian approaches for ER at scale [35].
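A minimal sketch of this three-part framework follows. It uses a hypothetical blocking key (city plus first letter of the name) and the standard-library difflib.SequenceMatcher as the similarity function. A fixed threshold stands in for the supervised classifier, which is a deliberate simplification; the records and the 0.85 threshold are invented for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from two CRM systems (prefixes A and B).
records = [
    {"id": "A1", "name": "Maria Papadopoulou", "city": "Thessaloniki"},
    {"id": "B7", "name": "Maria Papadopoulu", "city": "Thessaloniki"},
    {"id": "A2", "name": "John Smith", "city": "Athens"},
    {"id": "B9", "name": "Jon Smith", "city": "Athens"},
    {"id": "A3", "name": "Maria Papadaki", "city": "Athens"},
]


def block_key(record):
    """Blocking: only records sharing city and name initial are compared."""
    return (record["city"].lower(), record["name"][0].lower())


def candidate_pairs(recs):
    """Yield record pairs that fall into the same block."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[block_key(r)].append(r)
    for members in blocks.values():
        yield from combinations(members, 2)


def name_similarity(a, b) -> float:
    """Similarity function on the name attribute (ratio in [0, 1])."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def resolve(recs, threshold=0.85):
    """Threshold rule used here as a stand-in for a trained classifier."""
    return [(a["id"], b["id"])
            for a, b in candidate_pairs(recs)
            if name_similarity(a, b) >= threshold]


if __name__ == "__main__":
    print("candidate pairs:", len(list(candidate_pairs(records))))
    print("matches:", resolve(records))
```

With five records there are ten possible pairs, but the assumed blocking key leaves only two candidates to compare. The same mechanism explains why an overly strict blocking key can hide true matches before the similarity function ever sees them.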
Upon completion of the entity identification process, the next step is data fusion, which
is the combination of similar records into a single, unified representation. This involves the
resolution of inconsistencies inherent in the varying sets of attribute values. Techniques
used to enable this fusion include the following:
• Source prioritisation: prioritisation of a primary source over other sources;
• Aggregation or voting: the combination of many values using statistical methods;
• Provenance-aware fusion: the use of metadata to select values based on their timeliness, prevalence, or reliability.
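The three fusion techniques listed above can be contrasted on a single conflicting attribute. In this minimal sketch the source names, timestamps, and address values are hypothetical, majority voting stands in for the broader family of statistical aggregation, and recency is used as the only provenance signal. Each function returns the whole winning observation, so the source of the fused value stays traceable.

```python
from collections import Counter
from datetime import date

# Hypothetical conflicting observations of one customer's address.
observations = [
    {"source": "crm", "value": "12 Egnatia St", "updated": date(2024, 3, 1)},
    {"source": "billing", "value": "12 Egnatia Street", "updated": date(2025, 6, 15)},
    {"source": "support", "value": "12 Egnatia St", "updated": date(2023, 11, 20)},
]


def fuse_by_priority(obs, priority):
    """Source prioritisation: take the value from the highest-ranked source."""
    rank = {source: i for i, source in enumerate(priority)}
    return min(obs, key=lambda o: rank.get(o["source"], len(rank)))


def fuse_by_vote(obs):
    """Voting: take the most frequent value (first seen wins ties)."""
    winner, _ = Counter(o["value"] for o in obs).most_common(1)[0]
    return next(o for o in obs if o["value"] == winner)


def fuse_by_recency(obs):
    """Provenance-aware fusion using timeliness metadata only."""
    return max(obs, key=lambda o: o["updated"])


if __name__ == "__main__":
    print(fuse_by_priority(observations, ["billing", "crm", "support"]))
    print(fuse_by_vote(observations))
    print(fuse_by_recency(observations))
```

On this sample the strategies disagree: voting keeps the abbreviated address, while prioritising billing or preferring the most recent update keeps the spelled-out one. This is why the choice of fusion rule is best recorded alongside the fused record.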
Advanced fusion techniques can incorporate domain-specific data, ontologies, or user input to support conflict resolution. In many real-world applications, data fusion needs to preserve lineage and traceability, so that users can understand how the fused representation was constructed and from which sources it was derived [22,23]. Modern ER and data fusion techniques are increasingly integrated into data management systems, especially Master Data Management (MDM) systems and data lakehouses [3]. In analytical applications that require real-time or near-real-time processing, such tools must scale well while integrating easily with metadata repositories and supporting workflow orchestration [5,26]. The overall process of entity resolution and data fusion is summarised in Figure 1, which illustrates the sequence from similarity computation through blocking and supervised classification to fusion strategies that produce unified records with preserved lineage.
2.4. Illustrative Resolution Scenarios and Workflows
This subsection provides practical examples that move beyond conceptual statements
and demonstrate concrete, reproducible resolution flows for common heterogeneity issues.
2.4.1. Scenario A: Schema Name Mismatch (customer_id vs. client_no)
1. Evidence gathering: Apply name-, structure-, and instance-based matching algorithms to recommend a correspondence between customer_id (RDBMS A) and client_no (RDBMS B).
2. Candidate alignment: Check against a domain ontology (e.g., “Customer” with one main identifier) to resolve homonyms and enforce cardinality/uniqueness constraints.
3. Mapping synthesis: Develop an operational mapping (for instance, an SQL view or an ELT transformation) that presents client_no as customer_id while ensuring type harmonisation.
4. Validation and lineage: Validate precision and recall on a suitably chosen sample. Keep the mapping, quality measures, and provenance in the metadata catalogue and implement the mapping as a reusable module in the pipeline.
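Steps 3 and 4 of Scenario A can be sketched with an in-memory SQLite database. The table names, sample rows, the source_system lineage column, and the predicted/gold correspondence sets are all hypothetical; the sketch shows the shape of an executable mapping and a precision/recall check, not a production pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source tables: system A stores integer ids, system B text ids.
    CREATE TABLE a_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b_clients (client_no TEXT, name TEXT);
    INSERT INTO a_customers VALUES (101, 'Maria'), (102, 'John');
    INSERT INTO b_clients VALUES ('102', 'John'), ('103', 'Eleni');

    -- Mapping synthesis: expose client_no as customer_id with type harmonisation,
    -- and keep a source_system column as minimal lineage.
    CREATE VIEW unified_customers AS
        SELECT customer_id, name, 'A' AS source_system FROM a_customers
        UNION ALL
        SELECT CAST(client_no AS INTEGER) AS customer_id, name, 'B' AS source_system
        FROM b_clients;
""")

rows = conn.execute(
    "SELECT customer_id, source_system FROM unified_customers "
    "ORDER BY customer_id, source_system"
).fetchall()


def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Validation: compare proposed correspondences with a curated gold sample."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical matcher output and expert-curated ground truth.
predicted = {("customer_id", "client_no"), ("name", "name"), ("email", "phone")}
gold = {("customer_id", "client_no"), ("name", "name"), ("email", "mail")}

if __name__ == "__main__":
    print(rows)
    print(precision_recall(predicted, gold))
```

A view keeps the mapping declarative and versionable, which makes it straightforward to register in a catalogue; an ELT transformation that materialises the same SELECT would be an equally valid reading of step 3.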
Figure 1. Entity resolution and data fusion: similarity functions; blocking; supervised classification using previously extracted features; clustering; and fusion via source prioritisation, aggregation/voting, and provenance-aware selection, producing unified records that preserve lineage/traceability.
Figure 2 visualises the matcher → mapping → validation workflow.
2.4.2. Scenario B: Instance-Level Ambiguity (Date “01/12/2025”)
1. Policy: Adopt a canonical date format (ISO 8601, YYYY-MM-DD) as a data contract for analytical layers.
2. Enforcement (schema-on-write where feasible): At ingestion, parse source dates using locale-aware parsers with explicit day/month disambiguation. Reject or quarantine unparsable records. Normalise to ISO.
3. Hybrid fallback (schema-on-read): For immutable/legacy sources, apply a deterministic normaliser at query time with source-specific locale rules. Mark records with confidence and retain the raw value.
4. Validation and observability: Define expectations (e.g., no ambiguous DD/MM vs. MM/DD overlaps) and monitor drift. Expose freshness and conformance metrics in the catalogue/lineage system.
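The enforcement and fallback steps of Scenario B can be sketched as a deterministic normaliser driven by per-source locale rules. The source names, the format table, and the field names of the returned record are assumptions made for this example; only the standard-library datetime parser is used.

```python
from datetime import datetime

# Hypothetical per-source locale profiles; real ones would live in the catalogue.
SOURCE_FORMATS = {
    "eu_crm": "%d/%m/%Y",      # day-first source
    "us_billing": "%m/%d/%Y",  # month-first source
    "iso_feed": "%Y/%m/%d",    # year-first source with slashes
}


def is_ambiguous(raw: str) -> bool:
    """True if day-first and month-first readings both parse and disagree."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass
    return len(readings) > 1


def normalise_date(raw: str, source: str) -> dict:
    """Normalise to ISO 8601, retaining the raw value and a confidence tag."""
    fmt = SOURCE_FORMATS.get(source)
    if fmt is None:
        return {"raw": raw, "iso": None, "status": "quarantined",
                "reason": "no locale profile for source"}
    try:
        parsed = datetime.strptime(raw, fmt).date()
    except ValueError:
        return {"raw": raw, "iso": None, "status": "quarantined",
                "reason": "unparsable under source profile"}
    confidence = "profile-dependent" if is_ambiguous(raw) else "unambiguous"
    return {"raw": raw, "iso": parsed.isoformat(), "status": "ok",
            "confidence": confidence}


if __name__ == "__main__":
    for raw, source in [("01/12/2025", "eu_crm"), ("01/12/2025", "us_billing"),
                        ("31/12/2025", "us_billing"), ("01/12/2025", "legacy_x")]:
        print(source, normalise_date(raw, source))
```

The same string yields different canonical dates depending on the source profile, which is why the sketch never guesses when no profile exists and tags profile-dependent results so that downstream expectation checks can monitor them.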
Figure 2. Schema-matching → mapping → validation workflow for resolution of customer_id vs. client_no.
Table 4 summarises the policy, and Figure 3 shows the control flow.
Table 4. Ambiguous date inputs and canonicalisation policy for analytical layers.
Ambiguous Input | Enforced Rule (Contract) | Canonical Form | Validation Mechanism
01/12/2025 (unknown locale) | YYYY-MM-DD (ISO 8601) | 2025-12-01 or reject | Expectation checks, quarantine on ambiguity
12/01/2025 (unknown locale) | YYYY-MM-DD | 2025-01-12 or reject | Locale profile, confidence tagging
2025/12/01 | YYYY-MM-DD | 2025-12-01 | Strict pattern enforcement
Figure 3. Date disambiguation pipeline with schema-on-write preference and hybrid read-time
normalisation under lineage/quality monitoring.
2.4.3. Scenario C: Semantic Conflict (Salary = Gross vs. Net)
1. Vocabulary alignment: Bind source attributes to ontology concepts (e.g.,
   GrossAnnualSalary and NetMonthlySalary).
2. Normalisation: Define transformation rules (unit/calendar/period adjustments) with
   explicit semantics. Compute derived measures only where definable.
3. Query mediation: Expose a semantically consistent view (virtualised or materialised)
   that prevents cross-meaning joins. Annotate with provenance and assumptions in
   the catalogue.
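A minimal Python sketch of this scenario follows, assuming a hypothetical `Salary` type that carries its meaning (basis and period) alongside the amount. The names `Salary`, `to_annual`, and `difference` are illustrative and not taken from any ontology or library. The period adjustment is definable and is therefore computed, whereas a gross-to-net conversion would need a tax model that is not available here, so the cross-meaning comparison is refused rather than guessed.

```python
from dataclasses import dataclass

MONTHS_PER_YEAR = 12


@dataclass(frozen=True)
class Salary:
    amount: float
    basis: str    # "gross" or "net"
    period: str   # "annual" or "monthly"


def to_annual(s: Salary) -> Salary:
    """Period adjustment is always definable: scale monthly figures to annual."""
    if s.period == "annual":
        return s
    if s.period == "monthly":
        return Salary(s.amount * MONTHS_PER_YEAR, s.basis, "annual")
    raise ValueError(f"unknown period: {s.period!r}")


def difference(a: Salary, b: Salary) -> float:
    """Compare two salaries only when they share the same meaning."""
    a, b = to_annual(a), to_annual(b)
    if a.basis != b.basis:
        # Gross vs. net needs a tax model, which this sketch does not assume.
        raise ValueError("gross and net salaries are not directly comparable")
    return a.amount - b.amount
```

For example, `difference(Salary(48000.0, "gross", "annual"), Salary(3500.0, "gross", "monthly"))` is defined, while mixing a gross and a net figure raises an error that a mediation layer could surface to the analyst.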
2.5. Limitations and Possible Biases in Reviewed Methods
Despite substantial progress, each strand of data integration exhibits inherent risks
and biases, which may constrain reproducibility.
Schema matching and mapping. Lexical/structural matchers inherit biases from
naming conventions and language resources. Instance-based approaches are sensitive
to distributional drift, identifier inconsistencies, and noisy or limited gold standards.
Consequently, reported accuracy may overstate generalisability, and human validation
remains indispensable, introducing subjectivity [21,33].
Entity resolution and data fusion. Blocking and filtering reduce comparison space
but can depress recall. Severe class imbalance biases classifiers toward majority decisions.
Fusion rules that ignore source reliability, provenance, and timeliness may propagate
systematic errors [25,26,35]. These factors become more acute at scale and under streaming
or near-real-time constraints.
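The recall risk of blocking can be made concrete with a small Python sketch. The records and the blocking key (the first three letters of the name) are hypothetical and deliberately naive; they only illustrate how a key that shrinks the comparison space can also separate records that refer to the same entity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; 1 and 5 denote the same person with reordered tokens.
records = {
    1: "Jon Smith",
    2: "John Smith",
    3: "Jon Smyth",
    4: "Mary Jones",
    5: "Smith, Jon",
}


def blocking_key(name: str) -> str:
    """Naive blocking key: first three characters, lower-cased."""
    return name.replace(",", "").lower()[:3]


def candidate_pairs(recs):
    """Only records that share a blocking key are compared."""
    blocks = defaultdict(list)
    for rid, name in recs.items():
        blocks[blocking_key(name)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


all_pairs = set(combinations(sorted(records), 2))   # exhaustive comparison
blocked_pairs = candidate_pairs(records)            # after blocking
```

Here blocking reduces ten exhaustive comparisons to a single candidate pair, but the true match between records 1 and 5 never reaches the classifier, which is the recall loss described above.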
Ontology-based and semantic integration. Alignment suffers from coverage gaps,
ambiguous correspondences, and evolving vocabularies. Fully automatic matching remains
elusive in heterogeneous domains, often requiring expert intervention that is costly and
potentially subjective. Alignment errors can propagate downstream and bias integrated
views [17,18,24].
System-level evaluation in federated/lakehouse settings. Heterogeneous connectors
and partial statistics create optimiser blind spots, yielding mis-costed pushdown and
inefficient cross-source joins. Materialisation and caches risk staleness if not incrementally
maintained. Thus, reported performance may reflect system-specific assumptions rather
than general capability [6,7].
3. Storage Architectures for Analytical Workloads
This section discusses how storage choices shape analytics: row vs. column
stores for OLTP/OLAP trade-offs, NoSQL models (key-value, document, and graph)
for semi-structured needs, and cloud-native columnar formats and lakehouse layers for
scale and reliability. It sets the stage for mapping workload patterns to the right
storage medium.
3.1. Row-Oriented and Column-Oriented Stores
The choice of storage structure strongly affects the scalability and efficiency of
analytical systems. Traditional row-oriented relational databases like PostgreSQL, MySQL,
and Oracle store data records sequentially, with the fields for each record located near each
other on the storage device. This structure is a very efficient storage mechanism for the
typical Online Transaction Processing (OLTP) activities like reading or updating complete
records. Classic examples are updating a customer’s details and placing a new order [4,36].
Analytical queries, also known as Online Analytical Processing (OLAP), typically
read a few columns across a very large number of rows (e.g., summing sales amounts
over many geographic regions or filtering on integer attributes). Such queries perform
poorly in row-oriented databases because every scanned record carries its unneeded
fields along, which inflates the number of input/output operations and degrades cache
utilisation. Column-oriented databases were introduced to address this weakness.
Representative examples are MonetDB, Vertica, and ClickHouse, which store each column
in separate contiguous blocks, thereby allowing fast column scans, higher compression
ratios, and better CPU vectorisation [4,37,38].
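The access-pattern difference can be sketched in a few lines of Python. The table, its column names, and the `fields_touched` counter are hypothetical; the counter merely stands in for the data volume an engine would have to read, since plain Python lists do not model real storage I/O.

```python
# Row layout: one record per entry, all fields stored together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]

# Column layout: one contiguous list per attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_from_rows(rows):
    """Row layout: every record is fetched although one field is needed."""
    total, fields_touched = 0.0, 0
    for record in rows:
        fields_touched += len(record)   # the whole record is read
        total += record["amount"]
    return total, fields_touched


def total_from_columns(columns):
    """Column layout: only the 'amount' column is scanned."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)
```

Both functions return the same total, but the row layout touches every field of every record, whereas the column layout touches only the values of the aggregated attribute.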
Column stores use encoding methods such as run-length encoding (RLE), dictionary
encoding, and bit-packing to store data compactly while supporting fast filtering and
aggregation. They also use lightweight indexing structures such as zone maps and bitmap
indexes to speed up analytical query execution [39–41]. Row-oriented databases once
predominated in traditional transactional environments but are no longer the only type
of database in common use. Column-store systems like Google BigQuery, Amazon Redshift,
and Snowflake have become popular in analytical environments and cloud data
warehousing scenarios. Figure 4 contrasts row stores and column stores and situates hybrid
systems that support mixed workloads.
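Two of the encodings named above can be sketched in Python as follows. This is a didactic illustration with hypothetical function names, not the on-disk layout of any particular engine. It shows that run-length encoding collapses repeated neighbours into (value, count) runs, that dictionary encoding replaces values with small integer codes, and that a filter-and-count can be answered on the runs without decoding.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: consecutive repeats become (value, run_length)."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]


def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]


def dict_encode(values):
    """Dictionary encoding: distinct values get integer codes in first-seen order."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), codes


def count_equal(runs, target):
    """Evaluate `column == target` directly on the encoded runs."""
    return sum(n for v, n in runs if v == target)


region = ["EU", "EU", "EU", "US", "US", "EU"]
runs = rle_encode(region)
dictionary, codes = dict_encode(region)
```

With the sample column, `runs` is `[("EU", 3), ("US", 2), ("EU", 1)]` and `codes` is `[0, 0, 0, 1, 1, 0]` against the dictionary `["EU", "US"]`. Sorted or low-cardinality columns benefit most, which is one reason column stores often sort data on frequently filtered attributes.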
Modern systems like Apache HBase and SAP HANA use dual-storage engines
or hybrid settings, which store data in either a column-oriented or a row-oriented
scheme and so meet the varied needs of heterogeneous workloads [42,43].
This blending of storage models reflects progress toward storage frameworks that are
matched to the dominant access patterns of analytical workloads.
Figure 4. Storage models and workload fit.
3.2. Key-Value, Document, and Graph Databases
To manage semi-structured or non-tabular data efficiently, several NoSQL storage
models have been developed, each with distinct data models and query paradigms. Most
NoSQL storage systems can be classified into three main categories (Table 5): key-value
stores, document stores, and graph databases [44–46].
Table 5. NoSQL database categories, their key characteristics, and example systems.
| NoSQL Category | Characteristics | Examples |
|---|---|---|
| Key-value store | Data stored as key/value pairs, optimised for lookups | Amazon DynamoDB and Redis |
| Document store | Semi-structured JSON-like documents and flexible schema | MongoDB and Couchbase |
| Graph database | Data as nodes and edges; suited for relationships | Neo4j and Amazon Neptune |
Key-value stores like Redis, Amazon DynamoDB, and Riak structure data in pairs,
where a unique key identifies a corresponding value that can be returned as an object
or an unstructured blob. Their performance profile is defined mainly by high read
throughput, low latency, and distributed scalability, which greatly simplifies application
design in areas like session management, caching, and the storage of very large record
sets [44,46].
However, their primitive querying capability severely constrains their use in analytical
scenarios unless it is complemented by secondary indexing or specialised processing methods.
Document stores like MongoDB, Couchbase, and Amazon DocumentDB structure
their data as hierarchical, JSON-like documents. A key feature of document
stores is a flexible schema definition combined with nested structures and pipeline
aggregation, which greatly simplifies relatively complex queries over
semi-structured data [47,48]. Document stores are often used where the structure of the
data changes frequently or where records are heterogeneous, as in content management,
product catalogues, or telemetry storage in Internet of Things scenarios.
Graph databases such as Neo4j, Amazon Neptune, and JanusGraph are designed to
store entities (nodes) and their relationships (edges), along with attributes. These databases
excel at handling highly interconnected datasets and support graph pattern matching via
languages such as Cypher, Gremlin, or the SPARQL Protocol and RDF Query Language
(SPARQL), where RDF is the Resource Description Framework [49,50]. In addition, they
provide specialised features for a wide range of applications, from social network
analysis to recommendation systems, fraud detection, and the querying of
bioinformatics data. Despite the schema flexibility and scalability of these NoSQL
databases, issues arise in guaranteeing strong consistency when compared with the
richer functionality of traditional relational database management systems (RDBMSs).
As such, in the analytical settings considered in this review, these systems operate as
adjunct resources or data sources in computational pipelines, adding value by enabling
querying and aggregation in subsequent stages of computation.
To complement the categorical overview, Table 6 benchmarks the three NoSQL families
on consistency/availability, scale-out, and query-path performance (including scans vs.
traversals), consolidating peer-reviewed evidence.
Table 6. Benchmarking synthesis across NoSQL families used for graphs: consistency/availability
and scale-out mechanics, plus workload-level performance (scans vs. traversals) [44,48–50].
| Metric | Key-Value (e.g., Dynamo) | Document (e.g., MongoDB/CouchDB/Couchbase) | Graph DBs (LPG/RDF) |
|---|---|---|---|
| Consistency vs. availability | “Always writable” design, quorum-tuned R/W, eventual consistency with vector clocks, and hinted handoff for node outages | Stronger per node consistency, cluster settings vary by product, and designed for CRUD with JSON/BSON | Transaction models vary; many support ACID for OLTP, and global analytics are typically read-only |
| Partitioning/scale-out | Consistent hashing, virtual nodes, sloppy quorum, and seamless node add/remove function | Sharding and replica sets are common, as well as per collection partitioning | Sharding/replication depend on the engine; many native stores optimize locality for traversals |
| Write path | Fast, partition-local writes; durability ensured via a configurable W | Bulk inserts and high-rate CRUD; durability ensured via journaling/replica sync | OLTP writes supported (engine-dependent), and heavy analytics often separated from the write path |
| Scan/range queries | Limited (key-oriented); range needs secondary/indexed paths | YCSB: MongoDB has the best overall runtime, while a scan-heavy workload makes CouchDB faster and CouchDB scales best with threads | Scans expressed via label/property predicates; the cost depends on index design; not a primary strength vs. documents |
| Traversals/locality | Multi-hop traversals require app-level joins or pre-materialization | Multi-hop joins across collections are costly and not traversal-centric | Native adjacency (AL and direct pointers), and traversal cost grows with the number of visited subgraphs, not graph size; suited to path/pattern queries |
| Indexing | Primary key and optional secondary indexes for ranges/filters | Rich secondary and compound indexes; text/geo often available | Structural (neighbourhood) + data indexes; languages expose pattern/path operators |
| Query languages | KV APIs and app-side composition | Aggregation pipelines and SQL-like DSLs | SPARQL (RDF), Cypher/Gremlin (LPG), and mature pattern/path semantics |
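The partitioning mechanism listed for key-value stores in Table 6 (consistent hashing with virtual nodes) can be sketched in Python. This is a simplified illustration under stated assumptions: the `HashRing` class, the choice of SHA-256 as hash function, and the default of 64 virtual nodes per server are hypothetical, and replication, quorums, and hinted handoff are omitted.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        self.vnodes = vnodes
        self._ring = []                     # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node owns many small arcs via its virtual nodes.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        # Rebuilding the position list keeps the sketch short; a real
        # implementation would maintain it incrementally.
        positions = [p for p, _ in self._ring]
        idx = bisect.bisect_right(positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for scale-out is that adding a node moves only the keys that now fall on the new node's arcs; every other key keeps its owner, which is what makes node addition and removal comparatively cheap.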
3.3. Cloud-Native Formats and Distributed File Systems (e.g., Parquet and Delta Lake)
Recent developments in cloud-native infrastructure and data lakes have accelerated
the adoption of open, columnar file formats with distributed storage systems. Apache
Parquet, Apache Optimised Row Columnar (ORC), and Avro enable fast storage solutions
with efficient query operations for large amounts of structured and semi-structured data in
cloud-centric and distributed environments. Beyond disk-oriented formats, evaluations
also consider in-memory columnar exchange frameworks such as Apache Arrow and its
Feather variant. While Arrow/Feather are not designed for long-term storage, they provide
extremely fast (de)serialisation and zero-copy interoperability across engines, making them
an important baseline in benchmarking studies [1,51].
Parquet is a columnar storage format designed to support nested data structures
and is heavily optimised for read-intensive workloads.
The format is compatible with multiple big data frameworks,
such as Apache Spark, Hive, Dremio, and Amazon Web Services (AWS) Athena. Its efficient
encodings and rich metadata make query filtering and planning more
efficient, reducing the need for full scans. Likewise, ORC is designed to enhance
read and write performance and includes advanced compression mechanisms,
predicate pushdown, and indexing features [1,52]. Table 7 consolidates results on Parquet,
ORC, and Arrow/Feather, highlighting compression efficiency, query performance, nested
data handling, and workload-specific trade-offs as reported in peer-reviewed studies.
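How metadata reduces the need for full scans can be illustrated with a small Python sketch of min/max statistics. It is a simplification under stated assumptions: the `RowGroup` class and the fixed group size are hypothetical, and the real Parquet and ORC formats keep such statistics in their own footer and index structures rather than in Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RowGroup:
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Statistics are computed once, at write time.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def build_row_groups(values, size):
    return [RowGroup(values[i:i + size]) for i in range(0, len(values), size)]


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually decoded."""
    hits, decoded = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue            # statistics prove that no row can match
        decoded += 1
        hits.extend(v for v in group.values if v > threshold)
    return hits, decoded


groups = build_row_groups(list(range(100)), size=10)
hits, decoded = scan_greater_than(groups, threshold=84)
```

In this example only two of the ten row groups are decoded. The benefit depends on how well the data is clustered on the filtered column; on unsorted data the min/max ranges overlap and far fewer groups can be skipped.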
The development of lakehouse architectures is due to the natural limitations inherent
in raw data lakes—namely, their lack of sufficient data management and lack of Atomicity,
Consistency, Isolation, and Durability (ACID) compliance. Lakehouse architectures take
advantage of cloud-native infrastructure in providing support for transactions, versioning,
incremental schema evolution, and temporal capabilities [3,10].
Table 7. Benchmarking results for open columnar formats: reported performance characteristics of
Parquet, ORC, and Arrow/Feather [1,2,41,52].
| Metric | Parquet | ORC | Arrow/Feather |
|---|---|---|---|
| Compression ratio | Strong overall compression, especially with dictionary encoding | Often higher compression on structured/numeric workloads | Not a storage format; minimal compression; focuses on speed |
| Scan/decoding speed | Faster end-to-end decoding in mixed workloads | Slightly slower, but predicate evaluation is stronger | Fastest (de)serialisation throughput; zero-copy in-memory |
| Predicate pushdown/skipping | Effective but limited by column statistics | Fine-grained zone maps yield strong selective query performance | Not applicable (in-memory only) |
| Nested data handling | R/D-level encoding and efficient leaf-only access | Presence/length streams; overhead increases with depth | Dependent on producer/consumer; no disk encoding |
| Workload trade-offs | Performs best on wide tables and vectorised execution | Strong on narrow/deep workloads with high selectivity | Best as interchange for ML/analytics pipelines |
In addition, lakehouse architectures provide a single platform that combines the
scalability usually associated with data lakes with the reliability and ease of management
that define data warehouses. Current storage infrastructure supports both batch and
streaming data, allows for unified read–write operations across multiple data nodes, and
provides inherent compatibility with SQL engines and Spark [3,10]. Additionally,
these storage platforms address persistent issues in large-scale analytical
systems, such as keeping data fresh, imposing server-level schema constraints, and
ensuring consistency for update operations. Adoption of these storage platforms is rapidly
gaining momentum in analytical pipelines across industries such as
finance, healthcare, advertising, and scientific exploration. Notably, lakehouse platforms
and cloud-native storage platforms provide storage-agnostic features that boost
interoperability across heterogeneous storage backends like Amazon S3, Google
Cloud Storage, and Azure Data Lake. This storage-agnostic capability enables multi-cloud
adoption, with streamlined integration across multiple environments. Combining these
platforms with federated query engines like Presto, Trino, and Starburst provides a
storage-oriented foundation for modern scale-out analytics systems [7,53–55].
4. Bridging Integration and Storage
The interplay between data mobility and accessibility emerges from the mechanisms
that connect integration and storage: ETL and ELT pipelines, data virtualisation,
federated SQL engines, and the trade-offs between schema-on-read and
schema-on-write. The aim in choosing among these tools is to minimise redundancy and
increase agility while maintaining performance and governance levels.
4.1. ETL/ELT Pipelines and Data Virtualisation
In heterogeneous environments, integration depends on efficiently moving and
transforming data across diverse storage backends rather than enforcing a single schema upfront.
Extract–Transform–Load (ETL) and Extract–Load–Transform (ELT) are the main methods
used to enable this type of integration [5,56].
In conventional ETL, data is extracted, then cleaned and transformed to conform to
a normalised target schema. Finally, the data is stored in a main repository, which is
most often a relational data warehouse [56]. These practices are common in enterprise
business intelligence scenarios, where data quality and conformance
with schema specifications must be established before data ingestion.
Commercial ETL offerings like Talend, Informatica, and Pentaho provide platforms
for designing transformation workflows with rule-based data cleansing, deduplication,
and aggregation capabilities [56,57].
Nevertheless, the arrival of storage solutions featuring schema-on-read abilities, along
with cloud-native data lakes, has made it possible for the ELT methodology to be used
more extensively [3,5]. In such a context, unstructured or semi-structured data is
first stored in a horizontally scalable data repository like Amazon S3, Hadoop Distributed
File System (HDFS), or Azure Data Lake, then transformed at query time or
at a later stage of the processing pipeline. This approach increases flexibility and
ingestion rates, especially in streaming scenarios, where the schema changes
frequently, or where exploratory analysis is beneficial. Both ETL and ELT
require careful management of metadata, schema versioning, and data
lineage so that transformations remain repeatable and auditable.
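The difference in ordering between the two approaches can be sketched with Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions: the event records, table names, and cleaning rules are hypothetical, and an in-memory SQLite database stands in for both the warehouse and the data lake. In the ETL path, validation happens in the pipeline before loading; in the ELT path, raw text is loaded untouched and the transformation is expressed as SQL inside the engine.

```python
import sqlite3

# Hypothetical raw feed; the third record is malformed.
raw_events = [
    {"id": "1", "amount": " 10.50 ", "country": "gr"},
    {"id": "2", "amount": "7", "country": "DE"},
    {"id": "x", "amount": "n/a", "country": "de"},
]


def etl(conn, events):
    """Transform and validate in the pipeline, then load only conforming rows."""
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL NOT NULL, "
        "country TEXT NOT NULL)"
    )
    for e in events:
        try:
            row = (int(e["id"]), float(e["amount"]), e["country"].strip().upper())
        except ValueError:
            continue  # a production pipeline would quarantine the record
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)


def elt(conn, events):
    """Load raw text as-is, then transform inside the engine with SQL."""
    conn.execute("CREATE TABLE raw_sales (id TEXT, amount TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO raw_sales VALUES (:id, :amount, :country)", events
    )
    conn.execute(
        """
        CREATE VIEW sales AS
        SELECT CAST(id AS INTEGER)        AS id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(TRIM(country))       AS country
        FROM raw_sales
        WHERE TRIM(id) GLOB '[0-9]*' AND TRIM(amount) GLOB '[0-9]*'
        """
    )
```

Both paths expose the same two clean rows through `sales`, but only the ELT path still holds the malformed record in `raw_sales`, where it can be re-interpreted later if the rules change. That retained raw copy is the flexibility, and the governance burden, that ELT trades for up-front conformance.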
Declarative transformation pipeline tools like Apache NiFi, Airbyte, dbt (data build tool),
and Dagster have features for automatic testing, distribution, and scheduling. Their
integration with cloud storage infrastructure, version control systems like Git, and distributed
computation environments increases their contribution within modern data integration
solutions [57]. Data virtualisation is an alternative or complement
to traditional ETL/ELT practices: a logical abstraction layer integrates
heterogeneous data sources without pre-transfer or pre-transformation phases.
Denodo, Red Hat JBoss Data Virtualisation, and Dremio use such abstraction
layers to provide unified views over heterogeneous back-end systems through federated
query engines. These solutions integrate relational databases, NoSQL databases, REST APIs,
and file storage within a semantic modelling context, thereby allowing end users to query
the combined data using standard SQL queries, with the platform automatically managing
data access tasks in the background [6,58]. To address heterogeneous data pipelines, one
may use ETL for warehouses, ELT on cloud storage, or logical data virtualisation. Table 8
compares these approaches and tools. Virtualisation minimises redundancy,
can improve query responsiveness in today’s data environments, and allows for the
integration of multiple analytical models. On the downside, it can degrade performance
because of latency, data inconsistencies, or the unavailability of data sources
at query time. For these reasons, hybrid infrastructures
often use virtualisation for on-demand querying, complemented
by ETL/ELT practices for structured data management.
Table 8. Comparison of data integration strategies: ETL, ELT, and data virtualisation.
| Approach | Process | Typical Tools |
|---|---|---|
| ETL (Extract–Transform–Load) | Extract → Transform → Load into data warehouse | Talend, Informatica, and Pentaho |
| ELT (Extract–Load–Transform) | Extract → Load (raw) → Transform later in data lake | Hadoop HDFS, Spark, and dbt |
| Data virtualisation | Virtual layer integrates sources in real time | Denodo, Dremio, and Presto/Trino |
Decision Guidance: ETL vs. ELT vs. Virtualisation
ETL is preferable when conformance and upstream data quality must precede analytical
use (e.g., governed marts and slowly evolving schemas) [56]. ELT is effective for
high-volume, schema-evolving feeds landing in open columnar storage. Enforcement is
deferred to query time or periodic consolidation in the lakehouse, leveraging vectorised
execution engines [51,59]. Virtualisation/federation is appropriate where duplication is
undesirable or sources must remain authoritative. Performance hinges on connector-aware
pushdown and cost-guided planning [55,58]. In practice, hybrids persist frequently used
aggregates while federating the long tail, with promotion from late-bound to enforced
schemas once contracts stabilise (see Table 8 and Figure 5).
Figure 5. Federated SQL engine architecture and pushdown.
4.2. Federated Querying and Unified Query Engines
Federated query systems are central to storage integration. Federated systems allow
the end user to run analytical queries on different heterogeneous data sources through a
single unified interface. As opposed to traditional integration methods based on
materialised data warehouses, federated systems operate under a query-time integration model,
which allows for synchronous and dynamic integration of multiple sources of data [6,55].
Prominent federated query engines include Presto, Trino (formerly known as PrestoSQL),
Apache Drill, and Starburst Enterprise. These engines provide a unified SQL-focused
querying interface by interacting with multiple relational databases, NoSQL
databases, cloud object storage services, and distributed file systems through
connectors [7,55]. Queries can carry out joins, groupings, and filtering
across several systems, like MySQL, MongoDB, Hive, and S3, without the need for data
relocation or pre-loading.
Such engines are characterised by advanced capabilities such as the following:
• Cost-based query optimisation over various backend systems;
• Predicate pushdown, allowing filtering operations to be executed near their
  corresponding data sources;
• Parallel execution plans, enabling distributed scalability;
• Connector extensibility for custom data-source support.
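The effect of predicate pushdown across connectors can be sketched in Python. This is a toy model, not the architecture of Presto, Trino, or any other engine: the `SqlConnector` and `ListConnector` classes, the `rows_transferred` counter, and the sample data are hypothetical. One source (SQLite) can evaluate a filter itself, the other cannot, and the counter records how many rows each source ships to the engine.

```python
import sqlite3


class SqlConnector:
    """Source that evaluates an equality predicate itself (pushdown)."""

    def __init__(self, conn: sqlite3.Connection, table: str):
        self.conn, self.table = conn, table
        self.rows_transferred = 0

    def scan(self, column=None, value=None):
        # Identifiers are trusted in this sketch; only the value is bound.
        sql, params = f"SELECT * FROM {self.table}", ()
        if column is not None:
            sql, params = sql + f" WHERE {column} = ?", (value,)
        cur = self.conn.execute(sql, params)
        names = [d[0] for d in cur.description]
        rows = [dict(zip(names, r)) for r in cur.fetchall()]
        self.rows_transferred += len(rows)
        return rows


class ListConnector:
    """Source without predicate support: ships every row to the engine."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_transferred = 0

    def scan(self, column=None, value=None):
        self.rows_transferred += len(self.rows)
        # The filter runs engine-side, after the full transfer.
        return [r for r in self.rows if column is None or r[column] == value]


def federated_join(orders, customers, country):
    """Join customers (filtered by country) with their orders across sources."""
    out = []
    for c in customers.scan("country", country):
        for o in orders.scan("customer_id", c["customer_id"]):
            out.append((c["name"], o["amount"]))
    return sorted(out)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 5.0), (2, 11, 7.5), (3, 10, 1.0), (4, 12, 9.0)],
)
orders = SqlConnector(conn, "orders")
customers = ListConnector([
    {"customer_id": 10, "name": "Ada", "country": "GR"},
    {"customer_id": 11, "name": "Bo", "country": "DE"},
    {"customer_id": 12, "name": "Cy", "country": "DE"},
])
result = federated_join(orders, customers, "GR")
```

After the join, the SQL source has shipped only the two matching order rows, while the list source has shipped all three customer rows for engine-side filtering. A cost-based planner weighs exactly this kind of asymmetry between connectors when ordering joins.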
Figure 5 shows a federated SQL engine orchestrating connectors, predicate pushdown,
and parallel execution across heterogeneous sources.
To provide a practical comparison, Table 9 benchmarks representative federated
engines across optimisation, execution, and scalability features.
In independent YCSB experiments, MongoDB achieves the best overall runtime across
diverse CRUD workloads, while scan-heavy workloads favour CouchDB, which also
exhibits the strongest thread scale-up among the three evaluated document stores. In
contrast, native graph engines optimise adjacency locality (e.g., adjacency lists and, in some
designs, direct pointers), so traversal cost is governed by the visited subgraph rather than
the total graph size.
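The locality argument can be illustrated with a short Python sketch of a bounded breadth-first traversal over adjacency lists. The function name `k_hop` and the `edges_examined` counter are hypothetical and serve only to make the cost visible; native graph engines implement adjacency in their own storage structures.

```python
from collections import deque


def k_hop(adj, start, k):
    """Return nodes within k hops of `start` and the number of edges examined."""
    seen = {start}
    frontier = deque([(start, 0)])
    edges_examined = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue                      # do not expand beyond k hops
        for neighbour in adj.get(node, ()):
            edges_examined += 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen - {start}, edges_examined


small = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}

# Same neighbourhood embedded in a much larger graph of unrelated nodes.
large = dict(small)
large.update({f"x{i}": [f"x{i + 1}"] for i in range(10_000)})
```

Calling `k_hop` with `start="a"` and `k=2` examines three edges in both graphs, because the unrelated nodes are never reached. A join-based formulation over a global edge table would, without suitable indexes, have to consider edges outside the neighbourhood as well.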
The capabilities of these systems have been documented and have evolved in production
deployments [7]. For instance, a financial analyst can use Trino to combine reference
data from a PostgreSQL table with transactional data stored in Parquet format
in an Amazon S3 bucket and metadata from a MongoDB collection,
without reorganising or duplicating data [55,58]. Federated querying reduces
data transfer costs. At the same time, it introduces latency, along with
concerns about the availability of data sources and consistency guarantees.
In addition, performance is limited by restrictions
specific to certain data sources, which are exacerbated by a lack of detailed indexes or
statistics. To counter these issues, systems use techniques such as
caching layers, materialised views, and the reuse of query
results. Orchestration platforms (for example, Airflow) and catalogue services (for example,
AWS Glue and Hive Metastore) are integrated with federated engines to manage schema
definitions and improve execution efficiency. These components form the backbone
of lakehouses, allowing real-time or near-real-time analysis of
structured and semi-structured data in formats such as Delta Lake, Apache Iceberg, or
ORC [3,10].
4.3. Schema-on-Read vs. Schema-on-Write Trade-Offs
A key design trade-off is choosing between schema-on-write and schema-on-read,
which reflect different philosophies for schema management, validation, and integration.
In schema-on-write, data is required to conform to a predefined schema before being
ingested into the system. This approach, prevalent in traditional RDBMS and data
warehouses, enforces strong consistency, data quality, and semantic clarity. It facilitates
indexing, integrity checks, and query optimisation. However, it also introduces rigidity,
requiring costly transformation steps and limiting the ability to incorporate evolving or
irregular data sources quickly [56].
On the other hand, the schema-on-read approach postpones the application of the schema
until a query is executed. This approach is commonly used with
data lakes and semi-structured storage, using data formats like JSON or
Avro, thereby enabling more flexibility and faster data ingestion [5]. It allows
raw or poorly structured data to be stored while the application
of the schema is deferred, so that analysts and researchers can run analyses with little
preprocessing. This method is particularly useful for
experimental studies, ad hoc analysis, and rapid prototyping.
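The two philosophies can be contrasted in a minimal Python sketch. The contract `SCHEMA`, the field names, and the helper functions are hypothetical; Python lists stand in for the warehouse table and the data lake. The point is only where validation runs: at ingestion for schema-on-write, at query time for schema-on-read.

```python
import json

# Hypothetical contract: field name -> required Python type.
SCHEMA = {"user_id": int, "country": str}


def validate(record, schema=SCHEMA):
    for name, expected in schema.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"{name!r} is missing or not {expected.__name__}")
    return record


def write_with_schema(store, line):
    """Schema-on-write: non-conforming input is rejected at ingestion."""
    store.append(validate(json.loads(line)))


def write_raw(store, line):
    """Schema-on-read: ingestion stores the raw line untouched."""
    store.append(line)


def read_with_schema(store):
    """Bind the schema at query time; bad records surface only here."""
    good, bad = [], []
    for line in store:
        try:
            good.append(validate(json.loads(line)))
        except ValueError:   # also covers malformed JSON (JSONDecodeError)
            bad.append(line)
    return good, bad


lines = [
    '{"user_id": 1, "country": "GR"}',
    '{"user_id": "2", "country": "DE"}',   # user_id has the wrong type
]
```

With schema-on-write, the second line is rejected before it is stored; with schema-on-read, both lines are ingested and the defect is discovered, and paid for, by every query that binds the schema.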
Table 9. Benchmarking of federated SQL engines (Presto, Trino, Drill, and Starburst) across key
performance-related features [6,7,12,55,60].
| Metric | Presto | Trino | Drill | Starburst |
|---|---|---|---|---|
| Connector support | Broad set of connectors and production deployments at scale | Broad OSS connector base; fork of Presto with added features | Schema-free connectors for JSON, NoSQL, and files | Commercial distribution; adds enterprise connectors and governance |
| Predicate pushdown | Supported across RDBMS, Hive, and columnar formats | Supported across most connectors | Predicate pushdown for JSON and columnar data | Extended pushdown support with enterprise optimisations |
| Cost-based optimisation | CBO with table/column statistics and an adaptive join order | Cost-aware planning with statistics integration | Primarily rule-based; limited CBO | Enterprise-grade CBO with workload-aware tuning |
| Execution model | Massively parallel and pull-based execution | Similar to Presto; optimised scheduling | Vectorised operator pipeline | Enhanced parallelism and workload management |
| Caching and materialisation | Result caching, materialised views, and SSD spill options | Spill to disk; MV support in OSS is limited | Reader-level pruning and limited caching features | Adds advanced caching and MV rewriting |
| Fault tolerance | Recoverable grouped execution; Presto-on-Spark variant | Retry-based external FT extensions | No built-in query recovery | Enterprise-level FT and workload isolation |
| Production use evidence | Exabyte-scale at Meta; interactive + ETL workloads | Large-scale OSS and enterprise deployments | Interactive ad hoc analysis over schema-free data | Widely adopted in regulated industries |
However, schema-on-read is challenging to implement for the following reasons:
• Schema interpretation can vary across queries or individuals;
• Complex transformations shift to query time;
• Metadata is difficult to manage in large datasets with dynamically
  changing schemas, which requires careful analysis;
• Query performance degrades as a result of extended binding times.
Proposed solutions involve hybrid approaches that tackle these fundamental trade-offs
in data management systems and improve the efficiency of incremental (delta)
computation. Hybrid lakehouse implementations combine the schema
flexibility of schema-on-read with properties commonly seen in data warehousing
environments, including schema enforcement, version management, and
governance [3,10,59]. Solutions such as Delta Lake and Apache Iceberg provide selective
operations on schemas and data types, as well as partition pruning, in addition to
supporting the manipulation of raw and semi-structured data. One key improvement in
this area is the use of metadata-driven schemas, wherein data conforms to automatically
inferred schemas constructed with the help of tools such as Great Expectations, Deequ,
or DataHub.
These schemas evolve over time and can serve as templates for future verification or
transformation processes, thereby removing the need for up-front enforcement at
ingestion. In practice, instance-level ambiguities
such as locale-dependent dates (e.g., “01/12/2025”) are handled by preferring
schema-on-write conformance (canonical ISO at ingestion) and, where infeasible, by a hybrid
read-time normaliser with explicit lineage and quality expectations.
In summary, the decision between schema-on-write and schema-on-read need not be
seen as a straightforward binary. Rather, it is guided by a number of considerations, such
as the requirements of the application in question, the nature of the workload, the factors
that contribute to data variability, and governance requirements. Systems that leverage a
hybrid strategy and can transition smoothly between the two approaches have been seen
as having flexibility and continued effectiveness over time.
5. Metadata, Lineage, and Semantic Interoperability
Metadata is the glue that holds together discoverability, governance, and trust.
Catalogues and lineage tools manage it, while ontologies and semantic enrichment define
meaning across sources. Together, they form a basis for reproducibility, auditability, and
greater integration at scale.
5.1. Metadata Management Frameworks (e.g., Apache Atlas and DataHub)
The growth of large and complex data ecosystems has brought increased
recognition of the value of metadata (data describing other data) as a key
factor for integration, discoverability, governance, and access control. Several metadata
catalogue systems (Table 10) provide scalable repositories for technical, operational, and
business metadata. In many analytical applications, keeping the metadata coherent
and current is important because it helps mitigate problems related to
schema discrepancies, transformations, and semantic inconsistency [5,61,62].
Modern metadata standards define a unified structure that integrates three classes of
metadata: technical metadata (schemas, types, formats, and partitions), operational
metadata (recency, lineage, and access history), and business metadata (descriptions,
ownership, and usage policies). Open-source offerings such as Apache Atlas, LinkedIn
DataHub, and Amundsen offer horizontally scalable architectures for the ingestion,
storage, and querying of metadata from diverse origins [5,63].
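As a minimal sketch of how technical, operational, and business metadata can sit side by side in one catalogue record, consider the following. All class names, field names, and sample values are hypothetical; the sketch does not reproduce the data model of Apache Atlas, DataHub, or Amundsen.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    columns: dict                 # column name -> declared type
    file_format: str
    partitions: list = field(default_factory=list)


@dataclass
class OperationalMetadata:
    last_updated: datetime        # recency
    upstream: list = field(default_factory=list)  # direct lineage inputs


@dataclass
class BusinessMetadata:
    description: str
    owner: str
    usage_policy: str


@dataclass
class CatalogueEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


def search(entries, term):
    """Return names of entries whose name, description, or columns mention term."""
    term = term.lower()
    hits = []
    for entry in entries:
        haystack = [entry.name, entry.business.description, *entry.technical.columns]
        if any(term in text.lower() for text in haystack):
            hits.append(entry.name)
    return hits


orders = CatalogueEntry(
    "sales.orders",
    TechnicalMetadata({"order_id": "bigint", "customer_id": "bigint"}, "parquet", ["order_date"]),
    OperationalMetadata(datetime(2025, 1, 1, tzinfo=timezone.utc), ["crm.raw_orders"]),
    BusinessMetadata("Confirmed retail orders", "sales-data-team", "internal"),
)
sensors = CatalogueEntry(
    "iot.readings",
    TechnicalMetadata({"device_id": "string", "reading": "double"}, "json"),
    OperationalMetadata(datetime(2025, 1, 2, tzinfo=timezone.utc)),
    BusinessMetadata("Raw sensor readings", "platform-team", "restricted"),
)
print(search([orders, sensors], "customer"))
```

Keeping the three classes separate makes it possible to refresh operational metadata frequently without touching curated business descriptions.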
The platforms also integrate with processing and orchestration tools like Apache
Spark and Airflow and with storage options like Hive, S3, and Delta Lake, and they provide
governance features like Role-Based Access Control (RBAC) and audit logging. Additionally,
these platforms expose Application Programming Interfaces (APIs) and user interfaces
built for data engineers, analysts, and data stewards. The features that most affect
the discoverability and interpretability of data include tagging, search, lineage
visualisation, and automatic classification of data [62,63].
A centralised metadata repository greatly reduces redundancy, encourages the reuse of
schemas, and enables governance for schema development. Additionally, the contribution
of metadata is gaining wider prominence in data quality evaluation, audit activities related
to regulatory compliance checking, and impact assessment in highly regulated industries
like healthcare and financial services. As various datasets are being brought together
with increasingly mature methods of integration, metadata systems become even more
important for consistency and auditability across the data ecosystem [5].
Table 10. Open-source metadata catalogue frameworks and their integration capabilities.
| Platform | Architecture & Integration | Key Features |
|---|---|---|
| Apache Atlas | Metadata repository, native to Hadoop | REST APIs, lineage, audit logging, and RBAC |
| LinkedIn DataHub | Distributed service; platform-agnostic | Metadata ingestion, search UI, and versioning |
| Lyft Amundsen | Graph-backed discovery | Lineage graphs, discovery UI, and access control |
5.2. Ontology-Based Integration and Semantic Enrichment
Alongside structural metadata, semantic metadata clarifies what data elements
mean and how they relate to each other, thereby enabling integration across disparate
and distributed systems. Ontologies formally describe domain concepts and relations,
providing a foundation for semantic interoperability.
Ontologies are key tools in data integration, enabling schema mapping,
entity reference disambiguation, and context alignment. For example, a single entity can
be named differently in two datasets (e.g., “client” vs. “customer”) or appear
at different levels of aggregation (e.g., “monthly income” vs. “annual earnings”).
Schema elements can be related to ontology classes using RDF or OWL. These links allow
systems to infer relationships such as equivalence, subsumption, or aggregation.
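The inference step can be illustrated with plain dictionaries standing in for an RDF/OWL store. This is a sketch under stated assumptions: the `ex:` class identifiers, the two lookup tables, and the `relation` function are hypothetical, and a real deployment would express these links as ontology axioms and use a reasoner.

```python
# Hypothetical links from (dataset, schema element) to an ontology class.
ELEMENT_TO_CLASS = {
    ("crm", "client"): "ex:Customer",
    ("billing", "customer"): "ex:Customer",
    ("hr", "monthly_income"): "ex:MonthlyIncome",
    ("tax", "annual_earnings"): "ex:AnnualIncome",
}

# Hypothetical aggregation axioms: finer-grained class -> coarser-grained class.
AGGREGATES_TO = {"ex:MonthlyIncome": "ex:AnnualIncome"}


def relation(a, b):
    """Infer how two schema elements relate via their ontology classes."""
    class_a, class_b = ELEMENT_TO_CLASS[a], ELEMENT_TO_CLASS[b]
    if class_a == class_b:
        return "equivalent"
    if AGGREGATES_TO.get(class_a) == class_b:
        return "aggregates-to"
    if AGGREGATES_TO.get(class_b) == class_a:
        return "aggregated-from"
    return "unrelated"


print(relation(("crm", "client"), ("billing", "customer")))
print(relation(("hr", "monthly_income"), ("tax", "annual_earnings")))
```

Because the relation is derived from the shared ontology rather than from pairwise column mappings, adding a third dataset only requires linking its elements to existing classes.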
Ontologies are widely used in scientific domains, including bioinformatics (the Gene
Ontology); environmental science (the Semantic Web for Earth and Environmental
Terminology (SWEET) ontologies); and cultural heritage (the International Committee for
Documentation Conceptual Reference Model (CIDOC CRM)) [64–66]. In business,
domain-specific terminologies, e.g., the Financial Industry Business Ontology (FIBO) for
financial services and Health Level Seven (HL7) for healthcare, are designed to enhance
data standardisation and interoperability [67,68].
Entity-linking techniques, combined with text annotation and knowledge graph
refinement, enable dynamic semantic enrichment, the process by which information is
contextualised with external knowledge bases like Wikidata, DBpedia, and domain-specific
taxonomies [69,70]. Dynamic semantic enrichment increases the data’s discoverability,
supports querying through natural language interfaces, and enables reasoning across
heterogeneous datasets. Despite their promise, ontology-based solutions face several
challenges, such as ontology alignment problems, high computational costs, and slow
adoption in settings with insufficient expertise.
Recent ontology alignment frameworks pair retrieval with LLM prompting to raise
unsupervised matching quality while curbing calls to the model. MILA reports the top
F-measure on multiple OAEI tasks and a lower runtime through a prioritised search
pipeline [17]. Complementarily, LLMs4OM explores zero-shot and representation-aware
prompting for ontology matching across diverse ontology views, illustrating how founda-
tion models can support semantic interoperability [18].
Advances such as lightweight ontologies, schema annotation, and semantic data stores
can help increase the scalability and applicability of semantic integration.
5.3. Lineage Tracking and Schema Evolution Handling
Data lineage traces the full life cycle of a dataset, from source to processing to final
use. Within integrated environments, lineage analysis explains the data origin, validates
its integrity, and detects anomalies in data pipelines. It also supports compliance with
regulatory requirements, like the “right to be forgotten” imposed by the General Data
Protection Regulation (GDPR); improves reproducibility; and supports cooperation [23,62].
Lineage can be captured at several levels of granularity:
• Table-level lineage describes the relationship between datasets, such as how table A is
created from tables B and C.
• Column-level lineage provides precise mappings that show how individual columns are
derived through transformations.
• Code-level lineage links transformations to the specific jobs or scripts that performed
them, together with their execution environments and parameter settings.
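Table- and column-level lineage can both be treated as directed graphs and queried with the same traversal. The sketch below assumes lineage is available as adjacency dictionaries; the table names, column names, and the `upstream` helper are hypothetical and do not correspond to the API of Marquez, OpenLineage, or Apache Atlas.

```python
from collections import deque

# Hypothetical table-level lineage: each table maps to its direct inputs.
TABLE_LINEAGE = {
    "A": ["B", "C"],
    "B": ["raw_orders"],
    "C": ["raw_customers", "raw_orders"],
}

# Hypothetical column-level lineage: (table, column) -> direct source columns.
COLUMN_LINEAGE = {
    ("A", "revenue"): [("B", "amount")],
    ("B", "amount"): [("raw_orders", "amount_cents")],
}


def upstream(node, edges):
    """Return every transitive ancestor of node in breadth-first order."""
    seen, order, queue = {node}, [], deque([node])
    while queue:
        for parent in edges.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(upstream("A", TABLE_LINEAGE))
print(upstream(("A", "revenue"), COLUMN_LINEAGE))
```

Reversing the edges gives the downstream set, which is the query needed for impact analysis before a schema change.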
Tools like Marquez, OpenLineage, and Apache Atlas offer fine-grained APIs and
query interfaces for capturing and retrieving lineage metadata about data movement.
These tools integrate with pipeline orchestrators like Dagster and
Airflow, which gives a clearer view of data movement across platforms [5,63]. Figure 6
provides an overview of lineage tracking and schema evolution, connecting dataset lifecycle
stages with lineage levels, supporting tools, and schema evolution mechanisms.
A key issue at this level is schema evolution, that is, changes to the data structure
over time, such as adding new columns, changing data types, or updating column naming
conventions. In conventional static ETL frameworks, such changes often cause processing
failures or inconsistencies. To support schema evolution, systems should do the following:
• Record and document changes to the schema;
• Validate compatibility across pipeline stages;
• Enable both backward- and forward-compatible reading.
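The compatibility requirement can be made concrete with a small check. The sketch assumes simplified resolution rules loosely modelled on Avro-style schema resolution (a reader tolerates missing columns only if it has a default, and only widening type promotions are allowed); the `Column` type, the `PROMOTIONS` set, and the function names are illustrative assumptions, not the rules of any specific table format.

```python
from typing import NamedTuple


class Column(NamedTuple):
    dtype: str
    has_default: bool = False


# Assumed widening promotions (writer type, reader type); deliberately incomplete.
PROMOTIONS = {("int", "long"), ("int", "double"), ("long", "double"), ("float", "double")}


def can_read(reader, writer):
    """True if data written with `writer` can be read using the `reader` schema."""
    for name, col in reader.items():
        if name in writer:
            written = writer[name].dtype
            if written != col.dtype and (written, col.dtype) not in PROMOTIONS:
                return False  # incompatible type change
        elif not col.has_default:
            return False      # reader expects a column the data lacks
    return True               # extra writer columns are simply ignored


def compatibility(old, new):
    """Backward: new schema reads old data. Forward: old schema reads new data."""
    return {"backward": can_read(new, old), "forward": can_read(old, new)}


v1 = {"id": Column("int"), "name": Column("string")}
v2 = {"id": Column("long"), "name": Column("string"), "email": Column("string", True)}
v3 = {"id": Column("int"), "name": Column("string"), "email": Column("string", True)}

print(compatibility(v1, v2))  # widening id breaks forward compatibility
print(compatibility(v1, v3))  # adding a defaulted column is safe both ways
```

Running such a check at each pipeline stage turns schema drift from a runtime failure into a validation error at deployment time.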
Schema history is now well supported by columnar storage formats such as Parquet,
complemented by transactional table formats like Apache Iceberg and Delta Lake. These
underlying technologies provide the necessary flexibility across environments while
maintaining reliability and reproducibility throughout the process [2,3,10].
More fundamentally, metadata, lineage, and semantic context are cornerstones of an
intelligent integration platform. They provide the information that allows systems to
explore relationships between data, correct inaccuracies, automate transformations, and
strengthen governance [5,71].
Figure 6. Lineage tracking and schema evolution: dataset lifecycle, lineage levels (table, column, and
code), tools (Apache Atlas, Marquez/OpenLineage, Dagster, and Airflow), and schema evolution
with Parquet, Delta Lake, and Apache Iceberg, ensuring compatibility and reproducibility.
6. Performance, Scalability, and Consistency Challenges
The problems involved in operating over varying backends can be broken down
for analysis. These include cost-based query optimisation strategies, materialisation and
caching, and the real-world implications of freshness and eventual consistency. This
section provides insight into the mechanisms and trade-offs that govern performance in
realistic situations.
6.1. Query Optimisation Across Heterogeneous Sources
Federated query execution poses complex optimisation challenges due to the diversity
of underlying sources. Compared with traditional monolithic databases, which provide
the optimiser with full control over schemas, statistical information, and execution plans,
federated and multi-model databases face challenges involving stale metadata, diverse
query capability, and non-uniform performance characteristics across a heterogeneous set
of sources [7,14].
One of the key challenges is deriving cross-source query plans that
minimise data transfer while leveraging local computational resources. For example, a
naive join between a local relational database and a remote NoSQL database can create a
lot of network traffic if intermediate results are not filtered early. Predicate pushdown,
which delegates filters and projections to the underlying systems,
is a common optimisation strategy that reduces the data to be transferred. However,
its effectiveness depends on which query operators each source supports and on the
indexability of each data source [7,60].
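The dependence of pushdown on source capabilities can be sketched as a simple split between predicates a source can evaluate itself and residual predicates the federation engine must apply after transfer. The capability profiles, source names, and `split_predicates` function below are hypothetical and do not mirror the connector interface of any particular engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str        # e.g. "=", "<", "LIKE"
    value: object


# Hypothetical capability profiles: operators each source can evaluate natively.
CAPABILITIES = {
    "relational": {"=", "<", ">", "LIKE"},
    "keyvalue": {"="},
}


def split_predicates(source, predicates):
    """Split predicates into those pushed to the source and those kept local."""
    supported = CAPABILITIES[source]
    pushed = [p for p in predicates if p.op in supported]
    residual = [p for p in predicates if p.op not in supported]
    return pushed, residual


filters = [
    Predicate("country", "=", "GR"),
    Predicate("name", "LIKE", "A%"),
    Predicate("amount", ">", 100),
]
for source in ("relational", "keyvalue"):
    pushed, residual = split_predicates(source, filters)
    print(source, [p.column for p in pushed], [p.column for p in residual])
```

For the key-value source only the equality filter is pushed, so the engine has to fetch every matching row and apply the other two filters locally, which is exactly the transfer cost that pushdown aims to avoid.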
Cost-based optimisation is also complicated by the absence of reliable and consistent
statistical data. Selectivity or cardinality estimation becomes difficult in distributed envi-
ronments, particularly when heterogeneous sources have varying capacities for sampling
or provide metadata in different formats. Some systems use heuristic rules or runtime
feedback to dynamically adjust their execution plans. Recent work explores learned
estimation and optimisation (e.g., learned cardinality estimators and reinforcement
learning-based join ordering) to adapt plans under uncertainty [72–74].
Engines like Presto/Trino and Apache Drill employ federated optimisers that account
for connector-specific capabilities and support adaptive planning but still suffer
slowdowns from remote-source latency, schema mismatches, and transformation
overheads [7,60]. Recent work has also explored machine learning optimisers whose
performance models are learned and can steer join orders and execution paths [74].
6.2. Materialisation and Caching Strategies
To reduce repeated query costs, federated systems increasingly rely on materialisation
and caching to reuse results. These approaches lead to lower latency, reduce the load on
source systems, and improve the predictability of performance in exploratory data analysis
and dashboard-style analytics use cases [75,76].
Materialised views are precomputed query results that are stored and refreshed at
regular intervals in a target system, typically a data warehouse or analytical
repository. They are useful for frequently accessed join paths, filtered aggregates,
or derived measures that are expensive to compute in real time. However, materialised
views incur storage overhead, require consistency management, and can become stale
if not periodically refreshed [75].
In query engines such as Presto/Trino or Dremio, result caching refers to
storing the results of completed or running queries in transient storage or on solid-state
drives (SSDs). This technique significantly reduces the computation cost of similar queries.
Caching intermediate outputs so that subquery results can be reused is very effective
in scenarios involving high query overlap, such as business intelligence
dashboards and multi-tenant environments. Related systems show how to persist and
reuse intermediate results and sub-jobs effectively [7,77].
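A minimal sketch of result caching with time-based expiry is given below. It assumes that queries differing only in whitespace are equivalent and that a fixed time-to-live is an acceptable staleness bound; the `ResultCache` class is hypothetical and is not the caching mechanism of Presto/Trino or Dremio.

```python
import time


class ResultCache:
    """Cache query results for a fixed time-to-live (TTL)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # normalised query text -> (expiry time, rows)

    @staticmethod
    def _key(sql):
        # Naive normalisation (assumption): collapse runs of whitespace only.
        return " ".join(sql.split())

    def get_or_compute(self, sql, run_query):
        key, now = self._key(sql), self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fresh entry: skip the source
        rows = run_query(sql)                  # miss or expired: recompute
        self._entries[key] = (now + self.ttl, rows)
        return rows


calls = []


def run_query(sql):
    calls.append(sql)          # stands in for an expensive remote query
    return [("total", 42)]


now = [0.0]                    # controllable clock for the demonstration
cache = ResultCache(ttl_seconds=10, clock=lambda: now[0])
cache.get_or_compute("SELECT 1", run_query)
cache.get_or_compute("SELECT   1", run_query)   # whitespace variant: cache hit
now[0] = 11.0
cache.get_or_compute("SELECT 1", run_query)     # entry expired: recomputed
print(len(calls))
```

The TTL is the simplest invalidation policy; it bounds staleness but cannot react to source updates, which is why engines pair caching with explicit invalidation or incremental refresh.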
Maintaining metadata on schema definitions, column statistics, and access patterns
is key to performance improvement. It accelerates
query planning and reduces the overhead of repeated schema discovery that is typical when
querying file-based data stores like Parquet or JSON. Despite these advantages, caching
methods must balance data timeliness requirements, storage costs, and the complexity
of cache invalidation. Datasets that change continuously, especially those updated in
real time by streaming or modified externally through transactions, require careful
synchronisation protocols. Modern engines also support materialised federation and
incremental maintenance to balance fast availability of cached views with on-demand
flexibility [75,76].
6.3. Data Freshness and Eventual Consistency Issues
Across integrated systems comprising a range of data stores, including relational
databases, NoSQL databases, files, APIs, and streaming systems, keeping the data fresh and
consistent is a key challenge. Analytical workflows often rely on data ingested or processed
asynchronously, which causes the different sources to become temporally misaligned [78].
Data freshness refers to how closely the consolidated view reflects the current state
of the source datasets. For periodic batch ETL pipelines, freshness depends
on the extraction frequency and the associated update lag. In streaming or near-real-time
systems, freshness depends on factors like ingestion latency, event processing delay, and the
effectiveness of the checkpointing system. Robust watermarks and event-time semantics
are important to quantify and bound lateness [78,79]. Users often have to trade off low
latency against system stability, especially in pipelines with complex transformations or
downstream dependencies.
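The role of watermarks in bounding lateness can be illustrated with a simplified heuristic: the watermark is taken as the maximum event time seen so far minus a fixed allowed lateness, and any event older than the watermark is flagged as late. This is an assumption made for illustration; the function names are hypothetical and production engines use more elaborate watermark strategies.

```python
from datetime import datetime, timedelta


def classify(event_times, allowed_lateness):
    """Label events, in arrival order, as 'on_time' or 'late' against a watermark."""
    max_seen = None
    labelled = []
    for ts in event_times:
        if max_seen is None or ts > max_seen:
            max_seen = ts
        watermark = max_seen - allowed_lateness   # heuristic watermark
        labelled.append((ts, "late" if ts < watermark else "on_time"))
    return labelled


def freshness_lag(now, last_event_time):
    """Lag between wall-clock time and the newest event in the integrated view."""
    return now - last_event_time


t0 = datetime(2025, 1, 1, 12, 0)
arrivals = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=2), t0 + timedelta(minutes=9)]
statuses = [status for _, status in classify(arrivals, timedelta(minutes=5))]
print(statuses)
print(freshness_lag(t0 + timedelta(minutes=15), max(arrivals)))
```

A larger allowed lateness admits more out-of-order events but delays the point at which a window can be finalised, which is the latency and stability trade-off noted above.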
In distributed or federated settings, eventual consistency means that
changes to the data are not immediately reflected across all instances or replicas. It is
common with NoSQL stores like DynamoDB or Cassandra and with design
patterns involving asynchronous replication or microservices [44,80]. An update
to an order’s status in one service, for example, may not be immediately reflected on a
user analytics screen, nor be immediately consistent with customer details
obtained from a separate repository.
Schema changes over time, out-of-order delivery (particularly in streaming
scenarios), and retry protocols, which can cause event duplication or loss, amplify
consistency issues. For integrated systems to maintain analytical integrity, features for
deduplication, temporal windowing, late and out-of-order data handling, and conflict
resolution must therefore be implemented. To address these challenges, more platforms are
adopting versioned data architectures like Apache Iceberg and Delta Lake, which provide
capabilities such as time travel, rollback, and reproducible querying [3,10]. Observability
and data-quality verification frameworks help monitor freshness and correctness in
production data pipelines [81]. Additionally, consistency and freshness should
be treated as design principles rather than just operational issues. Well-performing
analytical systems should set up Service-Level Agreements (SLAs) for data timeliness,
as well as governance patterns that evaluate staleness, revision policies, and high-level
strategies to build confidence in analytical outcomes. Figure 7 synthesises key performance
challenges in heterogeneous analytics and maps them to optimisation levers and mitigating
strategies across engines and storage layers.
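Deduplication and conflict resolution under retries and out-of-order arrival can be sketched as follows. The sketch assumes that each logical event carries a unique `event_id` and an `event_time`, and that conflicts are resolved by keeping the record with the latest event time per key; the field names and the `current_state` function are hypothetical.

```python
def current_state(events):
    """Fold events (in arrival order) into the latest state per key."""
    seen_ids = set()
    state = {}
    for event in events:
        if event["event_id"] in seen_ids:
            continue                       # duplicate delivered by a retry
        seen_ids.add(event["event_id"])
        current = state.get(event["key"])
        # Last-write-wins by event time, not by arrival order.
        if current is None or event["event_time"] > current["event_time"]:
            state[event["key"]] = event
    return state


arrivals = [
    {"event_id": "e1", "key": "order-1", "event_time": 1, "status": "placed"},
    {"event_id": "e3", "key": "order-1", "event_time": 3, "status": "shipped"},
    {"event_id": "e3", "key": "order-1", "event_time": 3, "status": "shipped"},  # retry
    {"event_id": "e2", "key": "order-1", "event_time": 2, "status": "paid"},     # late
]
print(current_state(arrivals)["order-1"]["status"])
```

Resolving by event time keeps the late "paid" event from overwriting the newer "shipped" status, at the cost of trusting source clocks, which is itself an assumption that should be validated per source.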
Figure 7. Performance levers for heterogeneous analytics: optimisation, caching, and fresh-
ness/consistency.
6.4. Federation Overhead and Performance Tuning
Federated access in data lakes and polystores is concrete and implementation-specific.
For example, Ontario and Squerall execute federated query processing over semantic data
lakes by decomposing a SPARQL input into subqueries per dataset, then translating each
subquery to the target system (e.g., Spark SQL for TSV/HDFS) using dataset profiles and
rules. Squerall retrieves from CSV/Parquet, MySQL, Cassandra, and MongoDB through
a mediator (high-level ontologies) and ships data via connectors (two implementations:
Spark and Presto) before joining into the final result [5].
In a broader integration survey, federated query answering is explicitly defined as “a
consistent way of accessing data from sources without duplicating them in a central
repository”, achieved “by using sub-queries that target the data sources within the federation
and evaluating their results based on predefined rules” [11]. These concrete mechanisms
surface where overhead arises: (i) connector capability skew (e.g., which operators can be
translated/pushed and with what plan quality), (ii) planning under partial or per-source
metadata (Ontario “uses the profiles to generate subqueries” and uses “metadata ... to
generate optimised query plans” [5]), and (iii) movement/serialisation when subresults are
shipped back to the mediator for final assembly.
In practice, the choice of connector matters: the same mediator (Squerall) reports two
runtime stacks “with different data connectors: Spark and Presto” [5], anticipating distinct
pushdown, transfer, and scheduling behaviour. At the orchestration layer, LLM interfaces
increasingly appear in pipelines. However, even strong models show execution gaps in
data tasks (e.g., GPT-4 text-to-SQL execution accuracy of 54.89% vs. human 92.96%), which
cautions against uncritical delegation of query planning/translation to LLMs [19].
Tuning, in turn, follows those concrete pain points.
• First, connector-aware pushdown is not optional but infrastructural: Ontario’s use
of dataset profiles and Squerall’s mediator mapping illustrate that federation layers
must know source capabilities to drive translation and decide where to execute
selections/aggregations/joins [5,11].
• Second, planning with partial statistics can still be effective if the mediator exploits
metadata to derive good subquery decompositions and join orders (Ontario uses
“metadata ... to generate optimised query plans” [5]) and, when available, sampling or
progressive execution to refine estimates (see also systematisations in [6]).
• Third, movement minimisation is a transport problem: the choice of
columnar/vectorised paths and batching reduces per-tuple overhead. Contemporary
evaluations of columnar runtimes and data paths emphasise the sustained throughput
advantages of vectorised processing and columnar layouts for scans and aggregations [1,2].
• Fourth, materialisation and incremental maintenance mitigate repeated cross-source
joins. Rather than fully recomputing federated joins/aggregations, incremental
frameworks (e.g., DBSP) maintain views by applying deltas to compiled differential
programs, reducing refresh latency and source load in steady state [76].
• Finally, hybridisation (persisting “hot” integrated slices in a lakehouse/warehouse
while federating the long tail) follows the storage–execution split documented across
recent lakehouse discussions [5,82].
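The idea behind the fourth lever, maintaining a view from deltas instead of recomputing it, can be shown with a toy aggregate. This is a deliberately simplified illustration of delta-based maintenance and not an implementation of DBSP; the `SumView` class and the weighted-row convention (+1 for an insert, -1 for a delete) are assumptions made for the sketch.

```python
from collections import defaultdict


class SumView:
    """Materialised SUM(amount) GROUP BY key, maintained from weighted deltas."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, delta):
        """Apply (key, amount, weight) rows; weight is +1 for insert, -1 for delete."""
        for key, amount, weight in delta:
            self.totals[key] += weight * amount
            self.counts[key] += weight
            if self.counts[key] == 0:      # group has no remaining rows
                del self.totals[key], self.counts[key]

    def snapshot(self):
        return dict(self.totals)


view = SumView()
view.apply([("eu", 100, +1), ("us", 50, +1)])   # initial load
view.apply([("eu", 30, +1), ("us", 50, -1)])    # one insert, one delete
print(view.snapshot())
```

Each refresh touches only the groups affected by the delta, so the cost scales with the size of the change rather than with the size of the joined sources.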
Table 11 maps these overheads to tuning levers with the precise loci (translation,
planning, transfer, and maintenance) where they act.
Table 11. Concrete federation overheads and tuning levers grounded in reported systems.
| Overhead Locus (Where It Appears) | Tuning Lever (How It Is Mitigated) |
|---|---|
| Connector translation gaps and heterogeneous engines (Spark/Presto variants in Squerall) | Connector-aware planning, dialect rewriting, and per-connector rules/pushdown (Ontario’s profile-driven subqueries and Squerall’s mediator mapping) [5]. |
| Partial statistics; mediator lacks global distributions | Metadata-guided plan generation (Ontario uses metadata to generate optimised plans), progressive refinement, and survey-catalogued strategies [5,6]. |
| Row-oriented transfer and fine-grained serialisation | Vectorised/columnar paths and batching for sustained scan/aggregation throughput [1,2]. |
| Repeated cross-source joins; freshness vs. latency | Materialised views of “hot” joins with incremental refresh (differential/delta maintenance) [76]. |
| Orchestration via LLMs (NLQ/translation) | Guardrails: verified translations and fallbacks; LLM use where determinism is not critical (noting 54.89% text-to-SQL execution accuracy for GPT-4) [19]. |
| Workload skew across sources | Hybridisation (persist stable, high-value slices in lakehouse/warehouse and federate the remainder) [5,11,82]. |
Case focus (AAS–ECLASS industrial federation). In a manufacturing integration
where AAS submodels act as the mediator and ECLASS serves as the external dictionary,
the authors of [31] report two very specific performance levers. (i) Blocking to cut candidate
space: Before pairwise matching, the system narrows candidates via approximate nearest
neighbour (ANN) search over embeddings (open-source SFR-Embedding-Mistral) with Faiss.
In their AAS–ECLASS setting, the dictionary spans 27,423 entries, so blocking is
operationally decisive for both compute and downstream join fan-out. (ii) Classifier choice
as a speed/accuracy knob: a fine-tuned generative LLM achieves slightly better results,
whereas an encoding-based classifier enables much faster inference, and the fine-tuned LLM
surpasses BERT variants and GPT-4+ICL on entity-matching benchmarks.
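The blocking step can be illustrated with an exact, brute-force stand-in for ANN search: rank dictionary entries by cosine similarity to the query embedding and keep only the top k for pairwise matching. The toy three-dimensional vectors, entry names, and `block` function are hypothetical; the reported system uses learned embeddings and Faiss, which this sketch does not reproduce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block(query_vec, dictionary, k):
    """Return the k dictionary entries most similar to the query embedding."""
    ranked = sorted(dictionary.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings (assumption): real embeddings have hundreds or thousands of dimensions.
dictionary = {
    "temperature": (1.0, 0.0, 0.0),
    "max_temperature": (0.9, 0.1, 0.0),
    "voltage": (0.0, 1.0, 0.0),
    "colour": (0.0, 0.0, 1.0),
}
candidates = block((1.0, 0.05, 0.0), dictionary, k=2)
print(candidates)
```

With k fixed, the expensive pairwise classifier is invoked on k candidates per attribute instead of the full dictionary, which is where the compute saving comes from; an ANN index replaces the exhaustive ranking when the dictionary is large.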
In the ER stage feeding federation, the authors of [15] show that batching
demonstrations and questions (BATCHER) is very cost-effective for ER, outperforming both
fine-tuned PLMs and manually designed LLM prompting. This directly trims external
API overhead and stabilises latency. With respect to LLMs driving orchestration or NLQ,
another work [19] documents execution-level gaps (GPT-4 text-to-SQL of 54.89%), arguing
for deterministic translation paths or verification stages in production. Together with
mediator-level pushdowns (Ontario/Squerall) and incremental materialisation (DBSP),
these concrete techniques reduce shipped data, avoid misplaced computation, and keep
the AAS federation responsive under heterogeneous source capabilities [5,11,76].
7. Applications and Case Studies
This section grounds the concepts in practice (applications), including enterprise
lake/lakehouse deployments, scientific data integration, and public sector pipelines. It
highlights domain-specific constraints (e.g., governance and standards) and architectural
patterns that recur, showing how the reviewed methods translate into outcomes. Table 12
presents a cross-walk from the taxonomy to the three abovementioned application domains,
indicating for each taxonomy element where it is instantiated (“where used”) and where
it is analysed in the text (“where discussed”). This makes the taxonomy operational and
allows readers to locate concrete occurrences and the corresponding discussion quickly.
Table 12. Taxonomy–applications cross-walk (where used and where discussed).
| Taxonomy Element | Applications (Role and Where in Text) |
|---|---|
| Schema matching and mapping (Section 2.2) | Harmonise identifiers/attributes across sources. Enterprise lakehouse keys via SQL/ELT (Section 7.1). Public registries (Section 7.3). |
| Entity resolution and fusion (Section 2.3) | Deduplicate/link records. Unified entities. Enterprise CRM+transactions (Section 7.1). Public person/org linkage (Section 7.3). |
| Semantic enrichment and ontologies (Section 5.2) | Disambiguation of meaning, standards-based queries, scientific knowledge graphs (Section 7.2), and public SDMX alignment (Section 7.3). |
| Metadata catalogues and lineage (Sections 5.1 and 5.3) | Discoverability, governance, reproducibility, enterprise governance (Section 7.1), and scientific provenance (Section 7.2). |
| Storage models (row/column/NoSQL; Section 3) | Fit workloads, hybrid query plans, enterprise columnar lakehouse, and scientific/public doc/graph as adjunct (Sections 7.1–7.3). |
| Federated SQL and virtualisation (Section 4.2) | Cross-store analytics without relocation, enterprise Trino-based joins (Section 7.1), and public inter-agency dashboards (Section 7.3). |
| Schema-on-read/write and hybrid (Section 4.3) | Contracts vs. flexibility, canonicalisation (e.g., dates), and public regulated pipelines (Section 7.3). |
| Performance levers (Section 6) | Cost/latency optimisation, freshness, dashboards, and SLAs (Section 7). |
7.1. Enterprise Data Lakes and Lakehouses
In large companies, the need for the consolidation of heterogeneous internal and
external data silos has compelled the large-scale adoption of data lakes and lakehouse
platforms. Traditional data warehouses are limited by rigid schemas, high scaling costs,
and tedious loading. Data lakes, on the other hand, provide for the integration of raw, semi-
structured, and structured data from various business areas, such as customer relationship
management systems, transaction logs, sensor data, and external service providers, into
horizontally scaled object storage solutions like Amazon S3, Azure Data Lake Storage, or
Google Cloud Storage [5].
To move beyond simple data storage, an increasing number of organisations are
adopting lakehouses, unified platforms that combine the flexibility of data lakes with
functionalities characteristic of data warehouses, like ACID transactions, temporal
capabilities, and schema enforcement policies. Products like Delta Lake [3], Apache Hudi,
and Iceberg [10] enable end-to-end querying, incrementally updated data, and support for
schema evolution over large datasets [13].
One such example is a worldwide retail company that aggregates sales, inventory,
customer feedback, and supply chain data from over 50 systems within an enterprise-wide
analytical framework. Utilising Apache Spark for distributed computation, Presto [7,55]
for federated querying, and Apache Atlas for metadata management, the company enables
batch and real-time analytics with traceability and governance across multiple regions
and departments.
7.2. Scientific Data Integration
Scientific fields often operate at the frontier of data integration, which requires
combining heterogeneous datasets across spatial, temporal, semantic, and modality
dimensions. In fields like life sciences, environmental science, physics, and social
sciences, researchers need to integrate a range of sources, such as experimental
observations, sensor readings, simulation software, domain ontologies, and research
articles. These sources vary in form, granularity, semantic content, and
update frequency, presenting a major challenge for integration and
subsequent use.
To address this problem, modern scientific data infrastructure increasingly relies
on methodologies that include semantic integration, standardised metadata frameworks,
and distributed computing systems [83,84]. To ensure conceptual consistency across datasets,
ontologies and controlled vocabularies are used, thereby improving the accuracy of
alignment and aggregation of linked variables. Various tools, such as ontology mapping
engines, semantic data catalogues, and knowledge graphs, are used to support
the interlinking of data across repositories, instruments, and organisations.
In addition, several disciplines have adopted modular, workflow-based systems
for data ingestion, annotation, and transformation. These systems support
reproducible analyses, versioning, and shared curation, key attributes in domains where
datasets change over time and user bases are large and distributed. The widespread
adoption of cloud-native storage formats (e.g., Parquet and Zarr [85,86]), spatio-temporal
indexing, and interoperable Application Programming Interfaces (APIs) supports greater
scalability for querying and integrating data from a wide variety of structured and
unstructured sources.
Data integration pipelines enable many uses, including genomic discovery, climate
modelling, astronomy, and disease surveillance. The generated results are often determined
not only by the volume or speed of data but also by semantic matching effectiveness, prove-
nance annotation, and contextualisation of relevance to a particular field—highlighting the
essential need for strong, flexible integration systems to fuel scientific understanding.
7.3. Cross-Domain Data Pipelines in Public Sector Analytics
Public sector organisations are increasingly implementing standardised data platforms
to enable evidence-based policymaking, improve the management of service delivery, and
ensure transparency. An important feature of analytics in public sector organisations is
the need to consolidate data from diverse silos across the organisation, such as education,
health, employment, tax, and mobility.
For instance, a national statistical office might bring together data from censuses, hos-
pitalisation rates, school performance measures, and social programs’ enrolment rates for
the purpose of understanding disparities or designing targeted interventions. Data sources
can have varying identifiers, structures, update frequencies, and legally mandated access
restrictions. Data integration methods must be sensitive to the needs for anonymisation,
possible errors in record matching, and auditability, which are often governed
by strict data protection legislation (e.g., GDPR) [87].
To support effective integration pipeline management, a range of tools such as
OpenRefine [88], CKAN, and custom data warehouses with lineage traceability and role-based
access controls are used. Additionally, using a semantic standard such as the Statistical
Data and Metadata Exchange (SDMX) [89], together with linked data-based methodologies,
supports consistency of definitions across different agencies.
Interoperability with open data portals, city virtualisations, and peer-to-peer
dashboards between and among different agencies is increasingly reliant upon integration
platforms that have real-time capabilities. Such platforms
integrate dynamic datasets such as traffic flow patterns, energy consumption, and pollution
levels with stable statistical indicators. These conditions highlight the imperative to develop
data management policies that are operational across technical, institutional, and legal
levels, pursuing a balance of interoperability, governance, and scalability.
Figure 8 provides an architectural comparison of integration strategies across eight layers,
showing how they are instantiated across enterprise, scientific, and public sector pipelines.
Figure 8. Architectural comparison across eight layers—ingestion, storage, processing/query, meta-
data/lineage, semantics/standards, tools and enablers, access and use, and governance/policy—for
the enterprise, scientific, and public sector integration contexts. The figure situates representative
technologies and practices in each layer.
8. Future Directions
Future directions point to analytically sound frameworks, metadata-rich and
self-explanatory workflows, and integration methods enriched with AI, like automatic mapping,
entity-relationship modelling, and semantic reasoning. These can be leveraged to minimise
manual effort while maintaining human supervision. This section outlines
a strategic blueprint for producing more composable, governable, and intelligent
data platforms.
8.1. Towards Unified Analytical Fabrics
Ongoing innovation in integration and storage demands unified analytical frameworks
that enable transparent access, governance, and processing of data assets. A major goal of
these frameworks is to converge the benefits of data warehouses, data lakes, and operational
stores. This is achieved through a common query and metadata layer that is independent
of the data format, geographic distribution, and movement rate [13,14,90].
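As a rough illustration of a metadata layer that hides the physical data format from the consumer, the following minimal Python sketch resolves logical dataset names through a small catalogue and dispatches to a format-specific reader. The `Catalog` class, the reader functions, and the dataset names are hypothetical and are not taken from any of the cited systems. In-memory strings stand in for files or object-store locations.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def read_csv(payload: str) -> Iterator[Row]:
    """Yield rows from CSV text with a header line."""
    yield from csv.DictReader(io.StringIO(payload))


def read_json_lines(payload: str) -> Iterator[Row]:
    """Yield rows from newline-delimited JSON, normalising values to strings."""
    for line in payload.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


# Format name -> reader. New formats are added here, not in consumer code.
READERS: Dict[str, Callable[[str], Iterator[Row]]] = {
    "csv": read_csv,
    "jsonl": read_json_lines,
}


@dataclass
class CatalogEntry:
    fmt: str
    payload: str  # assumption: stands in for a path or object-store URI


class Catalog:
    """Hypothetical metadata layer mapping logical names to physical data."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, name: str, fmt: str, payload: str) -> None:
        if fmt not in READERS:
            raise ValueError(f"unsupported format: {fmt}")
        self._entries[name] = CatalogEntry(fmt, payload)

    def scan(
        self, name: str, predicate: Callable[[Row], bool] = lambda row: True
    ) -> List[Row]:
        """Read a dataset by logical name, regardless of its storage format."""
        entry = self._entries[name]
        return [row for row in READERS[entry.fmt](entry.payload) if predicate(row)]


catalog = Catalog()
catalog.register("sales_eu", "csv", "region,amount\nEU,10\nEU,5\n")
catalog.register("sales_us", "jsonl", '{"region": "US", "amount": 7}\n')


def total_amount(names: List[str]) -> int:
    """Aggregate across datasets without knowing how each one is stored."""
    return sum(int(row["amount"]) for name in names for row in catalog.scan(name))
```

The consumer-facing function `total_amount` never mentions CSV or JSON. In a real fabric, this role would be played by a shared catalogue and query engine rather than a dictionary of readers.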
Future analytical fabrics will provide for hybrid execution models that can combine
batch, streaming, and federated operations. They will employ declarative models of
metadata for automatic configuration of workflows and increased runtime performance.
Rather than building monolithic platforms, organisations will look towards the adoption of
composable architectures with modular components for data ingestion, cataloguing,
transformation, and access control, networked together with open standards and APIs [13,14].
Commercial offerings and open-source projects, including Dataplex (Google),
Data Fabric (IBM), and LakeFS, are at the forefront of this space. Subsequent frameworks
will evolve further in terms of features, including data lineage tracing, fine-grained access
control, real-time monitoring, and multi-cloud capabilities. Such frameworks will be
especially important for decentralised organisations and collaborative institutions, which
require seamless and secure integration across geopolitical and technical
boundaries [10].
8.2. Metadata-Driven and Self-Describing Pipelines
To counter the intrinsic brittleness and labour-intensive character of existing inte-
gration processes, the future for data management hinges on pipelines that are both self-
describing and metadata-driven. Such next-generation pipelines can autonomously infer,
propagate, and adapt to schema and data profile changes using integrated metadata and
declarative policies [83,91].
In this context, metadata moves from being an auxiliary resource to being an essential
element of end-to-end pipeline design. Pipeline construction becomes deliberately
schema-aware, context-adaptive, and version-controlled. Tools like dbt, Apache Hop, and
Tecton enable developers to express partially declarative pipeline definitions that adapt to
source schemas, data quality measures, and business rule logic [91].
Self-describing data formats such as Parquet, Avro, and Arrow, which carry their schema
together with the data, make this progress possible by allowing pipelines to inspect and
verify data structures dynamically at runtime. Prior evaluations of columnar file formats reveal
trade-offs between Parquet and ORC. Additionally, the development of data contracts, i.e.,
structured agreements between producers and consumers that specify structure,
semantics, and SLAs, reinforces this trend [2,41,92].
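To make the notion of a data contract more concrete, the following minimal Python sketch expresses a contract as a declarative list of field specifications and checks incoming records against it. The names `FieldSpec`, `DataContract`, and `validate`, as well as the `orders` dataset, are hypothetical. Production contracts would typically also cover semantics and SLAs, which are omitted here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    fields: Tuple[FieldSpec, ...]


def validate(contract: DataContract, records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable violations; an empty list means the batch conforms."""
    violations: List[str] = []
    for i, record in enumerate(records):
        for spec in contract.fields:
            if spec.name not in record:
                violations.append(f"row {i}: missing field '{spec.name}'")
                continue
            value = record[spec.name]
            if value is None:
                if not spec.nullable:
                    violations.append(f"row {i}: null in non-nullable '{spec.name}'")
            elif not isinstance(value, spec.dtype):
                violations.append(
                    f"row {i}: '{spec.name}' expected {spec.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


# Hypothetical contract agreed between the producer and consumers of "orders".
orders_contract = DataContract(
    dataset="orders",
    version="1.0",
    fields=(
        FieldSpec("order_id", int),
        FieldSpec("note", str, nullable=True),
    ),
)
```

A producer could run `validate(orders_contract, batch)` before publishing and refuse to publish a batch with a non-empty result, so that consumers are not exposed to silently malformed data.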
In the future, automated testing, schema drift detection, lineage impact estimation,
and semantic reconciliation are expected to be built-in capabilities of data pipelines. That
shift should promote reuse, modularity, and resilience, which are the key properties for the
long-term maintainability of complex analytical frameworks [83,91].
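Schema drift detection, one of the capabilities listed above, can be sketched in a few lines. The example below compares an expected schema with the schema observed in a new batch and classifies the differences. The representation of a schema as a mapping from column name to type name, and the rule that only removed or retyped columns count as breaking, are simplifying assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple

Schema = Dict[str, str]  # column name -> logical type name


class Drift(NamedTuple):
    added: List[str]
    removed: List[str]
    retyped: List[str]


def detect_drift(expected: Schema, observed: Schema) -> Drift:
    """Classify differences between the expected and the observed schema."""
    expected_cols, observed_cols = set(expected), set(observed)
    added = sorted(observed_cols - expected_cols)
    removed = sorted(expected_cols - observed_cols)
    retyped = sorted(
        col for col in expected_cols & observed_cols if expected[col] != observed[col]
    )
    return Drift(added, removed, retyped)


def is_breaking(drift: Drift) -> bool:
    """Assumption: new columns are tolerated; removals and type changes are not."""
    return bool(drift.removed or drift.retyped)
```

A pipeline could call `detect_drift` on each incoming batch, continue automatically on additive changes, and pause for review when `is_breaking` returns `True`.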
8.3. AI-Assisted Integration and Auto-Schema Mapping
One high-profile and challenging area of exploration is the use of AI to increase the
automation of data integration processes. Schema mapping, ER, and the definition of
transformation rules depend heavily on human intervention, which makes them error-prone
and difficult to scale. Natural language modelling, representation learning, and
foundation models have opened new possibilities for context-aware,
intelligent assistance with integration processes [90].
AI-based tools can suggest field mappings, design transformation scripts, and identify
semantic relationships by analysing schemas, instance values, and external knowledge
graphs. Some specific tools like AutoMapper, Google Cloud Dataprep, and those using
OpenAI Codex can execute interactive mapping and transformation according to natural
language commands [90].
Furthermore, embedding-based matching methods, using models such as BERT or graph
embeddings, offer effective means of reconciling heterogeneous schemas, especially in
applications involving inconsistent labelling or heterogeneous data forms. Additionally,
the integration of active learning within an interface incorporating human feedback can
improve mapping accuracy through user contributions [93,94]. From a pipeline perspective,
LLMs are increasingly positioned as programmable interfaces for data pipelines, working
alongside KGs, XAI, and AutoML to mediate discovery, transformation, and governance [19].
Early orchestration prototypes further show LLM-assisted DAG synthesis for data enrichment
pipelines [20], indicating a path from point tools to agentic, metadata-aware
integration flows.
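The interplay between automatic matching and human feedback can be sketched as follows. For simplicity, this example uses character-level string similarity from the Python standard library as a stand-in for the embedding-based similarity discussed above; it is not an implementation of BERT-based matching or of any cited tool. The thresholds and column names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalise(name: str) -> str:
    """Lower-case a column name and drop common separators."""
    return name.lower().replace("_", "").replace("-", "").replace(" ", "")


def similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity: string similarity in [0, 1]."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def propose_mappings(
    source: List[str],
    target: List[str],
    accept: float = 0.85,  # assumption: at or above this, map automatically
    review: float = 0.50,  # assumption: between thresholds, ask a human
) -> Tuple[Dict[str, str], List[Tuple[str, str, float]]]:
    """Split candidate mappings into auto-accepted and needs-human-review."""
    accepted: Dict[str, str] = {}
    needs_review: List[Tuple[str, str, float]] = []
    for src in source:
        best = max(target, key=lambda tgt: similarity(src, tgt))
        score = similarity(src, best)
        if score >= accept:
            accepted[src] = best
        elif score >= review:
            needs_review.append((src, best, round(score, 2)))
        # Below the review threshold, no mapping is proposed at all.
    return accepted, needs_review


accepted, needs_review = propose_mappings(
    source=["customer_id", "e-mail", "cust_name", "dob"],
    target=["CustomerID", "email", "CustomerName", "date_of_birth"],
)
```

In this run, exact matches after normalisation are accepted automatically, `cust_name` is routed to a reviewer, and `dob` receives no proposal because surface similarity cannot relate it to `date_of_birth`. The last case is the kind of gap that semantic embeddings or external knowledge are intended to close.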
As foundation models evolve for structured data manipulation, future systems will
ingest heterogeneous datasets and identify their internal structure and semantics. They will
automatically generate integration frameworks, quality assessments, and domain-specific
interpretations. This will revolutionise insight extraction and greatly lessen the challenges
involved in complex data integration projects [90].
However, concerns about explainability, control, bias, and governance remain. AI-assisted
integration should be designed to augment human capacity rather than replace it,
incorporating transparency, auditability, and the ability to allow human
intervention across all automated decision-making systems.
8.4. Ethical and Regulatory Directions
Beyond the challenges inherent to architecture and performance, future integration systems
will also need to treat ethical and regulatory considerations as part of their design.
In terms of data privacy and security, Regulation (EU) 2016/679
(GDPR) provides a critical framework, requiring data controllers and processors to adhere
to accountability standards, respect data subject rights (e.g., data erasure and
portability), and comply with purpose limitation in all processing operations, including
integration [95]. Moreover, the recently adopted EU Artificial Intelligence Act (Regulation
(EU) 2024/1689) provides mandatory rules for AI systems, requiring explainability, risk
assessment, and human oversight, which are matters of particular relevance to integration
involving learned or generative schema matching or entity resolution techniques [96]. In
addition, domain-specific laws (e.g., HIPAA for healthcare and PSD2 for financial
services) place special limits on the use and sharing of integrated data.
From both a technical and an ethical standpoint, the propagation of bias through
integration pipelines poses serious risks. For instance, schema matching or entity
resolution processes that rely on a model learned from a biased dataset may inadvertently
retain and propagate those biases. Overcoming such challenges requires combining
explainability and verification with human oversight and auditable
mechanisms, particularly for transformations produced by automated inference [97]. In
order to build trust and accountability, prospective integration systems should include
“governance-aware” capabilities, including fine-grained lineage tracking, audit history,
and human oversight at critical decision points. Such design principles bolster
transparency, support compliance by design, and reconcile technological advancement with
legal and societal responsibility, ultimately combining interoperability, scalability,
and robust governance.
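The three governance-aware capabilities named above (lineage tracking, audit history, and human oversight at decision points) can be illustrated with a minimal Python sketch. All class and function names, the confidence threshold, and the dataset identifiers are hypothetical; the sketch shows one possible shape of such a mechanism and not a reference design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class LineageEvent:
    step: str
    inputs: Tuple[str, ...]
    output: str
    confidence: float
    approved_by: Optional[str]
    timestamp: str


class ApprovalRequired(Exception):
    """Raised when an automated decision needs a human sign-off."""


@dataclass
class AuditLog:
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, step, inputs, output, confidence, approved_by=None):
        event = LineageEvent(
            step, tuple(inputs), output, confidence, approved_by,
            datetime.now(timezone.utc).isoformat(),
        )
        self.events.append(event)  # append-only audit history
        return event

    def upstream(self, dataset: str) -> Set[str]:
        """Transitively collect every dataset that `dataset` was derived from."""
        found: Set[str] = set()
        frontier = [dataset]
        while frontier:
            current = frontier.pop()
            for event in self.events:
                if event.output == current:
                    for src in event.inputs:
                        if src not in found:
                            found.add(src)
                            frontier.append(src)
        return found


def apply_mapping(log, step, inputs, output, confidence,
                  reviewer: Optional[str] = None, threshold: float = 0.9):
    """Record an automated transformation; low confidence requires a reviewer."""
    if confidence < threshold and reviewer is None:
        raise ApprovalRequired(f"{step}: confidence {confidence} < {threshold}")
    return log.record(step, inputs, output, confidence, approved_by=reviewer)
```

With this shape, an auditor can ask which sources fed a given output via `upstream`, and every low-confidence automated decision in the log carries the identity of the person who approved it.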
8.5. Limitations of This Review
This work is a structured survey rather than a systematic review. It aims for represen-
tative, cross-layer coverage rather than exhaustiveness. New empirical benchmarks were
not established. Performance assessments reflect published studies and production reports.
The 2015–2025 focus can introduce recency bias and version drift for rapidly evolving en-
gines and connectors. Where feasible, results were triangulated with contemporary surveys.
In addition, non-archival/vendor whitepapers were excluded, which may have led to the
omission of operational detail. Finally, generalisation across domains is limited, i.e., the
examples in Section 7 are indicative rather than comprehensive, and some modalities (e.g.,
unstructured media) fall outside the scope of this work.
9. Conclusions
The current study explored the evolving dynamics of data integration and storage in
analytics systems, with special focus on architectures, tools, and methodologies that target
improved performance, scalability, and semantic consistency. A comparative evaluation of
primary storage models, such as row-store and column-store systems, NoSQL databases,
and lakehouse architectures, was conducted in terms of their suitability for different
workloads. Integration patterns were studied in the scenarios of ETL/ELT pipelines,
federated query workloads, and metadata-centric orchestration. In addition, the importance
of semantic enrichment, data provenance, and schema evolution was highlighted as key
enablers for the building of fault-tolerant and traceable data pipelines. Moreover, major
challenge areas related to query optimisation, caching mechanisms, data freshness, and
consistency were discussed, in addition to real-world applications in enterprise, scientific,
and public sector scenarios.
To build future-ready analytical infrastructure, practitioners should adopt modular,
metadata-driven architectures; leverage open standards; and invest in governance-aware
integration. Schema flexibility and tracking of lineage and reproducibility have to be
balanced with each other, and performance tuning has to consider storage configuration,
along with federation overhead. Semantic interoperability and AI-powered toolsets will
play an increasingly important role in integration cost reduction and streamlined self-
adaptive pipelines. Data teams can ensure scalability, agility, and fault tolerance within
complex analytical environments by aligning technical design with organisational needs,
along with imposed regulatory constraints.
Finally, this work offers a broad and multifaceted overview that connects integration
methods to storage solutions. It combines a comparative approach, which highlights
trade-offs among ETL, ELT, virtualisation, and federated pushdown, with a metadata- and
lineage-focused approach that turns performance and consistency controls into useful design
tools across domains. This integration is conceived as a decision aid in the construction of
regulated hybrid pipelines that can sustain performance and reproducibility in the face of
rapidly changing schemas and workloads. In contrast to prior surveys that concentrated
separately on lakes, federation, or lakehouses, this contribution unifies
integration mechanisms with storage and governance under actionable, reproducible
workflows, augmented by the latest AI-assisted methods.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The original contributions presented in this study are included in the
article. Further inquiries can be directed to the corresponding author.
Acknowledgments: The author would like to thank the anonymous reviewers for their constructive
feedback, which improved the paper substantially.
Conflicts of Interest: The author declares no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
ACID Atomicity, Consistency, Isolation, and Durability
AI Artificial Intelligence
API Application Programming Interface
AWS Amazon Web Services
CIDOC CRM International Committee for Documentation Conceptual Reference Model
CPU Central Processing Unit
ELT Extract–Load–Transform
ER Entity Resolution
ETL Extract–Transform–Load
FIBO Financial Industry Business Ontology
GDPR General Data Protection Regulation
HDFS Hadoop Distributed File System
HL7 Health Level Seven
JSON JavaScript Object Notation
MDM Master Data Management
NoSQL Not Only SQL
OLAP Online Analytical Processing
OLTP Online Transaction Processing
ORC Optimised Row Columnar
OWL Web Ontology Language
RBAC Role-Based Access Control
RDF Resource Description Framework
REST Representational State Transfer
S3 Simple Storage Service
SDMX Statistical Data and Metadata Exchange
SIGMOD ACM Special Interest Group on Management of Data
VLDB Very Large Data Bases (Conference)
ICDE IEEE International Conference on Data Engineering
TKDE IEEE Transactions on Knowledge and Data Engineering
VLDBJ The VLDB Journal
SLA Service Level Agreement
SPARQL SPARQL Protocol and RDF Query Language
SQL Structured Query Language
SWEET Semantic Web for Earth and Environmental Terminology
XML Extensible Markup Language
References
1.
Liu, C.; Pavlenko, A.; Interlandi, M.; Haynes, B. A Deep Dive into Common Open Formats for Analytical DBMSs. Proc. VLDB
Endow. 2023,16, 3044–3056. [CrossRef]
2.
Zeng, X.; Hui, Y.; Shen, J.; Pavlo, A.; McKinney, W.; Zhang, H. An Empirical Evaluation of Columnar Storage Formats. Proc.
VLDB Endow. 2023,17, 148–161. [CrossRef]
3.
Armbrust, M.; Das, T.; Sun, L.; Yavuz, B.; Zhu, S.; Murthy, M.; Torres, J.; van Hovell, H.; Ionescu, A.; Łuszczak, A.; et al. Delta
Lake: High-performance ACID table storage over cloud object stores. Proc. VLDB Endow. 2020,13, 3411–3424. [CrossRef]
4.
Abadi, D.J.; Madden, S.R.; Hachem, N. Column-stores vs. row-stores: How different are they really? In Proceedings of the
2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 967–980.
[CrossRef]
5.
Hai, R.; Koutras, C.; Quix, C.; Jarke, M. Data Lakes: A Survey of Functions and Systems. IEEE Trans. Knowl. Data Eng. 2023,
35, 12571–12590. [CrossRef]
6.
Gu, Z.; Corcoglioniti, F.; Lanti, D.; Mosca, A.; Xiao, G.; Xiong, J.; Calvanese, D. A systematic overview of data federation systems.
Semant. Web 2024,15, 107–165. [CrossRef]
7.
Sun, Y.; Meehan, T.; Schlussel, R.; Xie, W.; Basmanova, M.; Erling, O.; Rosa, A.; Fan, S.; Zhong, R.; Thirupathi, A.; et al. Presto: A
Decade of SQL Analytics at Meta. Proc. ACM Manag. Data 2023,1, 1–25. [CrossRef]
8.
Potharaju, R.; Kim, T.; Song, E.; Wu, W.; Novik, L.; Dave, A.; Acharya, V.; Dhody, G.; Li, J.; Ramanujam, S.; et al. Hyperspace: The
Indexing Subsystem of Azure Synapse. Proc. VLDB Endow. 2021,14, 3043–3055. [CrossRef]
9.
Dong, X.L.; Srivastava, D. Big Data Integration; Synthesis Lectures on Data Management; Springer Nature Switzerland AG: Cham,
Switzerland, 2015. [CrossRef]
10.
Okolnychyi, A.; Sun, C.; Tanimura, K.; Spitzer, R.; Blue, R.; Ho, S.; Gu, Y.; Lakkundi, V.; Tsai, D. Petabyte-Scale Row-Level
Operations in Data Lakehouses. Proc. VLDB Endow. 2024,17, 4159–4172. [CrossRef]
11.
Alma’aitah, W.Z.; Quraan, A.; AL-Aswadi, F.N.; Alkhawaldeh, R.S.; Alazab, M.; Awajan, A. Integration Approaches for
Heterogeneous Big Data: A Survey. Cybern. Inf. Technol. 2024,24, 3–20. [CrossRef]
12.
Pedreira, P.; Erling, O.; Basmanova, M.; Wilfong, K.; Sakka, L.; Pai, K.; He, W.; Chattopadhyay, B. Velox: Meta’s unified execution
engine. Proc. VLDB Endow. 2022,15, 3372–3384. [CrossRef]
13.
Schneider, J.; Gröger, C.; Lutsch, A.; Schwarz, H.; Mitschang, B. The Lakehouse: State of the Art on Concepts and Technologies.
SN Comput. Sci. 2024,5, 449. [CrossRef]
14.
Kaoudi, Z.; Quiané-Ruiz, J.A. Unified Data Analytics: State-of-the-Art and Open Problems. Proc. VLDB Endow. 2022,15, 3778–3781.
[CrossRef]
15.
Fan, M.; Han, X.; Fan, J.; Chai, C.; Tang, N.; Li, G.; Du, X. Cost-Effective In-Context Learning for Entity Resolution: A Design
Space Exploration. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The
Netherlands, 13–16 May 2024; pp. 3696–3709. [CrossRef]
16.
Zhang, Z.; Zeng, W.; Tang, J.; Huang, H.; Zhao, X. Active in-context learning for cross-domain entity resolution. Inf. Fusion 2025,
117, 102816. [CrossRef]
17.
Taboada, M.; Martinez, D.; Arideh, M.; Mosquera, R. Ontology matching with Large Language Models and prioritized depth-first
search. Inf. Fusion 2025,123, 103254. [CrossRef]
18.
Babaei Giglou, H.; D’Souza, J.; Engel, F.; Auer, S. LLMs4OM: Matching Ontologies with Large Language Models. In Proceedings
of the Semantic Web: ESWC 2024 Satellite Events, Hersonissos, Greece, 26–30 May 2024; Meroño Peñuela, A., Corcho, O., Groth,
P., Simperl, E., Tamma, V., Nuzzolese, A.G., Poveda-Villalón, M., Sabou, M., Presutti, V., Celino, I., et al., Eds.; Springer: Cham,
Switzerland, 2025; pp. 25–35.
19.
Barbon Junior, S.; Ceravolo, P.; Groppe, S.; Jarrar, M.; Maghool, S.; Sèdes, F.; Sahri, S.; Van Keulen, M. Are Large Language
Models the New Interface for Data Pipelines? In Proceedings of the International Workshop on Big Data in Emergent Distributed
Environments, Santiago, Chile, 9–15 June 2024; [CrossRef]
20.
Alidu, A.; Ciavotta, M.; Paoli, F.D. LLM-Based DAG Creation for Data Enrichment Pipelines in SemT Framework. In Proceedings
of the Service-Oriented Computing—ICSOC 2024 Workshops: ASOCA, AI-PA, WESOACS, GAISS, LAIS, AI on Edge, RTSEMS,
SQS, SOCAISA, SOC4AI and Satellite Events, Tunis, Tunisia, 3–6 December 2024; Springer Nature: Singapore, 2025; pp. 131–143.
[CrossRef]
21. Rahm, E.; Bernstein, P.A. A Survey of Approaches to Automatic Schema Matching. VLDB J. 2001,10, 334–350. [CrossRef]
22. Bleiholder, J.; Naumann, F. Data Fusion. ACM Comput. Surv. 2008,41, 1–41. [CrossRef]
23.
Cheney, J.; Chiticariu, L.; Tan, W. Provenance in Databases: Why, How, and Where. Found. Trends Databases 2009,1, 379–474.
[CrossRef]
24. Euzenat, J.; Shvaiko, P. Ontology Matching, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2013. [CrossRef]
25.
Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer:
Berlin/Heidelberg, Germany, 2012. [CrossRef]
26.
Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM
Comput. Surv. 2020,53, 1–42. [CrossRef]
27.
Buneman, P.; Khanna, S.; Tan, W. Why and Where: A Characterization of Data Provenance. In Proceedings of the 8th
International Conference on Database Theory (ICDT), London, UK, 4–6 January 2001; Springer: Berlin/Heidelberg, Germany,
2001; Volume 1973, pp. 316–330. [CrossRef]
28.
ISO 8601-2:2019; Date and Time—Representations for Information Interchange—Part 2: Extensions. International Organization
for Standardization: Geneva, Switzerland, 2019; Confirmed 2024; Amendment 1:2025.
29.
Bellahsene, Z.; Bonifati, A.; Rahm, E. Schema Matching and Mapping. In Schema Matching and Mapping; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 1–20. [CrossRef]
30.
Parciak, M.; Vandevoort, B.; Neven, F.; Peeters, L.M.; Vansummeren, S. LLM-Matcher: A Name-Based Schema Matching Tool
using Large Language Models. In Proceedings of the Companion of the 2025 International Conference on Management of Data,
Berlin, Germany, 22–27 June 2025; pp. 203–206. [CrossRef]
31.
Shi, D.; Meyer, O.; Oberle, M.; Bauernhansl, T. Dual data mapping with fine-tuned large language models and asset administration
shells toward interoperable knowledge representation. Robot. Comput. Integr. Manuf. 2025,91, 102837. [CrossRef]
32. Wagner, R.A.; Fischer, M.J. The String-to-String Correction Problem. J. ACM 1974,21, 168–173. [CrossRef]
33.
Rodrigues, D.; da Silva, A. A Study on Machine Learning Techniques for the Schema Matching Network Problem. J. Braz. Comput.
Soc. 2021,27, 1–22. [CrossRef]
34.
Popa, L.; Velegrakis, Y.; Miller, R.J.; Hernández, M.A.; Fagin, R. Chapter 52—Translating Web Data. In VLDB ’02: Proceedings
of the 28th International Conference on Very Large Databases, Hong Kong, China, 20–23 August 2002; Bernstein, P.A., Ioannidis, Y.E.,
Ramakrishnan, R., Papadias, D., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2002; pp. 598–609. [CrossRef]
35. Binette, O.; Steorts, R.C. (Almost) All of Entity Resolution. Sci. Adv. 2022,8, eabi8021. [CrossRef]
36.
Kemper, A.; Neumann, T. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots.
In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 11–16 April
2011; pp. 195–206. [CrossRef]
37.
Lamb, A.; Fuller, M.; Varadarajan, R.; Tran, N.; Vandiver, B.; Doshi, L.; Bear, C. The Vertica Analytic Database: C-Store 7 Years
Later. Proc. VLDB Endow. 2012,5, 1790–1801. [CrossRef]
38.
Schulze, R.; Schreiber, T.; Yatsishin, I.; Dahimene, R.; Milovidov, A. ClickHouse—Lightning Fast Analytics for Everyone. Proc.
VLDB Endow. 2024,17, 3731–3744. [CrossRef]
39.
Wang, J.; Lin, C.; Papakonstantinou, Y.; Swanson, S. An Experimental Study of Bitmap Compression vs. Inverted List Compres-
sion. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017;
pp. 993–1008. [CrossRef]
40.
Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. Better bitmap performance with Roaring bitmaps. Softw. Pract. Exp. 2016,46, 709–719.
[CrossRef]
41.
Ivanov, T.; Pergolesi, M. The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and
Parquet. Concurr. Comput. Pract. Exp. 2020,32, e5523. [CrossRef]
42.
Abadi, D.; Madden, S.; Ferreira, M. Integrating Compression and Execution in Column-Oriented Database Systems. In
Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 27–29 June 2006;
pp. 671–682. [CrossRef]
43.
Sikka, V.; Färber, F.; Lehner, W.; Cha, S.K.; Peh, T.; Bornhövd, C. Efficient transaction processing in SAP HANA database: The end
of a column store myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale,
AZ, USA, 20–24 May 2012; pp. 731–742. [CrossRef]
44.
DeCandia, G.; Hastorun, D.; Jampani, M.; Kakulapati, G.; Lakshman, A.; Pilchin, A.; Sivasubramanian, S.; Vosshall, P.; Vogels, W.
Dynamo: Amazon’s highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating
Systems Principles, Stevenson, WA, USA, 14–17 October 2007; pp. 205–220. [CrossRef]
45.
O’Neil, P.; Cheng, E.; Gawlick, D.; O’Neil, E. The Log-Structured Merge-Tree (LSM-Tree). Acta Inform. 1996,33, 351–385.
[CrossRef]
46.
Idreos, S.; Callaghan, M. Key-Value Storage Engines. In Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 2667–2672. [CrossRef]
47.
Alsubaiee, S.; Altowim, Y.; Altwaijry, H.; Behm, A.; Borkar, V.; Bu, Y.; Carey, M.; Cetindil, I.; Cheelangi, M.; Faraaz, K.; et al.
AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow. 2014,7, 1905–1916. [CrossRef]
48.
Carvalho, I.; Sá, F.; Bernardino, J. Performance Evaluation of NoSQL Document Databases: Couchbase, CouchDB, and MongoDB.
Algorithms 2023,16, 78. [CrossRef]
49.
Besta, M.; Gerstenberger, R.; Peter, E.; Fischer, M.; Podstawski, M.; Barthels, C.; Alonso, G.; Hoefler, T. Demystifying Graph
Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Comput. Surv. 2023,56,
1–40. [CrossRef]
50.
Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A.
Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management
of Data, Houston, TX, USA, 10–15 June 2018; pp. 1433–1445. [CrossRef]
51.
Melnik, S.; Gubarev, A.; Long, J.J.; Romer, G.; Shivakumar, S.; Tolton, M.; Vassilakis, T. Dremel: Interactive analysis of web-scale
datasets. Commun. ACM 2011,54, 114–123. [CrossRef]
52.
Rey, A.; Rieger, M.; Neumann, T. Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key
Generation and Joins. Proc. ACM Manag. Data 2025,3, 1–24. [CrossRef]
53.
Ghemawat, S.; Gobioff, H.; Leung, S. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems
Principles (SOSP), Bolton Landing, NY, USA, 19–22 October 2003; pp. 29–43. [CrossRef]
54.
Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010; pp. 1–10. [CrossRef]
55.
Sethi, R.; Traverso, M.; Sundstrom, D.; Phillips, D.; Xie, W.; Sun, Y.; Yegitbasi, N.; Jin, H.; Hwang, E.; Shingte, N.; et al. Presto:
SQL on Everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China,
8–11 April 2019; pp. 1802–1813. [CrossRef]
56. Vassiliadis, P. A Survey of Extract–Transform–Load Technology. Int. J. Data Warehous. Min. 2009,5, 1–27. [CrossRef]
57.
Almeida, J.R.; Coelho, L.; Oliveira, J.L. BIcenter: A Collaborative Web ETL Solution Based on a Reflective Software Approach.
SoftwareX 2021,16, 100892. [CrossRef]
58.
Kolev, B.; Valduriez, P.; Bondiombouy, C.; Jiménez, R.; Pau, R.; Pereira, J. CloudMdsQL: Querying heterogeneous cloud data stores
with a common language. Distrib. Parallel Databases 2015,34, 463–503. [CrossRef]
59.
Behm, A.; Palkar, S.; Agarwal, U.; Armstrong, T.; Cashman, D.; Dave, A.; Greenstein, T.; Hovsepian, S.; Johnson, R.; Sai Krishnan,
A.; et al. Photon: A Fast Query Engine for Lakehouse Systems. In Proceedings of the 2022 International Conference on
Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 2326–2339. [CrossRef]
60. Hausenblas, M.; Nadeau, J. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data 2013,1, 100–104. [CrossRef]
61.
Eichler, R.; Berti-Equille, L.; Darmont, J. Modeling metadata in data lakes—A generic model. Data Knowl. Eng. 2021,134, 101931.
[CrossRef]
62.
Herschel, M.; Diestelkämper, R.; Ben Lahmar, H. A survey on provenance: What for? What form? What from? VLDB J. 2017,
26, 881–906. [CrossRef]
63.
Jahnke, N.; Otto, B. Data Catalogs in the Enterprise: Applications and Integration. Datenbank-Spektrum 2023,23, 89–96. [CrossRef]
64. The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2020,49, D325–D334. [CrossRef]
65.
Raskin, R.G.; Pan, M.J. Knowledge representation in the Semantic Web for Earth and Environmental Terminology (SWEET).
Comput. Geosci. 2005,31, 1119–1125. [CrossRef]
66. Niccolucci, F.; Doerr, M. Extending, mapping, and focusing the CIDOC CRM. Int. J. Digit. Libr. 2017,18, 251–252. [CrossRef]
67.
Petrova, G.G.; Tuzovsky, A.F.; Aksenova, N.V. Application of the Financial Industry Business Ontology (FIBO) for development
of a financial organization ontology. J. Phys. Conf. Ser. 2017,803, 012116. [CrossRef]
68.
Mandel, J.C.; Kreda, D.A.; Mandl, K.D.; Kohane, I.S.; Ramoni, R.B. SMART on FHIR: A standards-based, interoperable apps
platform for electronic health records. J. Am. Med. Inform. Assoc. 2016,23, 899–908. [CrossRef]
69. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014,57, 78–85. [CrossRef]
70.
Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; Cyganiak, R.; Hellmann, S. DBpedia—A crystallization point for the
Web of Data. J. Web Semant. 2009,7, 154–165. [CrossRef]
71. Moreau, L.; Groth, P.; Cheney, J.; Lebo, T.; Miles, S. The rationale of PROV. J. Web Semant. 2015,35, 235–257. [CrossRef]
72.
Kim, B.; Niu, S.; Ding, B.; Kraska, T.; Luo, J.; Luo, W.; Tang, C.; Wang, Z.; Zhang, C.; Zhou, J. Learned Cardinality Estimation: An
In-depth Study. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD), Philadelphia, PA, USA,
12–17 June 2022; pp. 1214–1227. [CrossRef]
73.
Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the 1st International
Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM@SIGMOD), Houston, TX, USA, 10 June
2018; pp. 3:1–3:4. [CrossRef]
74.
Marcus, R.; Negi, P.; Mao, H.; Tatbul, N.; Alizadeh, M.; Kraska, T. Bao: Making Learned Query Optimization Practical.
In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), Xi’an, China, 20–25 June 2021;
pp. 1275–1288. [CrossRef]
75.
Ahmad, Y.; Kennedy, O.; Koch, C.; Nikolic, M. DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views.
Proc. VLDB Endow. 2012,5, 968–979. [CrossRef]
76.
Budiu, M.; Chajed, T.; McSherry, F.; Ryzhyk, L.; Tannen, V. DBSP: Automatic Incremental View Maintenance for Rich Query
Languages. Proc. VLDB Endow. 2023,16, 1601–1614. [CrossRef]
77. Elghandour, I.; Aboulnaga, A. ReStore: Reusing Results of MapReduce Jobs. Proc. VLDB Endow. 2012,5, 586–597. [CrossRef]
78.
Armbrust, M.; Das, T.; Torres, J.; Yavuz, B.; Zhu, S.; Xin, R.; Ghodsi, A.; Stoica, I.; Zaharia, M. Structured Streaming: A Declarative
API for Real-Time Applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data,
Houston, TX, USA, 10–15 June 2018; pp. 601–613. [CrossRef]
79.
Akidau, T.; Begoli, E.; Chernyak, S.; Hueske, F.; Knight, K.; Knowles, K.; Mills, D.; Sotolongo, D. Watermarks in stream processing
systems: Semantics and comparative analysis of Apache Flink and Google cloud dataflow. Proc. VLDB Endow. 2021,14, 3135–3147.
[CrossRef]
80. Vogels, W. Eventually Consistent. Commun. ACM 2009,52, 40–44. [CrossRef]
81.
Schelter, S.; Biessmann, F.; Januschowski, T.; Salinas, D.; Seufert, S.; Krettek, A. Automating Large-Scale Data Quality Verification.
Proc. VLDB Endow. 2018,11, 1781–1794. [CrossRef]
82.
Janssen, N.; Ilayperuma, T.; Arachchige, J.J.; Bukhsh, F.A.; Daneva, M. The evolution of data storage architectures: Examining the
secure value of the data lakehouse. J. Data Inf. Manag. 2024,6, 309–334. [CrossRef]
83.
Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos,
L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016,3, 160018.
[CrossRef]
84.
Callahan, T.J.; Tripodi, I.J.; Stefanski, A.L.; Cappelletti, L.; Taneja, S.B.; Wyrwa, J.M.; Casiraghi, E.; Matentzoglu, N.A.; Reese, J.;
Silverstein, J.C.; et al. An open source knowledge graph ecosystem for the life sciences. Sci. Data 2024,11, 363. [CrossRef]
85.
Gowan, T.A.; Horel, J.D.; Jacques, A.A.; Kovac, A. Using Cloud Computing to Analyze Model Output Archived in Zarr Format. J.
Atmos. Ocean. Technol. 2022,39, 449–462. [CrossRef]
86.
Moore, J.; Basurto-Lozada, D.; Besson, S.; Bogovic, J.; Bragantini, J.; Brown, E.M.; Burel, J.; Moreno, X.C.; Medeiros, G.d.; Diel,
E.E.; et al. OME-Zarr: A cloud-optimized bioimaging file format with international community support. Histochem. Cell Biol. 2023,
160, 223–251. [CrossRef]
87.
Joyce, A.; Javidroozi, V. Smart City Development: Data Sharing vs. Data Protection Legislations. Cities 2024,148, 104859.
[CrossRef]
88.
Ahmi, A. OpenRefine: An Approachable Tool for Cleaning and Harmonizing Bibliographical Data. AIP Conf. Proc. 2023,
2827, 030006. [CrossRef]
89. Willekens, F. Programmatic Access to Open Statistical Data for Population Studies: The SDMX Standard. Demogr. Res. 2023,
49, 1117–1162. [CrossRef]
90. Kayali, M.; Lykov, A.; Fountalis, I.; Vasiloglou, N.; Olteanu, D.; Suciu, D. Chorus: Foundation Models for Unified Data Discovery
and Exploration. Proc. VLDB Endow. 2024, 17, 2104–2114. [CrossRef]
91. Leipzig, J.; Nüst, D.; Hoyt, C.T.; Ram, K.; Greenberg, J. The role of metadata in reproducible computational research. Patterns
2021, 2, 100322. [CrossRef]
92. Ahmad, T. Benchmarking Apache Arrow Flight - A wire-speed protocol for data transfer, querying and microservices. In
Proceedings of the Benchmarking in the Data Center: Expanding to the Cloud, Seoul, Republic of Korea, 2–6 April 2022; [CrossRef]
93. Shraga, R.; Gal, A. PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching. J. Data Inf.
Qual. 2022, 14, 1–27. [CrossRef]
94. Zhang, J.; Shin, B.; Choi, J.D.; Ho, J.C. SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching.
In Advances in Databases and Information Systems (ADBIS 2021); Springer: Berlin/Heidelberg, Germany, 2021; Volume 12843,
Lecture Notes in Computer Science; pp. 260–274. [CrossRef]
95. European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural
persons with regard to the processing of personal data and on the free movement of such data. Off. J. Eur. Union 2016, 679, 10–13.
96. European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 on harmonised
rules on artificial intelligence. Off. J. Eur. Union 2024. Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
(accessed on 21 October 2025).
97. Álvarez, J.M.; Colmenarejo, A.B.; Elobaid, A.; Fabbrizzi, S.; Fahimi, M.; Ferrara, A.; Ghodsi, S.; Mougan, C.; Papageorgiou, I.;
Lobo, P.R.; et al. Policy advice and best practices on bias and fairness in AI. Ethics Inf. Technol. 2024, 26, 31. [CrossRef]
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.