Academic Editors: Haridimos Kondylakis, Yuan Tian, Le Sun, Vidyasagar Potdar and Biao Song

Received: 27 August 2025

Revised: 28 September 2025

Accepted: 22 October 2025

Published: 26 October 2025

Citation: Koukaras, P. Data Integration and Storage Strategies in Heterogeneous Analytical Systems: Architectures, Methods, and Interoperability Challenges. Information 2025, 16, 932. https://doi.org/10.3390/info16110932

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Review

Data Integration and Storage Strategies in Heterogeneous Analytical Systems: Architectures, Methods, and Interoperability Challenges

Paraskevas Koukaras

School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece; p.koukaras@ihu.edu.gr

Abstract

In the current landscape of near-universal data accessibility, organisations face highly complex challenges in integrating and processing diverse datasets to meet their analytical needs. This review analyses traditional and innovative methods for data storage and integration, with particular focus on their implications for scalability, consistency, and interoperability within an analytical ecosystem. In particular, it contributes a cross-layer taxonomy linking integration mechanisms (schema matching, entity resolution, and semantic enrichment) to storage/query substrates (row/column stores, NoSQL, lakehouse, and federation), together with comparative tables and figures that synthesise trade-offs and performance/governance levers. The review covers schema mapping solutions that address structural heterogeneity, storage architectures ranging from traditional systems to cloud storage, and ETL pipeline integration using federated query processors. It gives specific attention to metadata management, with a focus on semantic enrichment using ontologies and on lineage management to enable end-to-end traceability and governance. It also covers performance hotspots and caching techniques, along with the consistency trade-offs that arise in distributed systems. Empirical case studies from real applications in enterprise lakehouses, scientific exploration activities, and public governance applications ground this review. The review closes with future directions in convergent analytical platforms that support multiple workloads and in metadata-centric orchestration with provisions for AI-based integration. By combining technological advances with practical considerations, it offers a resource for researchers and practitioners seeking to build fault-tolerant, reliable, and future-ready data infrastructure. This review is primarily aimed at researchers, system architects, and advanced practitioners who design and evaluate heterogeneous analytical platforms. It also offers value to graduate students by serving as a structured overview of contemporary methods, thereby bridging academic knowledge with industrial practice.

Keywords: data integration; data storage; metadata management; heterogeneous systems; schema mapping; lakehouse architecture; federated querying; data lineage; data interoperability; pipeline orchestration

1. Introduction

Modern analytics operates over a heterogeneous collection of formats, systems, and semantics, introducing integration challenges that are poorly served by purely schema-centric or storage-centric approaches. This section introduces the contextual background of the research topic. It highlights the importance of defining consistent methodologies that bridge heterogeneity-aware integration and scalable storage, and it clarifies the relevant parameters and research contributions.

1.1. The Growing Heterogeneity in Analytical Ecosystems

In today’s data-driven society, organisations increasingly rely on a heterogeneous set of data sources to support decision-making, knowledge gathering, and machine learning. These sources include traditional relational databases, key-value stores, document-oriented stores, semi-structured formats such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), and public cloud storage services such as Google Cloud Storage and Amazon Simple Storage Service (S3), along with columnar file formats such as Apache Parquet and ORC [–]. The diversity of these sources stems not only from storage structures and supported query methods but also from differences in schemata, data representation, access protocols, and update characteristics. This diversity is a significant obstacle for data management and integration, mainly because end users expect uniform, high-quality outputs from environments that are both highly heterogeneous and widely distributed [4–6].

Researchers, engineers, and data scientists regularly deal with heterogeneous data pipelines and, more specifically, with the problem of integrating data from multiple storage systems, models, and schemas. Although scalable storage and high-throughput query technologies have advanced significantly, many systems still suffer from schema mismatches, redundancy, misleading metadata, and data semantics that vary across sources. These issues are prominent in enterprise data lakes, federated information infrastructures, scientific research campuses, and analytical platforms spanning multiple operational areas. Their increasing occurrence in hybrid cloud infrastructure highlights the need for more advanced integration methods, along with adaptive storage solutions that accommodate changing schemas, reduce redundancy, and handle semi-structured data [5,7,8].

1.2. Motivation for Uniﬁed Integration and Storage Strategies

A holistic understanding of integration approaches, comprising schema matching, instance alignment, and data fusion, is needed to meet the challenges posed by diverse data types. The need is especially pronounced when such approaches are combined with modern storage platforms such as distributed file systems, Not Only SQL (NoSQL) stores, and lakehouse implementations [–]. Traditional integration approaches focused on datasets with static relational schemas. This view is no longer sufficient for workloads that require more flexible approaches to processing dynamic, semi-structured, and rapidly changing datasets.

Storage innovation has brought new trade-offs between performance, flexibility, and consistency. Schema-on-read querying, late-binding policies, and federated access protocols allow data to be queried without upfront data modelling but generally weaken semantic expressiveness or operational efficiency [ ]. Data integration solutions therefore need to be aligned with the storage infrastructure to achieve scalability, interoperability, and reliable analytics [7,8,12].

This study addresses a question of value to researchers and practitioners alike: it supports effective decision-making in system development, helps identify integration challenges that arise from discipline-specific differences in infrastructure, and promotes best practices for managing analytical processes across varied infrastructural environments.

1.3. Scope and Contributions of the Review

This research undertakes a broad review of the integration and storage methods used within various analytical architectures, giving balanced attention to structured and semi-structured data. It covers schema-based approaches such as schema matching, Entity Resolution (ER), and semantic enrichment, as well as architecture-centric strategies that involve columnar database systems, key-value stores, cloud storage solutions, and integrated query processing systems [1–3,7,9].

Modern methodologies and frameworks are categorised according to their roles in the data management continuum, which spans data acquisition, storage setup, metadata management, and cross-source query optimisation. The analysis centres on how these methodologies address heterogeneity, scalability, and interoperability, and on the trade-offs that arise in real-world deployments [5,6].

This work addresses three main aspects: (i) a synthesis of data integration and storage paradigms in heterogeneous analytical contexts, drawing connections between schema-level and infrastructure-level heterogeneity; (ii) a comparative analysis of representative systems and tools, highlighting their functionalities, limitations, and integration strategies; and (iii) identification of outstanding challenges and potential paths forward, including opportunities for metadata-driven integration, schema evolution management, and AI-enabled interoperability. Beyond these three aspects, this work offers four substantial contributions:

• It introduces an end-to-end cross-layer taxonomy that relates integration mechanisms such as schema matching, entity resolution, and semantic enrichment to a heterogeneous collection of storage methods and access models, including row/column stores, several families of NoSQL stores, cloud-native columnar stores, and lakehouse systems. This taxonomy illustrates the interplay of workloads and architectures and their effects on governance and performance. By considering Sections 2 and 3 together, practitioners can map a data unification task to its corresponding storage layer and query format.

• It lays out the operational planning of ETL, ELT, and data virtualisation/federation by considering popular process models and tools, followed by an analysis of the impact of optimisers and pushdown methods in integrated SQL engines. This yields a pragmatic architecture for balancing latency, cost, and data freshness across varying backends.

• It highlights metadata, lineage, and schema evolution as central constructs for ensuring reproducibility and dependability, and it formalises performance enhancements such as caching/materialisation and freshness/consistency control in a checklist to support architectural decision-making.

• It consolidates the findings with cross-domain examples from enterprise lake/lakehouse, scientific integration, and public analytics domains, which illustrate pragmatic patterns for managed, heterogeneous schema-on-read/write pipelines.

These developments integrate embedded methodologies and infrastructure into a unified, decision-focused view for data teams and architects.

Positioning vs. Prior Surveys

Relative to recent surveys on data lakes [5], data federation [6], lakehouses [13], unified analytics [14], and data integration [9], Table 1 contrasts scope and adds a side-by-side view of what this synthesis contributes beyond these works.

This review contributes (i) a cross-layer taxonomy that explicitly couples integration mechanisms (schema matching, entity resolution, and semantic enrichment) with storage and access models (row/column stores, NoSQL, lakehouses, and federation), operationalised via an application cross-walk (Section 7); (ii) working, reproducible resolution workflows for canonical heterogeneity problems, namely schema name mismatch and instance-level date ambiguity, with concrete enforcement/lineage policies (Section 2.4); (iii) a governance-aware treatment that binds performance levers (pushdown and materialisation/caching) to data-quality SLAs, freshness, and consistency (Section 6); and (iv) to the best of the author's knowledge, the first synthesis that integrates peer-reviewed 2024–2025 evidence on AI-assisted integration (LLM-based schema/ontology matching and cost-aware/active ER) into an architectural decision frame [15–20].

This positioning clarifies how this survey complements system-focused reviews: it provides design-level guidance that traverses schema, semantics, storage, and operations and translates that guidance into domain patterns (Section 7).

Table 1. Positioning against prior surveys: scope, coverage, and added value of this review.

| Survey | Primary Focus | Coverage | What This Review Adds |
|---|---|---|---|
| [5] | Data lake functions/systems | Ingestion, metadata, governance, and lake infrastructure | Cross-layer linkage of schema/ER/semantics with storage + performance/governance levers |
| [6] | Federation architectures and capabilities | Connectors, query translation, and execution models | Integration of federation with storage models, ETL/ELT, and performance tuning |
| [13] | Lakehouse concepts/technologies | Transactional tables, open formats, and hybrid lake/warehouse design | Comparative role of lakehouses among integration strategies; lineage/governance integration |
| [14] | Unified analytics vision | Fabrics, abstractions, and open challenges | Operational taxonomy, workflows, domain case studies, and incorporation of 2024–2025 AI instruments |
| [9] | Big data integration foundations | Schema matching, mapping, and fusion | Updated synthesis with recent metadata catalogues, lakehouses, and federated engines |

The goal of this work is to act as a reference point for data engineers, system architects, and researchers working at the intersection of data analytics and data management by connecting theoretical integration models with the relevant technical frameworks for data storage and retrieval. The intended audience is threefold: (i) researchers benefit from a synthesis of taxonomies, comparative evaluations, and identification of open challenges; (ii) practitioners and system architects gain guidance on trade-offs, architectural patterns, and integration strategies applicable to real deployments; and (iii) students obtain an accessible yet rigorous entry point into heterogeneous data integration and storage, which complements more specialised literature. This multi-tiered orientation ensures that the review delivers concrete utility across scholarly, professional, and pedagogical contexts.

Finally, to make the taxonomy directly actionable, this work explicitly connects its layers to the application domains analysed in Section 7. Concretely, (i) integration mechanisms (schema matching and entity resolution for semantic enrichment, from Sections 2.2 and 2.3) are mapped to enterprise lakehouse harmonisation of customer and product identifiers; (ii) metadata, lineage, and ontology scaffolding (Section 5) underpin scientific workflows for reproducibility and semantic alignment; and (iii) hybrid schema strategies and federated access (Sections 4.1–4.3) address governance and inter-agency interoperability in public-sector pipelines.

1.4. Review Methodology

This work is conceived as a structured survey rather than a systematic review. Accordingly, it did not aim at exhaustive coverage of all publications but, instead, sought to capture representative methods, architectures, and trends that define the state of practice in heterogeneous analytical systems.

The literature selection followed three guiding principles. First, it focused on peer-reviewed venues where novel data integration and storage strategies are usually published, including ACM SIGMOD; VLDB; IEEE ICDE; and journals such as TKDE, VLDBJ, and Information Fusion. Second, the time span of 2015–2025 was targeted to reflect the transition from early data-lake and NoSQL adoption to recent lakehouse and AI-assisted integration methods while retaining seminal pre-2015 contributions for context (e.g., schema matching [ ], data fusion [ ], and provenance [ ]). Third, an inclusion criterion was applied that required explicit linkage between integration mechanisms (schema matching, ER, and semantics) and storage/query substrates (row/column, NoSQL, lakehouse, and federation), as well as relevance to metadata, lineage, governance, or performance trade-offs.

Searches were performed in ACM Digital Library, IEEE Xplore, DBLP, and Google Scholar using combinations of terms such as “data lake”, “lakehouse”, “federated query”, “schema matching”, “entity resolution”, and “metadata catalog”. Reference snowballing from key surveys (e.g., [5,6,13,14]) ensured coverage of influential works.

Finally, to structure the review, the selected literature was organised into a cross-layer taxonomy spanning schema/ER/semantic integration (Section 2), storage models and architectures (Section 3), integration strategies (Section 4), metadata/lineage/semantics (Section 5), performance levers (Section 6), and application domains (Section 7). This taxonomy provides a balanced view that prioritises clarity, representativeness, and analytical depth over completeness.

2. Foundations of Data Integration

This section identifies the schema, semantic, and instance levels as sources of heterogeneity and then presents the methodologies proposed to overcome them, such as schema mapping or matching, entity resolution, and data fusion. These are the basic terms and central notions applied throughout the review.

2.1. Types of Heterogeneity: Schema, Semantic, Instance Levels

Data integration is the process of combining information from independent sources into a unified scheme. This process must contend with several kinds of heterogeneity [ ]. Schema heterogeneity occurs when different datasets use differing structures or schemas to represent the same or similar data. For instance, two databases can refer to customer details using different attribute names (such as customer_id and client_no) or different types (such as string and integer for identifiers). Harmonising these structural variations often forms a considerable part of the first phase of the integration process [21].

Besides schema mismatches, semantic heterogeneity arises when data elements have differing meanings even though their structure or names are similar. A classic example is one application using the word salary for gross annual earnings while another uses it for net monthly take-home earnings. Addressing semantic heterogeneity calls for an understanding of context, often aided by metadata, external ontologies, or domain-specific vocabularies [23,24].

Instance-level heterogeneity refers to differences in the values supplied by different sources. These can take the form of format differences (1/12/2025 vs. 2025-12-01), unit mismatches (kilometres vs. miles), or identifier disagreements (customer IDs varying across systems). Such differences are barriers to dataset integration and necessitate instance-level reconciliation methods such as normalisation, transformation, or record linkage [ ]. Handling multiple types of heterogeneity at once poses significant challenges, particularly when semi-structured or unstructured data are integrated with traditional structured databases [ ]. Next-generation data integration platforms need to support dynamic schema alignment, semantic mediation, and recorded record reconciliation while maintaining data integrity and provenance [23,27].

For pedagogical clarity, the resolution steps are also made explicit. Typical schema name mismatches (e.g., customer_id vs. client_no) are first detected via semi-automated schema matching (lexical/structural/instance evidence), then compiled into executable mappings (SQL/XQuery) and validated against gold standards (ground truth curated by domain experts) or ontology-backed correspondences. Instance-level ambiguities (e.g., the date “01/12/2025”) are eliminated through schema-on-write or hybrid enforcement, which standardises canonical formats (e.g., ISO 8601 [ ]) at ingestion and validates them via metadata-driven checks. Where legacy sources cannot be conformed at write time, a compensating read-time normaliser is applied, and its lineage is recorded.

Table 2 summarises the main forms of heterogeneity encountered in data integration, illustrating schema, semantic, and instance-level challenges with examples.

Table 2. Types of heterogeneity in data integration, with examples and resolution approaches.

| Heterogeneity Type | Description and Example | Resolution Approach |
|---|---|---|
| Schema heterogeneity | Differing data schemas (e.g., customer_id vs. client_no) | Schema mapping and translation [21] |
| Semantic heterogeneity | Same term has different meanings (e.g., salary = gross vs. net) | Ontology mapping [24] |
| Instance-level heterogeneity | Different formats/values (e.g., dates 01/12/2025 vs. 2025-12-01) | Data cleaning and normalisation [25] |

2.2. Schema Matching and Map Generation

Schema matching is the activity of identifying correspondences between the elements of different data schemas. It is a basic activity in data integration that allows heterogeneous datasets to be combined into an integrated frame of reference [ ]. Schema matching algorithms can broadly be classified into three main categories: name-based, structure-based, and instance-based methods (summarised in Table 3). Recent AI-driven instruments demonstrate that LLM prompting over schema documentation can bootstrap high-quality correspondences in instance-restricted domains and reduce verification effort, exceeding purely lexical baselines [ ]. Beyond instance-free discovery, fine-tuned LLMs have been used to materialise executable mappings against standard information models in industrial settings, thereby strengthening semantic interoperability [ ].

Table 3. Categories of schema matching algorithms, with matching strategies and examples.

| Matcher Type | Matching Strategy | Representative Work |
|---|---|---|
| Name-based (lexical) | Compares schema element names (string similarity and synonyms) | [32] |
| Structure-based | Exploits schema structure (hierarchies and parent–child structures) | [29] |
| Instance-based | Uses actual data values (overlapping ranges and distributions) | [21] |

Lexical matchers rely on lexical relationships between names at both the attribute and table levels. String distance measures, including edit distance computed by dynamic programming and set-overlap measures, may be combined with synonyms drawn from linguistic resources to determine potential matches [32].
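As an illustration only, the lexical evidence described above can be sketched in Python. The sketch combines a dynamic-programming edit distance with a token-set (Jaccard) overlap. The synonym table, the function names, and the equal weighting of the two scores are assumptions made here for illustration and are not taken from any of the surveyed matchers.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical synonym table; a real matcher would consult a linguistic resource.
SYNONYMS = {"client": "customer", "no": "id", "num": "id"}


def canonical_tokens(name: str) -> set[str]:
    """Split 'client_no' into tokens and map each token to a canonical synonym."""
    return {SYNONYMS.get(t, t) for t in name.lower().split("_") if t}


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the canonical token sets of two attribute names."""
    ta, tb = canonical_tokens(a), canonical_tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def name_score(a: str, b: str) -> float:
    """Equal-weight blend of edit-distance similarity and token overlap."""
    longest = max(len(a), len(b)) or 1
    edit_similarity = 1 - edit_distance(a.lower(), b.lower()) / longest
    return 0.5 * edit_similarity + 0.5 * token_overlap(a, b)
```

Under these assumptions, `name_score("customer_id", "client_no")` ranks above `name_score("customer_id", "order_date")` because the synonym table aligns both tokens, even though the raw strings differ considerably.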

Structure-based matchers exploit the inherent relationships between the elements of a

schema, such as type constraints, hierarchies, and foreign keys, to boost the effectiveness

of a match. For example, tables that share similar parent–child relationships or similar

referential patterns could be considered structurally equivalent [29].

Instance-based matchers evaluate the compatibility of the actual data values stored under each schema through comparative analysis. Evidence of semantic similarity between schema elements may come from measures such as the intersection of value ranges, shared value distributions, or statistical interdependencies. Once correspondences are identified, mapping generation produces transformation rules that specify how source schemas are converted into the target schema. These mappings can be specified declaratively in languages such as the Structured Query Language (SQL), XQuery, or Datalog or, alternatively, built visually using ETL software [ ]. In more sophisticated scenarios, map generation is performed with semi-automated or interactive software, which still requires human intervention to validate or fine-tune the proposed recommendations, especially in environments where precision is of high value.
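Two of the instance-level signals mentioned above, overlap of distinct values and overlap of numeric ranges, can be sketched as follows. This is a minimal illustration: the function names and the normalisation of values to lowercase strings are assumptions made here, and a production matcher would also compare distributions and sample large columns.

```python
def value_overlap(col_a, col_b) -> float:
    """Jaccard overlap of the distinct, normalised values of two columns."""
    a = {str(v).strip().lower() for v in col_a if v is not None}
    b = {str(v).strip().lower() for v in col_b if v is not None}
    return len(a & b) / len(a | b) if a | b else 0.0


def range_overlap(col_a, col_b) -> float:
    """Share of the combined numeric range that both columns cover."""
    lo_a, hi_a = min(col_a), max(col_a)
    lo_b, hi_b = min(col_b), max(col_b)
    union = max(hi_a, hi_b) - min(lo_a, lo_b)
    if union == 0:  # both columns hold one identical value
        return 1.0
    intersection = min(hi_a, hi_b) - max(lo_a, lo_b)
    return max(intersection, 0) / union


# Hypothetical identifier columns from two sources.
customer_id = [1001, 1002, 1003, 1004]
client_no = ["1002", "1003", "1004", "1005"]
overlap = value_overlap(customer_id, client_no)
```

A high overlap is evidence for a correspondence, not proof of one; the human validation step described above remains necessary.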

Schema matching solutions using AI methods such as embeddings and neural models have been reported in recent literature [ ]. These studies indicate an encouraging potential for producing high-quality mappings with limited manual intervention, which would facilitate comprehensive integration efforts across heterogeneous sources with inconsistent naming conventions and limited documentation. Foundational mapping systems such as Clio demonstrate map creation and data exchange at scale [34].

2.3. Entity Resolution and Data Fusion

Even when schemas are aligned, a major challenge remains in ER, which is also termed record linkage, duplicate identification, or reference reconciliation. ER refers to identifying and comparing records that describe a single real-world entity across multiple platforms. For example, customer data gathered from more than one Customer Relationship Management system can use varying identifiers, spellings, or address patterns for a single individual [ ]. Concurrently, cost-aware LLM prompting has been systematised. The authors of [ ] designed batch prompting for ER that preserves accuracy while substantially reducing API cost. In engineering contexts, fine-tuned open-source LLMs used as ER classifiers surpass the prior state of the art (SOTA), as well as GPT-4 with in-context learning, on benchmark datasets, indicating practical headroom for LLM-driven ER beyond zero/few-shot architectures [ ]. Moreover, active in-context learning improves cross-domain ER without task-specific fine-tuning, mitigating domain shift [16].

Most ER algorithms are built on a framework consisting of similarity functions, blocking methods, and classification models [ ]. Similarity functions assess particular attributes such as names, dates, and locations. Blocking reduces the search space by dividing potential matches into subsets that can be processed quickly. Most classification methodologies use supervised learning to infer, from previously extracted features, whether two records refer to the same entity. Recent surveys also emphasise probabilistic and Bayesian approaches for ER at scale [35].
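The three components can be sketched end to end in a few lines of Python. Note the simplification: a fixed similarity threshold stands in for the supervised classifier described above, and the blocking key, the records, and the threshold value are hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from several source systems.
records = [
    {"id": "a1", "name": "Jon Smith", "city": "Leeds"},
    {"id": "b7", "name": "John Smith", "city": "Leeds"},
    {"id": "c3", "name": "Joan Smythe", "city": "York"},
]


def block_key(rec: dict) -> tuple:
    """Blocking: only records sharing a city and first initial are compared."""
    return (rec["city"].lower(), rec["name"][0].lower())


def similarity(a: dict, b: dict) -> float:
    """Similarity function on the name attribute (difflib ratio in [0, 1])."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def resolve(recs: list, threshold: float = 0.85) -> list:
    """Return id pairs judged to refer to the same entity."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[block_key(rec)].append(rec)
    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            # Threshold rule standing in for a trained classifier.
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches
```

With the sample data, only the two Leeds records fall into the same block, so a single pair is scored instead of three. That saving is the purpose of blocking, and it is also the reason a poorly chosen key can hide true matches.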

Upon completion of the entity identification process, the next step is data fusion, which is the combination of similar records into a single, unified representation. This involves resolving the inconsistencies among the differing sets of attribute values. Techniques used to enable this fusion include the following:

• Source prioritisation: prioritisation of a primary source over other sources;

• Aggregation or voting: the combination of many values by applying statistical methods;

• Provenance-aware fusion: the use of metadata in selecting values based on their timeliness, prevalence, or reliability.
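The three fusion strategies listed above can be contrasted on a single conflicting attribute. The sketch below is illustrative only: the claim structure, source names, and dates are hypothetical, and recency is used as just one possible provenance signal.

```python
from collections import Counter
from datetime import date

# Hypothetical conflicting address values for one resolved customer.
claims = [
    {"source": "crm", "value": "12 High St", "updated": date(2023, 5, 1)},
    {"source": "billing", "value": "12 High Street", "updated": date(2025, 2, 9)},
    {"source": "web", "value": "12 High St", "updated": date(2024, 1, 15)},
]


def fuse_by_priority(claims: list, priority: list) -> str:
    """Source prioritisation: take the value from the highest-ranked source."""
    return min(claims, key=lambda c: priority.index(c["source"]))["value"]


def fuse_by_vote(claims: list) -> str:
    """Voting: take the most frequent value across sources."""
    return Counter(c["value"] for c in claims).most_common(1)[0][0]


def fuse_by_recency(claims: list) -> dict:
    """Provenance-aware: take the freshest value and keep its lineage."""
    best = max(claims, key=lambda c: c["updated"])
    return {"value": best["value"], "lineage": best["source"]}
```

The strategies disagree on this input: voting selects the abbreviated address, whereas prioritising the billing source or preferring the most recent update selects the full one. Recording which rule produced the fused value is what makes the result traceable.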

Advanced fusion techniques can incorporate domain-specific data, ontologies, or user input to resolve conflicts. In many real-world applications, data fusion needs to preserve lineage and traceability so that users can understand how the fused representation was constructed and from which sources it was derived [ ]. Modern ER and data fusion techniques are increasingly being integrated into data management systems, especially Master Data Management (MDM) systems and data lakehouses [ ]. In analytical applications that require real-time or near-real-time processing, it is critical that such tools scale well, integrate easily with metadata repositories, and support workflow orchestration [ ]. The overall process of entity resolution and data fusion is summarised in Figure 1, which illustrates the sequence from similarity computation through blocking and supervised classification to fusion strategies that produce unified records with preserved lineage.

2.4. Illustrative Resolution Scenarios and Workﬂows

This subsection provides practical examples that move beyond conceptual statements and demonstrate concrete, reproducible resolution flows for common heterogeneity issues.

2.4.1. Scenario A: Schema Name Mismatch (customer_id vs. client_no)

1. Evidence gathering: Apply name-, structure-, and instance-based matching algorithms to propose a correspondence between customer_id (RDBMS A) and client_no (RDBMS B).

2. Candidate alignment: Check against a domain ontology (e.g., “Customer” with one main identifier) to resolve homonyms and enforce cardinality/uniqueness constraints.

3. Mapping synthesis: Develop an operational mapping (for instance, an SQL view or an ELT transformation) that presents client_no as customer_id while ensuring type harmonisation.

4. Validation and lineage: Validate precision and recall on a suitably chosen sample. Keep the mapping, quality measures, and provenance in the metadata catalogue and implement the mapping as a reusable module in the pipeline.
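Steps 3 and 4 can be sketched with Python's built-in sqlite3 module. The table names, view names, and sample rows are hypothetical, and SQLite is used only because it ships with Python; the same pattern applies to a view in any SQL engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical source tables for RDBMS A and RDBMS B.
    CREATE TABLE a_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b_clients   (client_no TEXT, name TEXT);
    INSERT INTO a_customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO b_clients   VALUES ('0002', 'Grace'), ('0003', 'Edsger');

    -- Mapping: present client_no as customer_id, harmonised to INTEGER.
    CREATE VIEW b_customers_mapped AS
        SELECT CAST(client_no AS INTEGER) AS customer_id, name FROM b_clients;

    -- Unified view that keeps the originating source for lineage.
    CREATE VIEW customers_unified AS
        SELECT customer_id, name, 'A' AS source FROM a_customers
        UNION ALL
        SELECT customer_id, name, 'B' AS source FROM b_customers_mapped;
""")

rows = con.execute(
    "SELECT customer_id, source FROM customers_unified ORDER BY customer_id, source"
).fetchall()


def precision_recall(proposed: set, gold: set) -> tuple[float, float]:
    """Score proposed correspondences against an expert-curated gold sample."""
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

The view leaves both sources untouched, so the mapping can be versioned and stored in the catalogue alongside its precision and recall scores.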

Figure 1. Entity resolution and data fusion: similarity functions; blocking; supervised classification using previously extracted features; clustering; and fusion via source prioritisation, aggregation/voting, and provenance-aware selection, producing unified records that preserve lineage/traceability.

Figure 2 visualises the matcher → mapping → validation workflow.

2.4.2. Scenario B: Instance-Level Ambiguity (Date “01/12/2025”)

1. Policy: Adopt a canonical date format (ISO 8601, YYYY-MM-DD) as a data contract for analytical layers.

2. Enforcement (schema-on-write where feasible): At ingestion, parse source dates using locale-aware parsers with explicit day/month disambiguation. Reject or quarantine unparsable records. Normalise to ISO.

3. Hybrid fallback (schema-on-read): For immutable/legacy sources, apply a deterministic normaliser at query time with source-specific locale rules. Mark records with confidence and retain the raw value.

4. Validation and observability: Define expectations (e.g., no ambiguous DD/MM vs. MM/DD overlaps) and monitor drift. Expose freshness and conformance metrics in the catalogue/lineage system.
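A deterministic normaliser of the kind described in steps 2 to 4 might look as follows. This is a sketch under stated assumptions: the source names and their locale profiles are hypothetical, and the quarantine record is a plain dictionary standing in for whatever the catalogue/lineage system expects.

```python
from datetime import datetime

# Hypothetical per-source locale profiles (which format each source emits).
SOURCE_FORMATS = {
    "crm_eu": "%d/%m/%Y",
    "crm_us": "%m/%d/%Y",
    "warehouse": "%Y/%m/%d",
}


def normalise_date(raw: str, source: str) -> dict:
    """Return an ISO 8601 date, or quarantine the record, retaining the raw value."""
    fmt = SOURCE_FORMATS.get(source)
    if fmt is None:
        return {"raw": raw, "iso": None, "status": "quarantined", "reason": "unknown locale"}
    try:
        parsed = datetime.strptime(raw, fmt).date()
    except ValueError:
        return {"raw": raw, "iso": None, "status": "quarantined", "reason": "unparsable"}
    return {"raw": raw, "iso": parsed.isoformat(), "status": "ok", "source": source}


def is_ambiguous(raw: str) -> bool:
    """True when day-first and month-first parsing yield different valid dates."""
    parsed = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            parsed.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass
    return len(parsed) > 1
```

The same input, "01/12/2025", normalises to 2025-12-01 for the day-first source and to 2025-01-12 for the month-first source, which is why the locale profile has to be part of the data contract. The `is_ambiguous` check can back the expectation in step 4 by counting values that would change meaning if a source's profile were wrong.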

Figure 2. Schema-matching → mapping → validation workflow for resolution of customer_id vs. client_no.

Table 4 summarises the policy, and Figure 3 shows the control flow.

Table 4. Ambiguous date inputs and canonicalisation policy for analytical layers.

| Ambiguous Input | Enforced Rule (Contract) | Canonical Form | Validation Mechanism |
|---|---|---|---|
| 01/12/2025 (unknown locale) | YYYY-MM-DD (ISO 8601) | 2025-12-01 or reject | Expectation checks, quarantine on ambiguity |
| 12/01/2025 (unknown locale) | YYYY-MM-DD | 2025-01-12 or reject | Locale profile, confidence tagging |
| 2025/12/01 | YYYY-MM-DD | 2025-12-01 | Strict pattern enforcement |


Figure 3. Date disambiguation pipeline with schema-on-write preference and hybrid read-time

normalisation under lineage/quality monitoring.

2.4.3. Scenario C: Semantic Conﬂict (Salary = Gross vs. Net)

Vocabulary alignment: Bind source attributes to ontology concepts (e.g., GrossAnnu-

alSalary and NetMonthlySalary).

Normalisation: Deﬁne transformation rules (unit/calendar/period adjustments) with

explicit semantics. Compute derived measures only where deﬁnable.

Query mediation: Expose a semantically consistent view (virtualised or materialised)

that prevents cross-meaning joins. Annotate with provenance and assumptions in

the catalogue.
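The rule "compute derived measures only where definable" can be illustrated with a short Python sketch. The concept names GrossAnnualSalary and NetMonthlySalary come from the scenario; GrossMonthlySalary is a hypothetical additional concept introduced here to show a conversion that is a pure period adjustment.

```python
def to_gross_annual(amount, concept):
    """Convert a salary figure to GrossAnnualSalary where the semantics allow it.

    Returns None when the conversion is not definable from the value alone.
    """
    if concept == "GrossAnnualSalary":
        return amount
    if concept == "GrossMonthlySalary":  # hypothetical concept for this sketch
        return amount * 12  # period adjustment only; meaning is preserved
    # Net figures cannot be turned into gross figures without tax and
    # contribution rules, so no derived value is produced.
    return None


annual_from_monthly = to_gross_annual(3000.0, "GrossMonthlySalary")
annual_from_net = to_gross_annual(2400.0, "NetMonthlySalary")
```

Returning None rather than an estimate keeps gross and net values from being joined or aggregated as if they shared one meaning.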

2.5. Limitations and Possible Biases in Reviewed Methods

Despite substantial progress, each strand of data integration exhibits inherent risks

and biases, which may constrain reproducibility.

Schema matching and mapping. Lexical/structural matchers inherit biases from

naming conventions and language resources. Instance-based approaches are sensitive

to distributional drift, identiﬁer inconsistencies, and noisy or limited gold standards.

Consequently, reported accuracy may overstate generalisability, and human validation

remains indispensable, introducing subjectivity [21,33].

Entity resolution and data fusion. Blocking and ﬁltering reduce comparison space

but can depress recall. Severe class imbalance biases classiﬁers toward majority decisions.

Fusion rules that ignore source reliability, provenance, and timeliness may propagate

systematic errors [

]. These factors become more acute at scale and under streaming

or near-real-time constraints.
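The recall cost of blocking can be made concrete with a small Python sketch. The blocking key (first letter of the name), the records, and the set of true matches are assumptions chosen so that one true match falls into different blocks.

```python
from itertools import combinations


def blocked_pairs(records, key_len=1):
    """Candidate pairs that share a blocking key (leading letters of the name)."""
    blocks = {}
    for record_id, name in records.items():
        blocks.setdefault(name[:key_len].lower(), []).append(record_id)
    return {
        frozenset(pair)
        for ids in blocks.values()
        for pair in combinations(ids, 2)
    }


def pair_recall(candidates, true_matches):
    """Share of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


# Hypothetical records: "a"/"b" and "c"/"d" refer to the same entities.
records = {
    "a": "Katherine Doe",
    "b": "Catherine Doe",
    "c": "Martin Smith",
    "d": "Martin Smyth",
}
true_matches = {frozenset({"a", "b"}), frozenset({"c", "d"})}

candidates = blocked_pairs(records)
all_pairs = {frozenset(pair) for pair in combinations(records, 2)}
recall_after_blocking = pair_recall(candidates, true_matches)
```

In this toy case blocking removes most comparisons but also loses the Katherine/Catherine match, which no downstream classifier can recover.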

Ontology-based and semantic integration. Alignment suffers from coverage gaps,

ambiguous correspondences, and evolving vocabularies. Fully automatic matching remains

elusive in heterogeneous domains, often requiring expert intervention that is costly and

potentially subjective. Alignment errors can propagate downstream and bias integrated

views [17,18,24].


System-level evaluation in federated/lakehouse settings. Heterogeneous connectors

and partial statistics create optimiser blind spots, yielding mis-costed pushdown and

inefﬁcient cross-source joins. Materialisation and caches risk staleness if not incrementally

maintained. Thus, reported performance may reﬂect system-speciﬁc assumptions rather

than general capability [6,7].

3. Storage Architectures for Analytical Workloads

This section discusses how storage choices shape analytics: row vs. column stores for OLTP/OLAP trade-offs, NoSQL models (key-value, document, and graph) for semi-structured needs, and cloud-native columnar formats and lakehouse layers for scale and reliability. It sets the stage for mapping workload patterns to the right storage medium.

3.1. Row-Oriented and Column-Oriented Stores

The choice of storage structure strongly affects the scalability and efﬁciency of ana-

lytical systems. Traditional row-oriented relational databases like PostgreSQL, MySQL,

and Oracle store data records sequentially, with the ﬁelds for each record located near each

other on the storage device. This structure is a very efﬁcient storage mechanism for the

typical Online Transaction Processing (OLTP) activities like reading or updating complete

records. Classic examples are updating a customer’s details and placing a new order [

Analytical queries, also known as Online Analytical Processing (OLAP), typically read a few columns across a very large number of rows (e.g., summing up sales amounts over many geographic regions or filtering over integer attributes). These queries perform poorly in row-oriented databases because every scan must read entire records, which inflates the number of input/output operations and wastes cache capacity on fields the query never uses. Column-oriented databases were introduced to address this weakness. Prominent examples of this category are MonetDB, Vertica, and ClickHouse, which store each column in separate contiguous blocks, thereby allowing for fast column scanning, higher compression ratios, and better CPU vectorisation [4,37,38].

Column storage architectures use a series of encoding methods like run-length encod-

ing (RLE), dictionary encoding, and bit-packing to store data efﬁciently while supporting

fast ﬁltering and aggregation. These storage frameworks for data also use advanced index-

ing methods like zone maps, along with bitmap indexes to boost analytical query execution

performance [

–

]. Row-oriented databases, long predominant in traditional real-time transactional environments, are no longer the only type of database in use. Column-store systems like Google BigQuery, Amazon Redshift, and Snowflake have become popular in analytical environments and cloud data warehousing scenarios. Figure 4 contrasts row stores and column stores and situates hybrid systems that support mixed workloads.
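Two of the encodings named above, dictionary encoding and run-length encoding, can be sketched in a few lines of Python. This is a didactic sketch over an in-memory list, not the on-disk layout of any particular column store; the region column is a made-up example.

```python
def dictionary_encode(values):
    """Replace each value by a small integer code plus a lookup dictionary."""
    dictionary, codes, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


# A low-cardinality column, as is common for dimensions such as region.
region = ["EU", "EU", "EU", "US", "US", "EU"]
dictionary, codes = dictionary_encode(region)
runs = run_length_encode(codes)
```

Both encodings benefit from the columnar layout itself: values of one attribute are stored together, so repeats and small domains are far more common than within a row.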

Modern systems like Apache HBase, along with SAP HANA, use dual-storage engine

or hybrid settings, which provide storage for the data in either a column-oriented or row-

oriented scheme and meet the varied needs of heterogeneous workloads accordingly [

This blending of storage models reflects a broader move toward matching the storage layout to the dominant access patterns of the workload.


Figure 4. Storage models and workload ﬁt.

3.2. Key-Value, Document, and Graph Databases

To manage semi-structured or non-tabular data efﬁciently, several NoSQL storage

models have been developed, each with distinct data models and query paradigms. Most

NoSQL storage systems can be classiﬁed into three main categories (Table 5): key-value

stores, document stores, and graph databases [44–46].

Table 5. NoSQL database categories, their key characteristics, and example systems.

| NoSQL Category | Characteristics | Examples |
|---|---|---|
| Key-value store | Data stored as key/value pairs, optimised for lookups | Amazon DynamoDB and Redis |
| Document store | Semi-structured JSON-like documents and flexible schema | MongoDB and Couchbase |
| Graph database | Data as nodes and edges; suited for relationships | Neo4j and Amazon Neptune |

Key-value stores like Redis, Amazon DynamoDB, and Riak organise data as pairs in which a unique key identifies a corresponding value, which is returned as an object or an unstructured blob. Their defining performance attributes are high read throughput, low latency, and distributed scalability, which greatly simplifies application design in areas like session management, caching, and the storage of very large record collections [


However, their primitive querying capability severely constrains their use in analytical scenarios unless it is complemented by secondary indexing or specialised processing methods. Document stores like MongoDB, Couchbase, and Amazon DocumentDB organise their data as hierarchical, JSON-like documents. Their key features are a flexible schema definition, nested structures, and aggregation pipelines, which greatly simplify relatively complex queries over semi-structured data [

]. Document stores are often used where the structure of the data changes frequently or where records are heterogeneous, for example in content management, product catalogues, and telemetry storage for Internet of Things scenarios.

Graph databases such as Neo4j, Amazon Neptune, and JanusGraph are designed to

store entities (nodes) and their relationships (edges), along with attributes. These databases

excel at handling highly interconnected datasets and support graph pattern matching via

languages such as Cypher, Gremlin, or the SPARQL Protocol and Resource Description

Framework (RDF) Query Language (SPARQL) [

]. In addition, they support a wide range of applications, from social network analysis and recommendation systems to fraud detection and the querying of bioinformatics data. Despite the schema flexibility and scalability of these NoSQL databases, issues arise concerning strong consistency guarantees when they are compared with the richer functionality of traditional relational database management systems (RDBMSs). As such, in the analytical settings considered in this review, these systems operate as adjunct stores or data sources in computational pipelines, adding value by enabling querying and aggregation in subsequent stages of computation.

To complement the categorical overview, Table 6 benchmarks the three NoSQL families on consistency/availability, scale-out, and query-path performance (including scans vs. traversals), consolidating peer-reviewed evidence.

Table 6. Benchmarking synthesis across NoSQL families used for graphs: consistency/availability and scale-out mechanics, plus workload-level performance (scans vs. traversals) [44,48–50].

| Metric | Key-Value (e.g., Dynamo) | Document (e.g., MongoDB/CouchDB/Couchbase) | Graph DBs (LPG/RDF) |
|---|---|---|---|
| Consistency vs. availability | "Always writable" design, quorum-tuned R/W, eventual consistency with vector clocks, and hinted handoff for node outages | Stronger per node consistency, cluster settings vary by product, and designed for CRUD with JSON/BSON | Transaction models vary; many support ACID for OLTP, and global analytics are typically read-only |
| Partitioning/scale-out | Consistent hashing, virtual nodes, sloppy quorum, and seamless node add/remove function | Sharding and replica sets are common, as well as per collection partitioning | Sharding/replication depend on the engine; many native stores optimize locality for traversals |
| Write path | Fast, partition-local writes; durability ensured via a configurable W | Bulk inserts and high-rate CRUD; durability ensured via journaling/replica sync | OLTP writes supported (engine-dependent), and heavy analytics often separated from the write path |
| Scan/range queries | Limited (key-oriented); range needs secondary/indexed paths | YCSB: MongoDB has the best overall runtime, while a scan-heavy workload makes CouchDB faster and CouchDB scales best with threads | Scans expressed via label/property predicates; the cost depends on index design; not a primary strength vs. documents |
| Traversals/locality | Multi-hop traversals require app-level joins or pre-materialization | Multi-hop joins across collections are costly and not traversal-centric | Native adjacency (adjacency lists and direct pointers), and traversal cost grows with the number of visited subgraphs, not graph size; suited to path/pattern queries |
| Indexing | Primary key and optional secondary indexes for ranges/filters | Rich secondary and compound indexes; text/geo often available | Structural (neighbourhood) + data indexes; languages expose pattern/path operators |
| Query languages | KV APIs and app-side composition | Aggregation pipelines and SQL-like DSLs | SPARQL (RDF), Cypher/Gremlin (LPG), and mature pattern/path semantics |

3.3. Cloud-Native Formats and Distributed File Systems (e.g., Parquet and Delta Lake)

Recent developments in cloud-native infrastructure and data lakes have accelerated the adoption of open file formats on distributed storage systems. The columnar formats Apache Parquet and Apache Optimised Row Columnar (ORC), together with the row-oriented Avro format, enable compact storage and efficient query operations for large amounts of structured and semi-structured data in cloud-centric and distributed environments. Beyond disk-oriented formats, evaluations also consider in-memory columnar exchange frameworks such as Apache Arrow and its Feather variant. While Arrow/Feather are not designed for long-term storage, they provide extremely fast (de)serialisation and zero-copy interoperability across engines, making them an important baseline in benchmarking studies [1,51].

Parquet is a columnar storage format designed to support nested data structures and optimised for read-heavy workloads. It is compatible with multiple big data frameworks, such as Apache Spark, Hive, Dremio, and Amazon Web Services (AWS) Athena. Its efficient encodings and rich metadata (for example, column statistics) make query filtering and planning more efficient, reducing the need for full scans. Likewise, ORC is designed to enhance the performance of reading and writing and includes advanced compression mechanisms, predicate pushdown, and indexing features [

]. Table 7 consolidates results on Parquet, ORC, and Arrow/Feather, highlighting compression efficiency, query performance, nested data handling, and workload-specific trade-offs as reported in peer-reviewed studies.
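The way metadata reduces the need for full scans can be illustrated with a pure-Python sketch of min/max statistics per block of rows (the idea behind zone maps and statistics-based skipping). The block size, the amounts column, and the function names are assumptions for this sketch; it does not reproduce the Parquet or ORC file layout or any library API.

```python
def build_row_groups(values, group_size):
    """Split a column into blocks and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of blocks actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove that no value in this block matches
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


amounts = [5, 7, 9, 12, 40, 42, 45, 48, 90, 95, 97, 99]
groups = build_row_groups(amounts, group_size=4)
matches, groups_read = scan_greater_than(groups, threshold=50)
```

Skipping works well here because the column is clustered; on unsorted data the min/max ranges of the blocks overlap and fewer blocks can be excluded, which is one reason reported pushdown benefits vary by workload.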

Lakehouse architectures emerged in response to the limitations of raw data lakes, namely insufficient data management and the lack of Atomicity, Consistency, Isolation, and Durability (ACID) compliance. They take advantage of cloud-native infrastructure to provide support for transactions, versioning, incremental schema evolution, and temporal capabilities [3,10].


Table 7. Benchmarking results for open columnar formats: reported performance characteristics of Parquet, ORC, and Arrow/Feather [1,2,41,52].

| Metric | Parquet | ORC | Arrow/Feather |
|---|---|---|---|
| Compression ratio | Strong overall compression, especially with dictionary encoding | Often higher compression on structured/numeric workloads | Not a storage format; minimal compression; focuses on speed |
| Scan/decoding speed | Faster end-to-end decoding in mixed workloads | Slightly slower, but predicate evaluation is stronger | Fastest (de)serialisation throughput; zero-copy in-memory |
| Predicate pushdown/skipping | Effective but limited by column statistics | Fine-grained zone maps yield strong selective query performance | Not applicable (in-memory only) |
| Nested data handling | R/D-level encoding and efficient leaf-only access | Presence/length streams; overhead increases with depth | Dependent on producer/consumer; no disk encoding |
| Workload trade-offs | Performs best on wide tables and vectorised execution | Strong on narrow/deep workloads with high selectivity | Best as interchange for ML/analytics pipelines |

In addition, lakehouse architectures provide a single platform that combines the scala-

bility features usually found with data lakes and the reliability and ease of management

that deﬁne data warehouses. Current storage infrastructure enables the use of batch and

streaming data, allows for uniﬁed read–write operations across multiple data nodes, and

provides inherent compatibility with SQL engines, along with Spark [

]. Additionally, these storage platforms address persistent issues in large-scale analytical systems, such as keeping data fresh, enforcing schema constraints on the server side, and ensuring consistency for update operations. Their adoption is growing rapidly in analytical pipelines across industries such as finance, healthcare, advertising, and scientific exploration. Notably, lakehouse platforms and cloud-native storage platforms are storage-agnostic, which improves interoperability across heterogeneous storage backends like Amazon S3, Google Cloud Storage, Azure Data Lake, and others. This capability enables multi-cloud adoption, with streamlined integration across multiple environments. Combined with federated query engines like Presto, Trino, and Starburst, these platforms provide the storage foundation for modern scale-out analytics systems [7,53–55].

4. Bridging Integration and Storage

This section examines how data mobility and accessibility interact through mechanisms such as ETL and ELT pipelines, data virtualisation, and federated SQL engines, and through the compromises involved in schema-on-read and schema-on-write. Choosing among these tools aims to minimise redundancy and increase agility while maintaining performance and governance levels.

4.1. ETL/ELT Pipelines and Data Virtualisation

In heterogeneous environments, integration depends on efﬁciently moving and trans-

forming data across diverse storage backends rather than enforcing a single schema upfront.

Extract–Transform–Load (ETL) and Extract–Load–Transform (ELT) are the main methods

used to enable this type of integration [5,56].


In conventional ETL, data is extracted, then cleaned and transformed to conform to

a normalised target schema. Eventually, the data is stored in a main repository, which is

most often a relational data warehouse [

]. These practices are common in enterprise

business intelligence scenarios, where data quality maintenance, along with conformance

with schema speciﬁcations, is an essential requirement prior to the data ingestion step.

Commercial ETL offerings like Talend, Informatica, and Pentaho have powerful platforms

for the design of transformation workﬂows with rule-based data cleansing, deduplication,

and aggregation capabilities [56,57].

Nevertheless, the arrival of storage solutions featuring schema-on-read abilities, along

with cloud-native data lakes, has made it possible for the ELT methodology to be used

more extensively [

]. In such a context, unstructured or semi-structured data is first stored in a horizontally scalable data repository like Amazon S3, Hadoop Distributed File System (HDFS), or Azure Data Lake and is then transformed at query time or at a later stage of the processing pipeline. This approach increases flexibility and ingestion throughput, especially for streaming data, frequently changing schemas, or exploratory analysis. Both ETL and ELT require careful management of metadata, schema versioning, and data lineage tracking so that transformations remain repeatable and auditable.
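The difference in ordering between the two approaches can be sketched in Python. The in-memory lists standing in for a warehouse and a data lake, the field names, and the decision to silently skip malformed rows are all simplifications for this sketch; a production pipeline would log or quarantine rejected rows and record lineage.

```python
import json


def transform(raw):
    """Conform a raw record to the (hypothetical) target schema."""
    return {"customer_id": str(raw["id"]), "amount": round(float(raw["amount"]), 2)}


def etl(source_rows, warehouse):
    """ETL: transform before loading, so malformed rows never reach the target."""
    for raw in source_rows:
        try:
            warehouse.append(transform(raw))
        except (KeyError, TypeError, ValueError):
            continue  # rejected upstream of the warehouse


def elt_load(source_rows, lake):
    """ELT, step 1: land every row in raw form."""
    for raw in source_rows:
        lake.append(json.dumps(raw))


def elt_transform(lake):
    """ELT, step 2: apply the transformation later, at or near query time."""
    conformed = []
    for line in lake:
        try:
            conformed.append(transform(json.loads(line)))
        except (KeyError, TypeError, ValueError):
            continue
    return conformed


rows = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "n/a"}, {"amount": "3"}]
warehouse, lake = [], []
etl(rows, warehouse)
elt_load(rows, lake)
late_bound = elt_transform(lake)
```

Both paths yield the same conformed rows here, but only the ELT path retains the two malformed records for later inspection or for a revised transformation.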

Declarative pipeline and transformation tools like Apache NiFi, Airbyte, dbt (data build tool), and Dagster offer features for automatic testing, distribution, and scheduling. Their integration with cloud storage infrastructure, version control systems like Git, and distributed computation environments increases their value within modern data integration solutions [

]. Data virtualisation offers an alternative or complement to traditional ETL/ELT practices: a logical abstraction layer integrates heterogeneous data sources without prior data movement or transformation. Denodo, Red Hat JBoss Data Virtualisation, and Dremio use such abstraction layers to provide unified views over heterogeneous back-end systems through federated query engines. These solutions integrate relational databases, NoSQL databases, REST APIs, and file storage within a semantic modelling context, thereby allowing end users to query the combined data using standard SQL while the platform manages data access in the background [

]. To address heterogeneous data pipelines, one may use ETL for warehouses, ELT on cloud storage, or logical data virtualisation. Table 8 compares these approaches and tools. Virtualisation minimises redundancy, improves query response efficiency in today's data environments, and allows multiple analytical models to be integrated. On the downside, it can degrade performance because of latency, data inconsistencies, or the unavailability of data sources while a query is being processed. For these reasons, hybrid infrastructures often use virtualisation for on-demand querying, complemented by ETL/ELT practices for structured data management.

Table 8. Comparison of data integration strategies: ETL, ELT, and data virtualisation.

| Approach | Process | Typical Tools |
|---|---|---|
| ETL (Extract–Transform–Load) | Extract → Transform → Load into data warehouse | Talend, Informatica, and Pentaho |
| ELT (Extract–Load–Transform) | Extract → Load (raw) → Transform later in data lake | Hadoop HDFS, Spark, and dbt |
| Data virtualisation | Virtual layer integrates sources in real time | Denodo, Dremio, and Presto/Trino |


Decision Guidance: ETL vs. ELT vs. Virtualisation

ETL is preferable when conformance and upstream data quality must precede ana-

lytical use (e.g., governed marts and slowly evolving schemas) [

]. ELT is effective for

high-volume, schema-evolving feeds landing in open columnar storage. Enforcement is

deferred to query time or periodic consolidation in the lakehouse, leveraging vectorised

execution engines [

]. Virtualisation/federation is appropriate where duplication is

undesirable or sources must remain authoritative. Performance hinges on connector-aware

pushdown and cost-guided planning [

]. In practice, hybrid deployments persist (materialise) frequently used aggregates while federating the long tail, with promotion from late-bound to enforced schemas once contracts stabilise (see Table 8 and Figure 5).

Figure 5. Federated SQL engine architecture and pushdown.

4.2. Federated Querying and Uniﬁed Query Engines

Federated query systems are central to storage integration. Federated systems allow

the end user to run analytical queries on different heterogeneous data sources through a

single uniﬁed interface. As opposed to traditional integration methods based on materi-

alised data warehouses, federated systems operate under a query-time integration model,

which allows for synchronous and dynamic integration of multiple sources of data [6,55].

Prominent federated query engines include Presto, Trino (formerly known as PrestoSQL),

Apache Drill, and Starburst Enterprise. These engines provide a uniﬁed SQL-focused

querying interface for users by interacting with multiple relational databases, NoSQL

databases, cloud object storage services, and distributed ﬁle systems using connectors [

A single query can carry out joins, groupings, and filtering across several systems, like MySQL, MongoDB, Hive, and S3, without the need for data relocation or pre-loading.

Such engines are characterised by advanced features such as the following:

• Cost-based query optimisation over various backend systems;

• Predicate pushdown, allowing filtering operations to be executed near their corresponding data sources;

• Parallel execution plans, enabling distributed scalability;

• Connector extensibility for custom data-source support.

Figure 5 shows a federated SQL engine orchestrating connectors, predicate pushdown, and parallel execution across heterogeneous sources.
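The effect of predicate pushdown on data transfer can be illustrated with a toy connector model in Python. The Connector class, its supports_pushdown flag, and the orders data are hypothetical constructs for this sketch and do not correspond to the connector interface of Presto, Trino, or any other engine.

```python
class Connector:
    """Toy data source that counts how many rows it ships to the engine."""

    def __init__(self, rows, supports_pushdown):
        self.rows = rows
        self.supports_pushdown = supports_pushdown
        self.rows_transferred = 0

    def scan(self, predicate=None):
        if predicate is not None and self.supports_pushdown:
            result = [row for row in self.rows if predicate(row)]  # filter at source
        else:
            result = list(self.rows)  # ship everything to the engine
        self.rows_transferred += len(result)
        return result


def federated_filter(connector, predicate):
    """Engine-side operator: always re-applies the predicate for correctness."""
    return [row for row in connector.scan(predicate) if predicate(row)]


orders = [{"id": i, "amount": i * 10} for i in range(1, 101)]


def is_large(row):
    return row["amount"] > 900


with_pushdown = Connector(orders, supports_pushdown=True)
without_pushdown = Connector(orders, supports_pushdown=False)
result_pushed = federated_filter(with_pushdown, is_large)
result_unpushed = federated_filter(without_pushdown, is_large)
```

The query result is identical in both cases; only the volume crossing the connector differs, which is why connector capabilities matter for cost estimation.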

To provide a practical comparison, Table 9 benchmarks representative federated engines across optimisation, execution, and scalability features.

In independent YCSB experiments, MongoDB achieves the best overall runtime across

diverse CRUD workloads, while scan-heavy workloads favour CouchDB, which also

exhibits the strongest thread scale-up among the three evaluated document stores. In

contrast, native graph engines optimise adjacency locality (e.g., adjacency lists and, in some

designs, direct pointers), so traversal cost is governed by the visited subgraph rather than

the total graph size.
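The locality property of adjacency-based traversal can be shown with a bounded breadth-first search in Python. The graph, the vertex names, and the edge counter are assumptions for this sketch; a dictionary of adjacency lists stands in for a native graph store.

```python
from collections import deque


def neighbours_within(adjacency, start, max_hops):
    """Vertices reachable within max_hops, plus the number of edges examined."""
    seen = {start}
    frontier = deque([(start, 0)])
    edges_examined = 0
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            edges_examined += 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return seen - {start}, edges_examined


adjacency = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": ["erin"]}
# Pad the graph with 1000 unrelated vertices that the traversal never touches.
for i in range(1000):
    adjacency[f"user{i}"] = [f"user{(i + 1) % 1000}"]

reached, edges_examined = neighbours_within(adjacency, "alice", max_hops=2)
```

Only three edges are examined even though the graph holds more than a thousand, whereas a join-based formulation over an edge table would typically need an index or a scan to find each hop.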

The capabilities of the systems have been documented and evolved in production

deployments [

]. For instance, a ﬁnancial analyst can leverage Trino to combine reference

data gleaned from a PostgreSQL table with transactional data stored in Parquet format

within an Amazon S3 bucket, along with metadata gained from a MongoDB collection,

without resorting to reorganising or duplicating data [

]. Federated querying reduces data transfer costs. At the same time, it introduces latency and raises concerns about the availability of data sources and the maintenance of consistency guarantees. In addition, performance is limited by restrictions specific to certain data sources, which are exacerbated by a lack of detailed indexes or statistics. To counter these issues, systems use caching layers, materialised views, or the reuse of query results. Orchestration platforms (for example, Airflow) and catalogue services (for example, AWS Glue and Hive Metastore) are integrated with federated engines to manage schema definitions and improve execution efficiency. These components form the backbone of lakehouses, allowing for real-time or near-real-time analysis of structured and semi-structured data in formats such as Delta Lake, Apache Iceberg, or ORC [3,10].

4.3. Schema-on-Read vs. Schema-on-Write Trade-Offs

A key design trade-off is choosing between schema-on-write and schema-on-read,

which reﬂect different philosophies for schema management, validation, and integration.

In schema-on-write, data is required to conform to a predeﬁned schema before be-

ing ingested into the system. This approach, prevalent in traditional RDBMS and data

warehouses, enforces strong consistency, data quality, and semantic clarity. It facilitates

indexing, integrity checks, and query optimisation. However, it also introduces rigidity,

requiring costly transformation steps and limiting the ability to incorporate evolving or

irregular data sources quickly [56].

On the other hand, the schema-on-read approach postpones the creation of the schema

until the execution of a query takes place. This approach is commonly used, together with

data lakes and semi-structured storage implementations, using data types like JSON or

Avro, thereby enabling more ﬂexibility plus better data ingestion mechanisms [

]. It allows raw or loosely structured data to be stored while the application of the schema is deferred, so that analysts and researchers can begin their analyses with little preprocessing. This method is particularly useful for experimental studies, ad hoc analysis, and rapid prototyping.
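The contrast between the two philosophies can be sketched in Python with a single hypothetical schema applied at two different points. The SCHEMA mapping, the sensor records, and the coercion rules are assumptions for this sketch.

```python
# Hypothetical target schema: field name -> required Python type.
SCHEMA = {"sensor_id": str, "reading": float}


def write_with_schema(store, record):
    """Schema-on-write: reject non-conforming records at ingestion."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} violates the schema")
    store.append(record)


def read_with_schema(raw_store):
    """Schema-on-read: bind the schema at query time, coercing where possible."""
    for record in raw_store:
        try:
            yield {
                "sensor_id": str(record["sensor_id"]),
                "reading": float(record["reading"]),
            }
        except (KeyError, TypeError, ValueError):
            continue  # unreadable under this schema; the raw record is kept


raw_store = [
    {"sensor_id": "s1", "reading": 1.5},
    {"sensor_id": 7, "reading": "2.5"},  # wrong types, but coercible
    {"sensor_id": "s3"},                 # missing field
]

strict_store, rejected = [], 0
for record in raw_store:
    try:
        write_with_schema(strict_store, record)
    except ValueError:
        rejected += 1

late_bound = list(read_with_schema(raw_store))
```

The write-time path accepts one record and rejects two, while the read-time path recovers a second record through coercion, at the price of repeating that work (and that interpretation) on every query.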


Table 9. Benchmarking of federated SQL engines (Presto, Trino, Drill, and Starburst) across key performance-related features [6,7,12,55,60].

| Metric | Presto | Trino | Drill | Starburst |
|---|---|---|---|---|
| Connector support | Broad set of connectors and production deployments at scale | Broad OSS connector base; fork of Presto with added features | Schema-free connectors for JSON, NoSQL, and files | Commercial distribution; adds enterprise connectors and governance |
| Predicate pushdown | Supported across RDBMS, Hive, and columnar formats | Supported across most connectors | Predicate pushdown for JSON and columnar data | Extended pushdown support with enterprise optimisations |
| Cost-based optimisation | CBO with table/column statistics and an adaptive join order | Cost-aware planning with statistics integration | Primarily rule-based; limited CBO | Enterprise-grade CBO with workload-aware tuning |
| Execution model | Massively parallel and pull-based execution | Similar to Presto; optimised scheduling | Vectorised operator pipeline | Enhanced parallelism and workload management |
| Caching and materialisation | Result caching, materialised views, and SSD spill options | Spill to disk; MV support in OSS is limited | Reader-level pruning and limited caching features | Adds advanced caching and MV rewriting |
| Fault tolerance | Recoverable grouped execution; Presto-on-Spark variant | Retry-based external FT extensions | No built-in query recovery | Enterprise-level FT and workload isolation |
| Production use evidence | Exabyte-scale at Meta; interactive + ETL workloads | Large-scale OSS and enterprise deployments | Interactive ad hoc analysis over schema-free data | Widely adopted in regulated industries |

However, schema-on-read is challenging to implement for the following reasons:

• Schema interpretation can vary across queries or individuals;

• Complex transformations shift to query time;

• Metadata for large datasets with dynamically changing schemas is difficult to manage and requires careful analysis;

• Query performance degrades because schema binding is deferred to query time.

Proposed solutions involve hybrid approaches that address these fundamental trade-offs and improve the effectiveness of delta computation. Hybrid lakehouse architectures combine the schema flexibility of schema-on-read with properties commonly found in traditional data warehouses, including schema enforcement, version management, and governance [

]. Delta Lake and Apache Iceberg support selective enforcement of schemas and data types, as well as partition pruning, while still accommodating raw and semi-structured data. One key improvement in this area is the use of metadata-driven schemas, wherein data is checked against automatically inferred schemas constructed with the help of tools such as Great Expectations, Deequ, or DataHub.

These schemas evolve over time and can serve as templates for future verification or transformation processes, thereby removing the need for strict enforcement at ingestion. In practice, instance-level ambiguities such as locale-dependent dates (e.g., “01/12/2025”) are handled by preferring schema-on-write conformance (canonical ISO at ingestion) and, where infeasible, by a hybrid read-time normaliser with explicit lineage and quality expectations.
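The idea of an inferred schema that serves as a template for later verification can be sketched in plain Python. This sketch does not use or imitate the API of Great Expectations, Deequ, or DataHub; the function names, the type-name representation, and the sample records are assumptions.

```python
def infer_schema(records):
    """Deduce, for each field, the set of type names observed so far."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def check_against(schema, record):
    """Report deviations from the inferred schema instead of rejecting the record."""
    issues = []
    for field, value in record.items():
        if field not in schema:
            issues.append(f"unexpected field: {field}")
        elif type(value).__name__ not in schema[field]:
            issues.append(f"type drift in {field}: {type(value).__name__}")
    return issues


history = [
    {"id": 1, "ts": "2025-12-01"},
    {"id": 2, "ts": "2025-12-02", "note": "late arrival"},
]
schema = infer_schema(history)
issues = check_against(schema, {"id": "3", "ts": "2025-12-03", "zone": "EU"})
```

Because the check reports rather than blocks, the new record can still be ingested, and the schema can later be widened (or the source corrected) once the drift has been reviewed.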


In summary, the decision between schema-on-write and schema-on-read need not be a strict binary. Rather, it is guided by considerations such as the requirements of the application, the nature of the workload, the sources of data variability, and governance requirements. Systems that adopt a hybrid strategy and can transition smoothly between the two approaches are regarded as offering flexibility and continued effectiveness over time.

5. Metadata, Lineage, and Semantic Interoperability

Metadata is the glue that holds together discoverability, governance, and trust: catalogues and lineage tools make data findable and traceable, while ontologies and semantic enrichment define meaning across sources. Together, they form a basis for reproducibility, auditability, and greater integration at scale.

5.1. Metadata Management Frameworks (e.g., Apache Atlas and DataHub)

The growth of large and complex data ecosystems has brought with it increased

recognition of the value of metadata—sets of data describing other sets of data—as a key

factor for integration, discoverability, governance, and access control. Several metadata

catalogue systems (Table 10) provide scalable repositories for technical, operational, and

business metadata. In many analytical applications, keeping metadata coherent and

current is important because it helps mitigate problems related to schema discrepancies,

transformations, and a lack of semantic homogeneity [5,61,62].

Modern metadata standards outline a unified structure that integrates three classes of

metadata: technical metadata (schemas, types, formats, and partitions), operational

metadata (recency, lineage, and access history), and business metadata (descriptions,

ownership, and usage policies). Numerous reliable open-source offerings like Apache Atlas, LinkedIn

DataHub, and Amundsen offer horizontally scaled architectures that prove highly efﬁcient

in dealing with the intake, storage, and querying of metadata with diverse origins [5,63].
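To make the three metadata classes concrete, the following minimal sketch models a catalogue entry and a keyword search over it. All class and field names are hypothetical; they do not reproduce the data model of Apache Atlas, DataHub, or Amundsen.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    schema: dict[str, str]  # column name -> type
    file_format: str
    partitions: list[str] = field(default_factory=list)


@dataclass
class OperationalMetadata:
    last_updated: datetime
    upstream: list[str] = field(default_factory=list)  # lineage: input datasets


@dataclass
class BusinessMetadata:
    description: str
    owner: str
    usage_policy: str


@dataclass
class CatalogueEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


class Catalogue:
    """In-memory stand-in for a metadata repository (illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return dataset names whose name or description contains the term."""
        needle = term.lower()
        return sorted(
            name
            for name, e in self._entries.items()
            if needle in name.lower() or needle in e.business.description.lower()
        )


catalogue = Catalogue()
catalogue.register(CatalogueEntry(
    name="sales.orders",
    technical=TechnicalMetadata({"order_id": "bigint", "total": "double"}, "parquet", ["order_date"]),
    operational=OperationalMetadata(datetime(2025, 1, 1, tzinfo=timezone.utc), ["raw.orders"]),
    business=BusinessMetadata("Confirmed customer orders", "sales-team", "internal"),
))
catalogue.register(CatalogueEntry(
    name="hr.payroll",
    technical=TechnicalMetadata({"employee_id": "bigint"}, "parquet"),
    operational=OperationalMetadata(datetime(2025, 1, 1, tzinfo=timezone.utc)),
    business=BusinessMetadata("Monthly payroll runs", "hr-team", "restricted"),
))
print(catalogue.search("orders"))  # ['sales.orders']
```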

The platforms also integrate with data-processing and orchestration tools like Apache

Spark and Airflow and with storage options like Hive, S3, and Delta Lake, and they provide

governance features like Role-Based Access Control (RBAC) and audit logging. Additionally,

these platforms support Application Programming Interfaces (APIs) and user interfaces

built speciﬁcally for data engineers, analysts, and data stewards. Some of the most impor-

tant features that signiﬁcantly impact discoverability and interpretability of data include

tagging, searchability, lineage visualisations, and automatic classiﬁcation of data [

A centralised metadata repository greatly reduces redundancy, encourages the reuse of

schemas, and enables governance for schema development. Additionally, the contribution

of metadata is gaining wider prominence in data quality evaluation, audit activities related

to regulatory compliance checking, and impact assessment in highly regulated industries

like healthcare and ﬁnancial services. As various datasets are being brought together

with increasingly mature methods of integration, metadata systems become even more

important for consistency and auditability across the data ecosystem [5].

Table 10. Open-source metadata catalogue frameworks and their integration capabilities.

Platform | Architecture & Integration | Key Features
Apache Atlas | Metadata repository, native to Hadoop | REST APIs, lineage, audit logging, and RBAC
LinkedIn DataHub | Distributed service; platform-agnostic | Metadata ingestion, search UI, and versioning
Lyft Amundsen | Graph-backed discovery | Lineage graphs, discovery UI, and access control

5.2. Ontology-Based Integration and Semantic Enrichment

Together with structural metadata, semantic metadata clarifies what data elements

mean and how they relate to each other, thereby enabling the integration of disparate

and distributed systems. Ontologies formally describe domain concepts and relations,

providing a foundation for semantic interoperability.

Ontologies become key tools within data integration spaces, enabling schema mapping,

entity reference disambiguation, and context alignment. For example, a single entity can

be represented with different terminologies within two datasets, for instance, “client” or

“customer”, or appear through aggregated data at differing hierarchical levels, for instance,

“monthly income” or “annual earnings”. Schema elements can be related to ontology classes

using RDF or OWL. These links allow systems to infer relationships such as equivalence,

subsumption, or aggregation.
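A minimal sketch of this inference is given below. It uses plain Python dictionaries in place of an RDF/OWL toolkit, and the class names (`ex:Customer`, `ex:Party`) and schema elements are hypothetical. Only equivalence and subsumption are covered; aggregation across granularities (monthly versus annual) would need additional unit and period annotations.

```python
# Hypothetical ontology fragment: class -> parent class (None for a root).
SUBCLASS_OF = {
    "ex:Party": None,
    "ex:Customer": "ex:Party",
}

# Hypothetical annotations linking schema elements to ontology classes.
ANNOTATIONS = {
    "crm.client": "ex:Customer",
    "shop.customer": "ex:Customer",
    "hr.contact": "ex:Party",
}


def ancestors(cls: str) -> set[str]:
    """All superclasses of cls, following the subclass chain to the root."""
    found = set()
    parent = SUBCLASS_OF.get(cls)
    while parent is not None:
        found.add(parent)
        parent = SUBCLASS_OF.get(parent)
    return found


def relation(a: str, b: str) -> str:
    """Infer how schema element a relates to schema element b."""
    class_a, class_b = ANNOTATIONS[a], ANNOTATIONS[b]
    if class_a == class_b:
        return "equivalent"
    if class_b in ancestors(class_a):
        return "subsumed_by"
    if class_a in ancestors(class_b):
        return "subsumes"
    return "unrelated"


print(relation("crm.client", "shop.customer"))  # equivalent
print(relation("crm.client", "hr.contact"))     # subsumed_by
```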

Ontologies are widely applied in many scientific domains, including bioinformatics

(the Gene Ontology); environmental science (the Semantic Web for Earth and Environmental

Terminology (SWEET) ontologies); and cultural heritage (the International Committee

for Documentation Conceptual Reference Model (CIDOC CRM)) [

–

]. In business, application domain-speciﬁc terminologies, e.g.,

the Financial Industry Business Ontology (FIBO) for financial services and Health Level Seven

(HL7) for healthcare, are crafted to enhance data standardisation and interoperability [

The use of entity-linking techniques, combined with text annotation procedures and

knowledge graph improvement, enables dynamic semantic enrichment. Dynamic semantic

enrichment is the process by which information is contextualised with external knowledge

bases like Wikidata, DBpedia, and domain-speciﬁc taxonomies [

]. Dynamic semantic

enrichment increases the data’s discoverability, provides query support through natural

language user interfaces, and enables reasoning across diverse and heterogeneous datasets.

In spite of their promise, ontology-based solutions face a variety of challenges, such as

problems related to ontological alignment, high computational costs, and reluctance to

apply them in settings with insufﬁcient expertise.

Recent ontology alignment frameworks pair retrieval with LLM prompting to raise

unsupervised matching quality while curbing calls to the model. MILA reports the top

F-measure on multiple OAEI tasks and a lower runtime through a prioritised search

pipeline [

]. Complementarily, LLMs4OM explores zero-shot and representation-aware

prompting for ontology matching across diverse ontology views, illustrating how founda-

tion models can support semantic interoperability [18].

Technical advances such as lightweight ontologies, schema annotation, and semantic

data stores can help increase the scalability and universality of semantic integration.

5.3. Lineage Tracking and Schema Evolution Handling

Data lineage traces the full life cycle of a dataset, from source to processing to ﬁnal

use. Within integrated environments, lineage analysis explains the data origin, validates

its integrity, and detects anomalies in data pipelines. It also supports compliance with

regulatory requirements, like the “right to be forgotten” imposed by the General Data

Protection Regulation (GDPR); improves reproducibility; and supports cooperation [

Lineage can be captured at several levels of granularity:

• Table-level lineage describes the relationship between datasets, such as how table A is

created from tables B and C.

• Column-level lineage provides precise mappings that show how individual columns

are derived through transformation processes.

• Code-level lineage links transformations to the specific jobs or scripts that performed

them, together with their execution environments and parameter settings.
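The three lineage levels can be illustrated with a small in-memory graph. This is a sketch under simplifying assumptions: the `LineageGraph` class, the job names, and the `table.column` naming scheme are hypothetical and are not the API of Marquez, OpenLineage, or Apache Atlas.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store covering table, column, and code (job) level."""

    def __init__(self) -> None:
        self.table_parents: dict[str, set[str]] = defaultdict(set)
        self.column_parents: dict[str, set[str]] = defaultdict(set)
        self.produced_by: dict[str, str] = {}  # table -> job or script name

    def record_job(self, job: str, target: str, column_map: dict[str, list[str]]) -> None:
        """column_map maps each target column to the 'table.column' inputs it derives from."""
        self.produced_by[target] = job
        for out_col, inputs in column_map.items():
            for src in inputs:
                self.column_parents[f"{target}.{out_col}"].add(src)
                self.table_parents[target].add(src.split(".")[0])

    def upstream_tables(self, table: str) -> set[str]:
        """Transitive closure of table-level parents (depth-first traversal)."""
        seen: set[str] = set()
        stack = [table]
        while stack:
            for parent in self.table_parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = LineageGraph()
graph.record_job("load_b.py", "B", {"id": ["raw_orders.id"], "amount": ["raw_orders.amount"]})
graph.record_job("build_a.py", "A", {"id": ["B.id"], "total": ["B.amount", "C.rate"]})

print(sorted(graph.upstream_tables("A")))         # ['B', 'C', 'raw_orders']
print(sorted(graph.column_parents["A.total"]))    # ['B.amount', 'C.rate']
print(graph.produced_by["A"])                     # build_a.py
```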

Tools like Marquez, OpenLineage, and Apache Atlas offer fine-grained APIs and

query interfaces for capturing and retrieving lineage metadata about data movements.

These tools integrate with pipeline orchestrators like Dagster and Airflow, which gives

a clearer view of data movement across platforms [

]. Figure 6

provides an overview of lineage tracking and schema evolution, connecting dataset lifecycle

stages with lineage levels, supporting tools, and schema evolution mechanisms.

A key issue at this level is schema evolution, that is, changes to the data structure

over time, such as adding new columns, changing data types, or renaming columns. In

conventional static ETL frameworks, such changes often break processing or introduce

inconsistencies. To support schema evolution, systems must do the following:

• Record and document changes to the schema;

• Validate compatibility across pipeline stages;

• Enable both backward- and forward-compatible reading.
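The compatibility requirement can be sketched as a simple check. The definitions below follow the convention common to Avro-style schema registries (backward: the new schema can read data written with the old one; forward: the old schema can read data written with the new one). The `Field` type is hypothetical and type promotion is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    has_default: bool = False


Schema = dict[str, Field]


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if a consumer using `reader` can decode records written with `writer`."""
    for name, reader_field in reader.items():
        writer_field = writer.get(name)
        if writer_field is None:
            # Field absent from the data: readable only if the reader has a default.
            if not reader_field.has_default:
                return False
        elif writer_field.type != reader_field.type:
            return False  # simplification: no type promotion rules
    return True


def backward_compatible(new: Schema, old: Schema) -> bool:
    return can_read(reader=new, writer=old)


def forward_compatible(new: Schema, old: Schema) -> bool:
    return can_read(reader=old, writer=new)


v1: Schema = {"id": Field("long"), "name": Field("string")}
v2: Schema = {**v1, "email": Field("string", has_default=True)}  # added with default
v3: Schema = {**v1, "email": Field("string")}                    # added without default

print(backward_compatible(v2, v1), forward_compatible(v2, v1))  # True True
print(backward_compatible(v3, v1), forward_compatible(v3, v1))  # False True
```

Running such a check at each pipeline stage before deployment is one way to validate compatibility without reprocessing historical data.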

Schema history is well supported by columnar storage formats, such as Parquet,

complemented by transactional table formats like Apache Iceberg and Delta Lake. These

underlying technologies provide the necessary flexibility across multiple environments

while preserving reliability and reproducibility throughout the process [2,3,10].

At a deeper level, the elements of metadata, lineage, and semantic context become

foundational cornerstones for the intelligent integration platform. These aspects outline

the essential guidelines that empower systems to explore relationships between the data,

correct inaccuracies, conduct automatic transformations, and improve the frameworks for

governance [5,71].

Figure 6. Lineage tracking and schema evolution: dataset lifecycle, lineage levels (table, column, and

code), tools (Apache Atlas, Marquez/OpenLineage, Dagster, and Airﬂow), and schema evolution

with Parquet, Delta Lake, and Apache Iceberg, ensuring compatibility and reproducibility.

6. Performance, Scalability, and Consistency Challenges

The problems involved in operating over varying backends can be broken down

for analysis. These include cost-based query optimisation strategies, materialisation and

caching, and the real-world implications of freshness and eventual consistency. This

section provides insight into the mechanisms and trade-offs that govern performance in

realistic situations.

6.1. Query Optimisation Across Heterogeneous Sources

Federated query execution poses complex optimisation challenges due to the diversity

of underlying sources. Compared with traditional monolithic databases, which provide

the optimiser with full control over schemas, statistical information, and execution plans,

federated and multi-model databases face challenges involving stale metadata, diverse

query capability, and non-uniform performance characteristics across a heterogeneous set

of sources [7,14].

One of the key challenges is the derivation of cross-source query plans that efﬁciently

minimise data transfer while leveraging local computational resources. For example, a

naive join between a local relational database and a remote NoSQL database can generate

substantial network traffic if intermediate results are not filtered early. Predicate

pushdown, which delegates filters and projections to the underlying source systems,

is a common optimisation strategy that reduces the data to be transferred. However,

its effectiveness depends on whether each source supports the corresponding query

operators and on the indexability of each data source [7,60].
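The effect of predicate pushdown on transferred data can be shown with a toy simulation. The `RemoteSource` class and its `supports_filter` capability flag are hypothetical stand-ins for a connector; no real engine API is implied.

```python
from typing import Callable, Optional

Row = dict[str, object]
Predicate = Callable[[Row], bool]


class RemoteSource:
    """Simulated remote store that counts how many rows cross the network."""

    def __init__(self, rows: list[Row], supports_filter: bool) -> None:
        self.rows = rows
        self.supports_filter = supports_filter
        self.rows_shipped = 0

    def scan(self, predicate: Optional[Predicate] = None) -> list[Row]:
        if predicate is not None and not self.supports_filter:
            raise NotImplementedError("source cannot evaluate filters")
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_shipped += len(out)
        return out


def fetch(source: RemoteSource, predicate: Predicate) -> list[Row]:
    """Push the filter down when the connector allows it, else filter at the mediator."""
    if source.supports_filter:
        return source.scan(predicate)
    return [r for r in source.scan() if predicate(r)]


rows = [{"id": i, "region": "EU" if i % 10 == 0 else "US"} for i in range(1000)]
is_eu = lambda r: r["region"] == "EU"

capable = RemoteSource(rows, supports_filter=True)
limited = RemoteSource(rows, supports_filter=False)

assert fetch(capable, is_eu) == fetch(limited, is_eu)  # same answer either way
print(capable.rows_shipped, limited.rows_shipped)       # 100 1000
```

Both paths return the same 100 rows, but the source without filter support ships ten times as much data to the mediator.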

Cost-based optimisation is also complicated by the absence of reliable and consistent

statistical data. Selectivity or cardinality estimation becomes difﬁcult in distributed envi-

ronments, particularly when heterogeneous sources have varying capacities for sampling

or provide metadata in different formats. Some systems use heuristic rules or runtime feed-

back to dynamically adjust their execution plans. Recent work explores learned estimation

and optimisation—e.g., learned cardinality estimators and reinforcement learning-based

join ordering—to adapt plans under uncertainty [72–74].
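To illustrate why unreliable statistics matter, the sketch below applies the classic textbook estimate for an equi-join under uniformity and containment assumptions, |R ⋈ S| ≈ |R|·|S| / max(V(R,a), V(S,a)), where V denotes the number of distinct join-key values. This formula is a standard baseline and is not taken from the learned estimators cited above; the table sizes are hypothetical.

```python
def estimate_join_rows(rows_r: int, rows_s: int, ndv_r: int, ndv_s: int) -> float:
    """Textbook equi-join cardinality estimate from row counts and distinct-value counts."""
    if min(rows_r, rows_s, ndv_r, ndv_s) <= 0:
        return 0.0
    return rows_r * rows_s / max(ndv_r, ndv_s)


# Accurate statistics: 1,000,000 orders referencing 50,000 customers.
accurate = estimate_join_rows(1_000_000, 50_000, ndv_r=50_000, ndv_s=50_000)

# Stale statistics from a remote source: distinct counts still report 5,000.
stale = estimate_join_rows(1_000_000, 50_000, ndv_r=5_000, ndv_s=5_000)

print(accurate)          # 1000000.0
print(stale)             # 10000000.0
print(stale / accurate)  # 10.0
```

A tenfold overestimate of this kind can be enough to flip a join order or a join algorithm, which is one motivation for runtime feedback and learned estimation.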

Engines like Presto/Trino and Apache Drill employ federated optimisers that ac-

count for connector-speciﬁc capabilities and support adaptive planning but still suffer

from slowdowns from remote-source latency, schema mismatches, and transformation

overheads [

]. Most recent work has explored machine learning optimisers whose

performance models are learned and can steer join orders and execution paths [74].

6.2. Materialisation and Caching Strategies

To reduce repeated query costs, federated systems increasingly rely on materialisation

and caching to reuse results. These approaches lead to lower latency, reduce the load on

source systems, and improve the predictability of performance in exploratory data analysis

and dashboard-style analytics use cases [75,76].

Materialised views are the results of precomputed queries that are stored in advance

and refreshed at regular intervals in some particular system—typically, a data warehouse or

analytical repository. They are useful for frequently accessed join paths, ﬁltered aggregates,

or derived measure paths known to have high computational costs when computed in real

time. However, the use of materialised views includes storage overhead, requires consis-

tency management, and could cause data to become stale if not periodically refreshed [

In query engines such as Presto/Trino or Dremio, result caching refers to storing

the results of completed or in-progress queries in transient storage or on solid-state

drives (SSDs). This technique significantly reduces the computation costs of similar queries.

Caching intermediate outputs through the reuse of the subquery results is a very efﬁcient

technique in use-case scenarios involving high query overlap, such as business intelligence

dashboards and multi-tenanted environments. Related systems show how to persist and

reuse intermediate results and sub-jobs effectively [7,77].
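A minimal sketch of result caching with time-based expiry follows. The `ResultCache` class is hypothetical and the key normalisation (collapsing whitespace) is deliberately naive; production engines typically key on a canonical plan rather than on query text.

```python
import time
from typing import Any, Callable


class ResultCache:
    """Cache query results by normalised SQL text, with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql: str) -> str:
        # Naive normalisation: collapse runs of whitespace only.
        return " ".join(sql.split())

    def get_or_compute(self, sql: str, run: Callable[[str], Any]) -> Any:
        key, now = self._key(sql), self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = run(sql)  # expensive call to the source systems
        self._store[key] = (now, result)
        return result


# Usage with a fake clock and a fake executor so the behaviour is deterministic.
now = [0.0]
executed: list[str] = []


def run_query(sql: str) -> list[tuple[str, int]]:
    executed.append(sql)
    return [("EU", 42)]


cache = ResultCache(ttl_seconds=60, clock=lambda: now[0])
cache.get_or_compute("SELECT region, COUNT(*) FROM sales GROUP BY region", run_query)
cache.get_or_compute("SELECT region,  COUNT(*)\nFROM sales GROUP BY region", run_query)
now[0] = 61.0  # entry is now older than the TTL
cache.get_or_compute("SELECT region, COUNT(*) FROM sales GROUP BY region", run_query)

print(cache.hits, cache.misses, len(executed))  # 1 2 2
```

The TTL is the simplest invalidation policy; it bounds staleness but does not react to source updates, which is the trade-off discussed in the remainder of this subsection.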

Maintenance of metadata related to schema deﬁnitions, statistical information about

columns, and access patterns is key to performance improvement. It allows for accelerated

query planning and reduces the overhead of repeated schema discovery typical with the

querying of file-based data stores like Parquet or JSON. Despite these advantages, caching

methods have to strike a delicate balance between data timeliness requirements,

storage costs related to those requirements, and cache invalidation problems. Datasets

undergoing perpetual and dynamic changes, especially those being updated in real time by

streaming or modiﬁed externally through transactions, require careful synchronisation pro-

tocols to be adopted. Modern engines also support materialised federation and incremental

maintenance to balance fast availability of cached views with on-demand ﬂexibility [

6.3. Data Freshness and Eventual Consistency Issues

Across integrated systems comprising a range of data stores, including relational

databases, NoSQL databases, ﬁles, APIs, and streaming systems, keeping the data fresh and

consistent is a key challenge. Analytical workﬂows often rely on data ingested or processed

asynchronously, which causes the different sources to become temporally misaligned [

Data freshness refers to how closely the consolidated view reflects the current state

of the source datasets. For periodic-batch ETL pipelines, the level of freshness depends

on the frequency of extraction and the update lag thereof. In streaming or near-real-time

systems, freshness depends on factors like ingestion latency, event processing delay, and the

effectiveness of the checkpointing system. Robust watermarks and event-time semantics

are important to quantify and bound lateness [

]. Users often have to trade off low

latency against system stability, especially in pipelines with complex transformations or

downstream dependencies.
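The role of watermarks in bounding lateness can be sketched as follows. The sketch assumes a simple heuristic watermark (maximum observed event time minus a fixed allowed lateness); the `WatermarkTracker` class is hypothetical and does not reproduce the API of any streaming engine.

```python
from datetime import datetime, timedelta
from typing import Optional


class WatermarkTracker:
    """Tracks a heuristic watermark: max event time seen minus allowed lateness."""

    def __init__(self, allowed_lateness: timedelta) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_event_time: Optional[datetime] = None

    @property
    def watermark(self) -> Optional[datetime]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def observe(self, event_time: datetime) -> bool:
        """Return True if the event is on time, False if it is older than the watermark."""
        wm = self.watermark
        is_late = wm is not None and event_time < wm
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return not is_late


tracker = WatermarkTracker(allowed_lateness=timedelta(minutes=5))
day = datetime(2025, 1, 1)
arrivals = [day.replace(hour=12, minute=m) for m in (0, 10, 7, 2)]
on_time = [tracker.observe(t) for t in arrivals]

print(on_time)            # [True, True, True, False]
print(tracker.watermark)  # 2025-01-01 12:05:00
```

The event stamped 12:07 arrives out of order but within the bound and is accepted; the event stamped 12:02 falls behind the watermark and must be routed to a late-data path.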

In distributed or federated scenarios, eventual consistency describes a situation wherein

changes to the data are not immediately reflected across all instances or replicas. It

occurs frequently with NoSQL stores like DynamoDB or Cassandra and with design

patterns involving asynchronous replication or microservices [

]. An update

to an order’s status in a speciﬁc service, for example, will not be immediately reﬂected on a

user analytics screen, nor will it be simultaneously available together with customer details

obtained from a separate repository.

Schemas that evolve over time, out-of-order data delivery (particularly in streaming

scenarios), and retry protocols that can cause event duplication or loss all amplify

consistency issues. For integrated systems to maintain analytical integrity, features

for deduplication, temporal windowing, handling of late or out-of-order data, and conflict

resolution must therefore be implemented. To address these challenges, more platforms are adopting

versioned data architectures like Apache Iceberg and Delta Lake, which provide capabilities

such as time travel, rollback, and reproducible querying support [

]. Observability and

data-quality veriﬁcation frameworks help monitor freshness and correctness in produc-

tion data pipelines [

]. Additionally, consistency and freshness considerations should

be looked upon as design principles rather than just operational issues. Well-performing

analytical systems should set up Service-Level Agreements (SLAs) for data timeliness,

as well as governance patterns that cover staleness thresholds, revision policies, and

high-level strategies to build confidence in analytical outcomes. Figure 7 synthesises key performance

challenges in heterogeneous analytics and maps them to optimisation levers and mitigating

strategies across engines and storage layers.
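Two of the features named above, deduplication and conflict resolution, can be sketched together. The example assumes that every event carries a unique `event_id` and an `event_time`, and it resolves conflicts with a last-write-wins rule on event time; the field names and the rule are illustrative assumptions, not a prescription.

```python
from typing import Any

Event = dict[str, Any]


def apply_events(events: list[Event]) -> dict[int, str]:
    """Fold an unordered, possibly duplicated event stream into current order status."""
    seen_ids: set[str] = set()
    latest: dict[int, Event] = {}
    for ev in events:
        if ev["event_id"] in seen_ids:
            continue  # duplicate delivery caused by a retry
        seen_ids.add(ev["event_id"])
        current = latest.get(ev["order_id"])
        # Last-write-wins on event time, so arrival order does not matter.
        if current is None or ev["event_time"] > current["event_time"]:
            latest[ev["order_id"]] = ev
    return {order_id: ev["status"] for order_id, ev in latest.items()}


stream = [
    {"event_id": "e1", "order_id": 1, "event_time": 1, "status": "placed"},
    {"event_id": "e3", "order_id": 1, "event_time": 3, "status": "shipped"},
    {"event_id": "e3", "order_id": 1, "event_time": 3, "status": "shipped"},  # retry duplicate
    {"event_id": "e2", "order_id": 1, "event_time": 2, "status": "paid"},     # arrives late
]

print(apply_events(stream))  # {1: 'shipped'}
```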

Figure 7. Performance levers for heterogeneous analytics: optimisation, caching, and fresh-

ness/consistency.

6.4. Federation Overhead and Performance Tuning

Federated access in data lakes and polystores is concrete and implementation-speciﬁc.

For example, Ontario and Squerall execute federated query processing over semantic data

lakes by decomposing a SPARQL input into subqueries per dataset, then translating each

subquery to the target system (e.g., Spark SQL for TSV/HDFS) using dataset proﬁles and

rules. Squerall retrieves from CSV/Parquet, MySQL, Cassandra, and MongoDB through

a mediator (high-level ontologies) and ships data via connectors (two implementations:

Spark and Presto) before joining into the ﬁnal result [5].
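The decompose, translate, ship, and join pattern described above can be sketched with in-memory stand-ins for the sources. The source names, the `run_subquery` helper, and the data are hypothetical; this is not the Ontario or Squerall API, and the sketch skips SPARQL parsing and dialect translation entirely.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]

# Hypothetical sources standing in for a relational store and a document store.
SOURCES: dict[str, list[Row]] = {
    "mysql.orders": [
        {"order_id": 1, "customer_id": 10, "total": 99.0},
        {"order_id": 2, "customer_id": 11, "total": 15.5},
        {"order_id": 3, "customer_id": 10, "total": 42.0},
    ],
    "mongo.customers": [
        {"customer_id": 10, "country": "GR"},
        {"customer_id": 11, "country": "DE"},
    ],
}


def run_subquery(source: str, columns: list[str],
                 where: Optional[Callable[[Row], bool]] = None) -> list[Row]:
    """Evaluate one per-source subquery: filter, then project, at the source."""
    rows = SOURCES[source]
    if where is not None:
        rows = [r for r in rows if where(r)]
    return [{c: r[c] for c in columns} for r in rows]


def mediator_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Hash join of shipped subresults at the mediator."""
    index: dict[Any, list[Row]] = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Decomposition: one subquery per source, with the selective filter kept at the source.
customers = run_subquery("mongo.customers", ["customer_id", "country"],
                         where=lambda r: r["country"] == "GR")
orders = run_subquery("mysql.orders", ["order_id", "customer_id", "total"])
result = mediator_join(orders, customers, key="customer_id")

print([(r["order_id"], r["country"], r["total"]) for r in result])
# [(1, 'GR', 99.0), (3, 'GR', 42.0)]
```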

In a broader integration survey, federated query answering is explicitly deﬁned as “a

consistent way of accessing data from sources without duplicating them in a central repos-

itory”, achieved “by using sub-queries that target the data sources within the federation

and evaluating their results based on predeﬁned rules” [

]. These concrete mechanisms

surface where overhead arises: (i) connector capability skew (e.g., which operators can be

translated/pushed and with what plan quality), (ii) planning under partial or per source

metadata (Ontario “uses the proﬁles to generate subqueries” and “uses metadata... to

generate optimised query plans” [

]), and (iii) movement/serialisation when subresults are

shipped back to the mediator for ﬁnal assembly.

In practice, the choice of connector matters: the same mediator (Squerall) reports two

runtime stacks “with different data connectors: Spark and Presto” [

], anticipating distinct

pushdown, transfer, and scheduling behaviour. At the orchestration layer, LLM interfaces

increasingly appear in pipelines. However, even strong models show execution gaps in

data tasks (e.g., GPT-4 text-to-SQL execution accuracy of 54.89% vs. human 92.96%), which

cautions against uncritical delegation of query planning/translation to LLMs [19].

Tuning, in turn, follows those concrete pain points.

• First, connector-aware pushdown is not optional but infrastructural: Ontario’s use

of dataset profiles and Squerall’s mediator mapping illustrate that federation layers

must know source capabilities to drive translation and decide where to execute

selections/aggregations/joins [5,11].

• Second, planning with partial statistics can still be effective if the mediator exploits

metadata to derive good subquery decompositions and join orders (Ontario “uses

metadata... to generate optimised query plans” [

]) and, when available, sampling or

progressive execution to reﬁne estimates (see also systematisations in [6]).

• Third, movement minimisation is a transport problem: the choice of

columnar/vectorised paths and batching reduces per-tuple overhead. Contemporary

evaluations of columnar runtimes and data paths emphasise the sustained throughput

advantages of vectorised processing and columnar layouts for scans and aggregations [1,2].

• Fourth, materialisation and incremental maintenance mitigate repeated cross-source

joins. Rather than fully recomputing federated joins/aggregations, incremental

frameworks (e.g., DBSP) maintain views by applying deltas to compiled differential

programs, reducing refresh latency and source load in steady state [76].

• Finally, hybridisation, that is, persisting “hot” integrated slices (lakehouse/warehouse)

while federating the long tail, follows the storage–execution split documented across

recent lakehouse discussions [5,82].
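The idea behind the fourth lever, applying deltas instead of recomputing, can be sketched for a single grouped sum. This is not DBSP or any specific framework: the `IncrementalSumView` class is hypothetical and handles only a linear aggregate, where each change is a row with weight +1 (insert) or -1 (delete).

```python
from collections import defaultdict

Delta = tuple[str, float, int]  # (group key, amount, weight: +1 insert / -1 delete)


class IncrementalSumView:
    """Maintains SUM(amount) GROUP BY key by applying weighted deltas."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)
        self.counts: dict[str, int] = defaultdict(int)

    def apply(self, deltas: list[Delta]) -> None:
        for key, amount, weight in deltas:
            self.totals[key] += weight * amount
            self.counts[key] += weight
            if self.counts[key] == 0:
                # Group has no rows left: drop it from the view.
                del self.totals[key]
                del self.counts[key]

    def snapshot(self) -> dict[str, float]:
        return dict(self.totals)


def full_recompute(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Reference result: recompute the aggregate from scratch."""
    out: dict[str, float] = defaultdict(float)
    for key, amount in rows:
        out[key] += amount
    return dict(out)


view = IncrementalSumView()
view.apply([("EU", 10.0, +1), ("EU", 5.0, +1), ("US", 7.0, +1)])
view.apply([("EU", 5.0, -1), ("US", 7.0, -1)])  # two rows deleted at the sources

print(view.snapshot())                  # {'EU': 10.0}
print(full_recompute([("EU", 10.0)]))   # {'EU': 10.0}
```

The second refresh touches two delta rows rather than rescanning the sources, which is the source of the reduced refresh latency and source load in steady state.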

Table 11 maps these overheads to tuning levers with the precise loci (translation,

planning, transfer, and maintenance) where they act.

Table 11. Concrete federation overheads and tuning levers grounded in reported systems.

Overhead Locus (Where It Appears) | Tuning Lever (How It Is Mitigated)
Connector translation gaps and heterogeneous engines (Spark/Presto variants in Squerall) | Connector-aware planning and dialect rewriting and per connector rules/pushdown (Ontario’s profile-driven subqueries and Squerall’s mediator mapping) [5].
Partial statistics; mediator lacks global distributions | Metadata-guided plan generation (Ontario uses metadata to generate optimised plans), progressive refinement, and survey-catalogued strategies [5,6].
Row-oriented transfer and fine-grained serialisation | Vectorised/columnar paths and batching for sustained scan/aggregation throughput [1,2].
Repeated cross-source joins; freshness vs. latency | Materialised views of “hot” joins with incremental refresh (differential/delta maintenance) [76].
Orchestration via LLMs (NLQ/translation) | Guardrails: verified translations and fallbacks; LLM use where determinism is not critical (noting 54.89% text-to-SQL execution accuracy for GPT-4) [19].
Workload skew across sources | Hybridisation (persist stable, high-value slices in lakehouse/warehouse and federate the remainder) [5,11,82].

Case focus (AAS–ECLASS industrial federation). In a manufacturing integration

where AAS submodels act as the mediator and ECLASS serves as the external dictionary,

the authors of [

] report two very speciﬁc performance levers. (i) Blocking to cut candidate

space: Before pairwise matching, the system narrows candidates via ANN over embed-

dings (open-source SFR-Embedding-Mistral) with Faiss. In their AAS–ECLASS setting, the

dictionary spans 27,423 entries, so blocking is operationally decisive for both compute and

downstream join fan-out. (ii) Classifier choice as a speed/accuracy knob: a fine-tuned generative

LLM achieves slightly better results, whereas an encoding-based classiﬁer enables much

faster inference, and the ﬁne-tuned LLM surpasses BERT variants and GPT-4+ICL on

entity-matching benchmarks.
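The blocking step can be illustrated with a brute-force stand-in. The reported system uses approximate nearest-neighbour search with Faiss over learned embeddings; the sketch below instead performs exact cosine top-k in plain Python over hand-made three-dimensional vectors, and the dictionary entry names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def block(query_vec: list[float], dictionary: dict[str, list[float]], k: int) -> list[str]:
    """Return the k dictionary entries most similar to the query (exact search)."""
    ranked = sorted(dictionary.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings; a real dictionary would hold thousands of entries.
dictionary = {
    "max_rotation_speed": [0.9, 0.1, 0.0],
    "nominal_voltage": [0.0, 0.2, 0.95],
    "rated_speed": [0.8, 0.3, 0.1],
    "housing_colour": [0.1, 0.9, 0.2],
}
query = [1.0, 0.2, 0.0]  # embedding of a submodel property such as "rotational speed"

candidates = block(query, dictionary, k=2)
print(candidates)  # ['max_rotation_speed', 'rated_speed']
```

Only the returned candidates proceed to the more expensive pairwise classifier, so the number of classifier calls per property drops from the dictionary size to k.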

In the ER stage feeding federation, the authors of [

] show that batching demonstrations

and questions (BATCHER) is very cost-effective for ER, outperforming both

ﬁne-tuned PLMs and manually designed LLM prompting. This directly trims external

API overhead and stabilises latency. With respect to LLMs driving orchestration or NLQ,

another work [

] documents execution-level gaps (GPT-4 text-to-SQL of 54.89%), arguing

for deterministic translation paths or veriﬁcation stages in production. Together with

mediator-level pushdowns (Ontario/Squerall) and incremental materialisation (DBSP),

these concrete techniques reduce shipped data, avoid misplaced computation, and keep

the AAS federation responsive under heterogeneous source capabilities [5,11,76].
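A verification stage for machine-generated SQL can be as simple as compiling the candidate against the target schema before it is allowed to run. The sketch below uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN` to do this; the schema, the `verify_sql` helper, and the read-only rule are illustrative assumptions, and a check of this kind confirms only that a query compiles, not that it answers the user's question.

```python
import sqlite3


def verify_sql(conn: sqlite3.Connection, candidate: str) -> bool:
    """Accept only a single read-only SELECT that compiles against the schema."""
    stripped = candidate.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        return False  # naive read-only guard
    try:
        # EXPLAIN compiles the statement without running the query itself.
        conn.execute("EXPLAIN " + stripped)
    except (sqlite3.Error, sqlite3.Warning):
        return False  # unknown table/column, syntax error, or multiple statements
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")

good = verify_sql(conn, "SELECT status, COUNT(*) FROM orders GROUP BY status;")
bad_column = verify_sql(conn, "SELECT state FROM orders")
not_read_only = verify_sql(conn, "DROP TABLE orders")

print(good, bad_column, not_read_only)  # True False False
```

Candidates that fail the check can be routed to a deterministic translation path or back to the model with the error message, which keeps unverified statements away from the sources.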

7. Applications and Case Studies

This section grounds the concepts in practice (applications), including enterprise

lake/lakehouse deployments, scientiﬁc data integration, and public sector pipelines. It

highlights domain-speciﬁc constraints (e.g., governance and standards) and architectural

patterns that recur, showing how the reviewed methods translate into outcomes. Table 12

presents a cross-walk from the taxonomy to the three abovementioned application domains,

indicating for each taxonomy element where it is instantiated (“where used”) and where

it is analysed in the text (“where discussed”). This makes the taxonomy operational and

allows readers to locate concrete occurrences and the corresponding discussion quickly.

Table 12. Taxonomy→applications cross-walk (where used and where discussed).

Taxonomy Element | Applications (Role and Where in Text)
Schema matching and mapping (Section 2.2) | Harmonise identifiers/attributes across sources. Enterprise lakehouse keys via SQL/ELT (Section 7.1). Public registries (Section 7.3).
Entity resolution and fusion (Section 2.3) | Deduplicate/link records. Unified entities. Enterprise CRM+transactions (Section 7.1). Public person/org linkage (Section 7.3).
Semantic enrichment and ontologies (Section 5.2) | Disambiguation of meaning, standards-based queries, scientific knowledge graphs (Section 7.2), and Public SDMX alignment (Section 7.3).
Metadata catalogues and lineage (Sections 5.1 and 5.3) | Discoverability, governance, reproducibility, enterprise governance (Section 7.1), and scientific provenance (Section 7.2).
Storage models (row/column/NoSQL; Section 3) | Fit workloads, hybrid query plans, enterprise columnar lakehouse, and scientific/public doc/graph as adjunct (Sections 7.1–7.3).
Federated SQL and virtualisation (Section 4.2) | Cross-store analytics without relocation, enterprise Trino-based joins (Section 7.1), and public inter-agency dashboards (Section 7.3).
Schema-on-read/write and hybrid (Section 4.3) | Contracts vs. flexibility, canonicalisation (e.g., dates), and public regulated pipelines (Section 7.3).
Performance levers (Section 6) | Cost/latency optimisation, freshness, dashboards, and SLAs (Section 7).

7.1. Enterprise Data Lakes and Lakehouses

In large companies, the need for the consolidation of heterogeneous internal and

external data silos has compelled the large-scale adoption of data lakes and lakehouse

platforms. Traditional data warehouses are limited by rigid schemas, high scaling costs,

and tedious loading. Data lakes, on the other hand, provide for the integration of raw, semi-

structured, and structured data from various business areas, such as customer relationship

management systems, transaction logs, sensor data, and external service providers, into

horizontally scaled object storage solutions like Amazon S3, Azure Data Lake Storage, or

Google Cloud Storage [5].

To move beyond simple data storage options, an increasing number of organisations are

adopting lakehouses—uniﬁed platforms leveraging the natural ﬂexibility of data lakes with

added functionalities characteristic of data warehouses like ACID transactions, temporal

capabilities, and schema enforcement policies. Products like Delta Lake [

], Apache Hudi,

and Iceberg [

] enable end-to-end querying, incrementally updated data, and support for

schema evolution over large datasets [13].

One such example is a worldwide retail company that aggregates sales, inventory,

customer feedback, and supply chain data from over 50 systems within an enterprise-wide

analytical framework. Utilising Apache Spark for distributed computation, Presto [

]

for federated querying, and Apache Atlas for metadata management, the company enables

batch and real-time analytics with traceability and governance across multiple regions

and departments.

7.2. Scientiﬁc Data Integration

Scientiﬁc ﬁelds often operate at the frontier of data integration, which requires the

integration of heterogeneous datasets across spatial, temporal, semantic, and modality

aspects. For applications like life sciences, environmental science, physics, and social

sciences, researchers acknowledge the need to integrate a range of sources like experimen-

tal observations, sensor readings, simulation software, domain ontologies, and research

articles. These sources tend to vary in their forms, granularity, semantic material, and

frequency of update, presenting a major challenge with respect to integration for possible

subsequent usage.

To address this problem, modern scientific data infrastructure increasingly relies

on methodologies that include semantic integration, standardised metadata frameworks,

and distributed computing systems [

]. To ensure conceptual consistency in datasets,

ontologies and controlled vocabularies are used, thereby improving the accuracy of align-

ment and aggregation of linked variables. Various tools, such as ontology mapping engines,

semantic catalogues of data, and knowledge graphs, are utilised continuously to support

the interlinking of data in various repositories, instruments, and organisations.

In addition, several disciplines have adopted modular, workflow-based systems

for data ingestion, annotation, and transformation processes. These systems support

reproducible analyses, versioning, and shared curation—key attributes in domains where

datasets change over time and that involve large, distributed user bases. The widespread

adoption of cloud-native storage formats (e.g., Parquet and Zarr [

]), spatio-temporal

indexing, and interoperable Application Programming Interfaces (APIs) supports greater

scalability for querying and integration of data from a wide variety of structured and

unstructured sources.

Data integration pipelines enable many uses, including genomic discovery, climate

modelling, astronomy, and disease surveillance. The generated results are often determined

not only by the volume or speed of data but also by semantic matching effectiveness, prove-

nance annotation, and contextualisation of relevance to a particular ﬁeld—highlighting the

essential need for strong, ﬂexible integration systems to fuel scientiﬁc understanding.

7.3. Cross-Domain Data Pipelines in Public Sector Analytics

Public sector organisations are increasingly implementing standardised data platforms

to enable evidence-based policymaking, improve the management of service delivery, and

ensure transparency. An important feature of analytics in public sector organisations is

the need to consolidate data from diverse silos across the organisation, such as education,

health, employment, tax, and mobility.

For instance, a national statistical ofﬁce might bring together data from censuses, hos-

pitalisation rates, school performance measures, and social programs’ enrolment rates for

the purpose of understanding disparities or designing certain interventions. Data sources

can have varying identiﬁers, structures, frequencies of update, and access-mandated legal

restrictions. Data integration methods must be sensitive to the needs for anonymisation,

possible errors in data correspondence, and auditability needs, which are often controlled

through strict data protection legislation (e.g., GDPR) [87].
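One common building block for linking records across agencies without exchanging raw identifiers is keyed pseudonymisation. The sketch below uses HMAC-SHA256 from Python's standard library; the key, the identifier format, and the normalisation rule are hypothetical. Note that pseudonymised data generally remain personal data under GDPR, so this step supports, but does not replace, anonymisation and access controls.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash (HMAC-SHA256)."""
    normalised = identifier.strip().upper()  # assumed normalisation rule
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical linkage key held by a trusted party; never stored with the data.
KEY = b"example-linkage-key"

health = [{"person": pseudonymise("ab123456 ", KEY), "admissions": 2}]
education = [{"person": pseudonymise("AB123456", KEY), "school_years": 12}]

# Link the two datasets on the pseudonym instead of the raw identifier.
by_person = {row["person"]: row for row in education}
linked = [
    {**h, **by_person[h["person"]]}
    for h in health
    if h["person"] in by_person
]

print(len(linked), linked[0]["admissions"], linked[0]["school_years"])  # 1 2 12
```

Because the hash is keyed, a party without the key cannot recompute pseudonyms from known identifiers, which is the main advantage over a plain unsalted hash.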

To support effective integration pipeline management, a range of tools such as Open-

Reﬁne [

], CKAN, and custom data warehouses with lineage traceability and role-based

access controls are used. Additionally, using a semantic standard such as the Statistical

Data and Metadata Exchange (SDMX) [

], together with linked data-based methodologies,

supports consistency of definitions across different agencies.

Interoperability with open data portals, city virtualisations, and peer-to-peer dash-

boards between and among different agencies is increasingly reliant upon integration

platforms that have real-time capabilities. Such platforms combine

dynamic datasets such as traffic flow patterns, energy consumption, and pollution

levels with stable statistical indicators. These conditions highlight the imperative to develop

data management policies that are operational across technical, institutional, and legal

levels, pursuing a harmonious balance of interoperability, governance, and scalability.

Figure 8provides an architectural comparison of integration strategies across eight layers,

showing how they are instantiated across enterprise, scientific, and public sector pipelines.

Figure 8. Architectural comparison across eight layers (ingestion, storage, processing/query, metadata/lineage, semantics/standards, tools and enablers, access and use, and governance/policy) for the enterprise, scientific, and public sector integration contexts. The figure situates representative technologies and practices in each layer.

8. Future Directions

Future work is expected to centre on analytically sound frameworks, metadata-rich and self-describing workflows, and AI-enriched integration methods such as automatic mapping, entity-relationship modelling, and semantic reasoning. These can reduce manual effort while preserving human supervision. This section outlines a strategic blueprint for building more composable, governable, and intelligent data platforms.

8.1. Towards Uniﬁed Analytical Fabrics

Ongoing innovation in integration and storage demands unified analytical frameworks that enable transparent access, governance, and processing of data assets. A major goal of these frameworks is to combine the benefits of data warehouses, data lakes, and operational stores. This is achieved through a common query and metadata layer that is independent of data format, geographic distribution, and rate of data movement [13,14,90].
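
The idea of a common query and metadata layer can be sketched in a few lines of Python. The example below is a toy illustration under stated assumptions, not a description of any cited system: the `Catalog` and `CatalogEntry` classes, the `READERS` registry, and the in-memory `payload` strings (which stand in for files or object-store locations) are all hypothetical.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def read_csv(payload: str) -> Iterator[Row]:
    """Yield rows from CSV text as dicts of strings."""
    yield from csv.DictReader(io.StringIO(payload))


def read_jsonl(payload: str) -> Iterator[Row]:
    """Yield rows from JSON Lines text, normalising values to strings."""
    for line in payload.splitlines():
        if line.strip():
            yield {k: str(v) for k, v in json.loads(line).items()}


# Format-specific readers hidden behind one logical interface.
READERS: Dict[str, Callable[[str], Iterator[Row]]] = {"csv": read_csv, "jsonl": read_jsonl}


@dataclass
class CatalogEntry:
    fmt: str      # physical format, resolved through READERS
    payload: str  # stands in for a file path or object-store URI
    owner: str    # minimal governance metadata


class Catalog:
    """Maps logical dataset names to physical entries."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, name: str, entry: CatalogEntry) -> None:
        if entry.fmt not in READERS:
            raise ValueError(f"no reader for format {entry.fmt!r}")
        self._entries[name] = entry

    def scan(self, name: str, predicate: Callable[[Row], bool] = lambda r: True) -> List[Row]:
        """Run the same filter regardless of how the dataset is stored."""
        entry = self._entries[name]
        return [row for row in READERS[entry.fmt](entry.payload) if predicate(row)]
```

A caller issues `catalog.scan("lake.sales", predicate)` without knowing whether the dataset is stored as CSV or JSON Lines; only the catalogue entry changes when the physical layout does.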

Future analytical fabrics will support hybrid execution models that combine batch, streaming, and federated operations. They will employ declarative metadata models to configure workflows automatically and to improve runtime performance. Rather than building monolithic platforms, organisations will move towards composable architectures with modular components for data ingestion, cataloguing, transformation, and access control, connected through open standards and APIs [ ].

Commercial offerings and open-source projects, including Dataplex (Google), Data Fabric (IBM), and lakeFS, are at the forefront of this space. Subsequent frameworks are expected to add further features, including data lineage tracing, fine-grained access control, real-time monitoring, and multi-cloud capabilities. These capabilities will matter most to decentralised organisations and collaborative institutions, which require seamless and secure integration across geopolitical and technical boundaries [10].

8.2. Metadata-Driven and Self-Describing Pipelines

To counter the brittleness and labour-intensive character of existing integration processes, the future of data management hinges on pipelines that are both self-describing and metadata-driven. Such next-generation pipelines can autonomously infer, propagate, and adapt to changes in schema and data profile using integrated metadata and declarative policies [83,91].

In this context, metadata is no longer an auxiliary resource but an essential element of end-to-end pipeline design. Pipeline construction becomes deliberately schema-aware, context-adaptive, and managed under version control. Tools like dbt, Apache Hop, and Tecton enable developers to express partially declarative pipeline definitions that adapt to source schemas, data quality measures, and business rule logic [91].
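
A minimal sketch of a metadata-driven pipeline step is given below. It is not modelled on dbt, Apache Hop, or Tecton; the `PIPELINE_SPEC` structure, the `CASTS` table, and the `run_pipeline` function are hypothetical names chosen for illustration. The point is that the column list, types, and required flags live in a declarative specification, so a change in the source schema is handled by editing metadata rather than code.

```python
from typing import Any, Callable, Dict, List, Tuple

# Type names used in the specification, mapped to Python casts.
CASTS: Dict[str, Callable[[str], Any]] = {"int": int, "float": float, "str": str}

# Declarative description of the expected input; hypothetical example data.
PIPELINE_SPEC = {
    "source": "enrolments",
    "columns": {
        "student_id": {"type": "str", "required": True},
        "age": {"type": "int", "required": True},
        "score": {"type": "float", "required": False},
    },
}


def run_pipeline(spec: dict, rows: List[Dict[str, str]]) -> Tuple[list, list]:
    """Cast and validate rows purely from the metadata in `spec`.

    Returns (accepted_rows, rejected), where each rejected item is a
    (raw_row, list_of_error_messages) pair.
    """
    accepted, rejected = [], []
    for row in rows:
        out, errors = {}, []
        for name, meta in spec["columns"].items():
            raw = row.get(name)
            if raw is None or raw == "":
                if meta["required"]:
                    errors.append(f"missing {name}")
                else:
                    out[name] = None
                continue
            try:
                out[name] = CASTS[meta["type"]](raw)
            except ValueError:
                errors.append(f"bad {meta['type']} in {name}: {raw!r}")
        if errors:
            rejected.append((row, errors))
        else:
            accepted.append(out)
    return accepted, rejected
```

Adding a column or relaxing a `required` flag in `PIPELINE_SPEC` changes the behaviour of `run_pipeline` without touching its body, which is the property the declarative approach aims for.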

Self-describing data formats such as Parquet, Avro, and Arrow, which carry their schema alongside the data, make this progress possible by allowing pipelines to inspect and verify data dynamically at runtime. Prior evaluations of columnar file formats reveal trade-offs between Parquet and ORC. Additionally, data contracts, that is, structured agreements between producers and consumers that specify structure, semantics, and SLAs, reinforce this development [2,41,92].
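
The interplay between a self-describing payload and a data contract can be sketched as follows. The JSON envelope with an embedded `schema` block is only a stand-in for formats that carry their schema (it is not the Parquet, Avro, or Arrow API), and the `Contract` class, the field names, and the staleness threshold are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Contract:
    """Producer/consumer agreement: expected fields and a freshness SLA."""
    name: str
    fields: Dict[str, str]   # field name -> declared type
    max_staleness_s: int     # maximum accepted age of a message, in seconds


def check_contract(contract: Contract, message: str, now_s: int) -> List[str]:
    """Compare the schema embedded in `message` with the contract.

    `message` is a JSON document with the keys 'schema', 'produced_at_s',
    and 'records'. Returns a list of violations (empty if compliant).
    """
    doc = json.loads(message)
    embedded = doc["schema"]
    violations = []
    for field_name, expected in contract.fields.items():
        actual = embedded.get(field_name)
        if actual is None:
            violations.append(f"missing field: {field_name}")
        elif actual != expected:
            violations.append(f"type mismatch for {field_name}: {actual} != {expected}")
    if now_s - doc["produced_at_s"] > contract.max_staleness_s:
        violations.append("stale data")
    return violations
```

Because the check reads only the embedded schema and a timestamp, it can run before any record is processed, which is where a contract violation is cheapest to catch.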

In the future, automated testing, schema drift detection, lineage impact estimation, and semantic reconciliation are expected to be built-in capabilities of data pipelines. That shift should promote reuse, modularity, and resilience, which are the key properties for long-term maintainability of complex analytical frameworks [83,91].
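
Two of the capabilities listed above, schema drift detection and lineage impact estimation, can be sketched together. The functions `detect_drift` and `impacted` and the example dataset names are hypothetical; the lineage graph is assumed to be available as a simple mapping from each node to its direct downstream consumers.

```python
from collections import deque
from typing import Dict, Iterable, List


def detect_drift(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify differences between two schemas (column name -> type)."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


def impacted(lineage: Dict[str, List[str]], changed: Iterable[str]) -> List[str]:
    """Return every node reachable downstream of the changed nodes.

    `lineage` maps a node to its direct downstream consumers; the search
    is a breadth-first traversal, so each node is visited at most once.
    """
    seen, queue = set(), deque(changed)
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

A pipeline could feed the `removed` and `retyped` columns from `detect_drift` into `impacted` to list the downstream assets that need review before a schema change is accepted.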

8.3. AI-Assisted Integration and Auto-Schema Mapping

One of the most prominent and challenging areas of exploration is the use of AI to increase the automation of data integration processes. Schema mapping, ER, and the definition of transformation rules depend heavily on manual intervention and are therefore prone to error and limited in scalability. Natural language modelling, representation learning, and foundation models have opened new possibilities for context-aware, intelligent assistance with integration processes [90].

AI-based tools can suggest field mappings, draft transformation scripts, and identify semantic relationships by analysing schemas, instance values, and external knowledge graphs. Specific tools such as AutoMapper, Google Cloud Dataprep, and those built on OpenAI Codex can perform interactive mapping and transformation in response to natural language commands [90].
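
A minimal sketch of mapping suggestion with a human review step is shown below. It scores candidate field pairs by cosine similarity over character-trigram count vectors, which is a deliberately simple stand-in for learned embeddings and not the method of any tool named above. The function names and the `accept` and `review` thresholds are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def trigram_vector(name: str) -> Counter:
    """Represent a field name as counts of its character trigrams."""
    text = f"  {name.lower().replace('_', ' ')} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_mappings(
    source_fields: List[str],
    target_fields: List[str],
    accept: float = 0.6,
    review: float = 0.3,
) -> List[Tuple[str, Optional[str], float, str]]:
    """Suggest one target per source field as (source, target, score, status).

    Scores at or above `accept` are marked 'auto', scores between `review`
    and `accept` are routed to a person as 'needs_review', and anything
    lower is left 'unmapped' with no target.
    """
    targets = {t: trigram_vector(t) for t in target_fields}
    suggestions = []
    for source in source_fields:
        if not targets:
            suggestions.append((source, None, 0.0, "unmapped"))
            continue
        vec = trigram_vector(source)
        best, score = max(((t, cosine(vec, tv)) for t, tv in targets.items()), key=lambda p: p[1])
        if score >= accept:
            status = "auto"
        elif score >= review:
            status = "needs_review"
        else:
            status = "unmapped"
        suggestions.append((source, best if status != "unmapped" else None, round(score, 2), status))
    return suggestions
```

The middle band between the two thresholds is where human supervision enters: the system proposes a mapping but does not apply it until a reviewer confirms it.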

Furthermore, embedding-based matching methods, such as those built on BERT or graph embeddings, offer effective solutions for reconciling heterogeneous schemas, especially in applications involving inconsistent labelling or heterogeneous data forms. Additionally, integrating active learning into an interface that incorporates human feedback can improve mapping accuracy through user contributions [ ]. From a pipeline perspective, large language models (LLMs) are increasingly positioned as programmable interfaces for data pipelines, working together with knowledge graphs (KGs), explainable AI (XAI), and AutoML to mediate discovery, transformation, and governance [ ]. Early orchestration prototypes further show LLM-assisted DAG synthesis for data enrichment pipelines [ ], indicating a path from point tools to agentic, metadata-aware integration flows.
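
If a pipeline DAG is proposed by a model rather than written by hand, it is prudent to validate it before execution. The sketch below makes no assumption about how the proposal is generated; it only checks a proposed dependency mapping against an allow-list of step names and rejects cycles, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The step names and the `plan_from_proposal` function are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List

# Steps the orchestrator knows how to run; anything else is rejected.
ALLOWED_STEPS = {"ingest", "profile", "match_schema", "enrich", "validate", "publish"}


def plan_from_proposal(proposal: Dict[str, List[str]]) -> List[str]:
    """Turn a proposed DAG (step -> upstream steps) into an execution order.

    Raises ValueError if the proposal names an unknown step or contains
    a cycle, so that a malformed proposal is never executed.
    """
    mentioned = set(proposal) | {dep for deps in proposal.values() for dep in deps}
    unknown = mentioned - ALLOWED_STEPS
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    try:
        return list(TopologicalSorter(proposal).static_order())
    except CycleError as exc:
        raise ValueError(f"proposal contains a cycle: {exc.args[1]}") from exc
```

Keeping the allow-list and the cycle check outside the generating model means that the automated component can suggest a plan while the orchestrator retains control over what may actually run.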

As foundation models evolve for structured data manipulation, future systems are expected to ingest heterogeneous datasets and identify their internal structure and semantics. They may automatically generate integration frameworks, quality assessments, and domain-specific interpretations. This could substantially accelerate insight extraction and reduce the effort involved in complex data integration projects [90].

However, concerns about explainability, control, bias, and governance remain. AI systems should be designed to augment human capacity rather than replace it, incorporating transparency, auditability, and the possibility of human intervention across all automated decision-making systems.

8.4. Ethical and Regulatory Directions

Beyond the challenges inherent to architecture and performance, future integration systems will also need to treat ethical and regulatory requirements as part of their design. With regard to data privacy and security, Regulation (EU) 2016/679 (GDPR) provides a critical framework, requiring data processors and controllers to meet accountability standards, respect data subject rights (e.g., erasure and data portability), and comply with purpose limitation in all processing operations, including integration [ ]. Moreover, the recently adopted EU Artificial Intelligence Act (Regulation (EU) 2024/1689) sets mandatory rules for AI systems, requiring transparency, risk assessment, and human oversight, which are matters of particular relevance to integration involving learned or generative schema matching or entity resolution techniques [ ]. In addition, domain-specific laws (e.g., HIPAA for healthcare and PSD2 for financial services) place special limits on the use and sharing of integrated data.
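
How purpose limitation and erasure might surface inside an integration layer can be sketched as follows. This is an engineering illustration under simplifying assumptions and not an interpretation of the regulations: the `Dataset` class, the purpose labels, and the functions `can_integrate` and `erase_subject` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Dataset:
    name: str
    purposes: FrozenSet[str]  # purposes for which processing is permitted
    rows: List[Dict[str, str]] = field(default_factory=list)


def can_integrate(datasets: List[Dataset], purpose: str) -> bool:
    """Allow a join only if every input permits the requested purpose."""
    return all(purpose in d.purposes for d in datasets)


def erase_subject(datasets: List[Dataset], subject_key: str, subject_id: str) -> Dict[str, int]:
    """Remove one subject's rows from every dataset.

    Returns the number of rows removed per dataset, which can be kept as
    evidence that the erasure request reached each integrated store.
    """
    removed = {}
    for d in datasets:
        before = len(d.rows)
        d.rows = [r for r in d.rows if r.get(subject_key) != subject_id]
        removed[d.name] = before - len(d.rows)
    return removed
```

Checking the purpose before the join, rather than after, keeps data that may not be combined from ever being materialised together.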

From both a technical and an ethical standpoint, the propagation of bias through integration pipelines poses serious risks. For instance, schema matching or entity resolution driven by a model trained on a biased dataset may inadvertently retain and propagate that bias. Addressing such risks requires combining explainability and verification with human oversight and auditable mechanisms, particularly for transformations produced by automated inference [ ]. To build trust and accountability, future integration systems should include "governance-aware" capabilities, including fine-grained lineage tracking, audit history, and human oversight at critical decision points. Such design principles aim not only to strengthen transparency and ensure compliance by design but also to reconcile technological advances with legal and societal responsibility and, ultimately, to combine interoperability, scalability, and robust governance.
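
The "governance-aware" capabilities described above can be illustrated with a small sketch that combines a tamper-evident audit history with a human approval gate. The `AuditLog` class and the `apply_mapping` function are hypothetical, and hash chaining with SHA-256 is one possible way to make an audit history verifiable, assumed here for illustration.

```python
import hashlib
import json
from typing import Optional


class AuditLog:
    """Append-only log in which each entry commits to the previous one."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries = []

    def append(self, event: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self.entries.append({"event": event, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            body = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def apply_mapping(log: AuditLog, mapping: dict, inferred: bool, approved_by: Optional[str] = None) -> bool:
    """Apply a mapping only if it is hand-written or a person approved it.

    Inferred mappings without approval are held and logged, not applied.
    """
    if inferred and approved_by is None:
        log.append({"action": "mapping_held", "mapping": mapping})
        return False
    log.append({
        "action": "mapping_applied",
        "mapping": mapping,
        "inferred": inferred,
        "approved_by": approved_by,
    })
    return True
```

Both outcomes are logged, so the audit history shows which automatically inferred transformations were applied, who approved them, and which were held back.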

8.5. Limitations of This Review

This work is a structured survey rather than a systematic review. It aims for representative, cross-layer coverage rather than exhaustiveness. No new empirical benchmarks were established; performance assessments reflect published studies and production reports. The 2015–2025 focus can introduce recency bias and version drift for rapidly evolving engines and connectors. Where feasible, results were triangulated with contemporary surveys. In addition, non-archival sources and vendor whitepapers were excluded, which may have led to the omission of operational detail. Finally, generalisation across domains is limited: the examples in Section 7 are indicative rather than comprehensive, and some modalities (e.g., unstructured media) fall outside the scope of this work.

9. Conclusions

This study explored the evolving dynamics of data integration and storage in analytics systems, with a focus on architectures, tools, and methodologies that target improved performance, scalability, and semantic consistency. Primary storage models, such as row-store and column-store systems, NoSQL databases, and lakehouse architectures, were compared in terms of their suitability for different workloads. Integration patterns were examined for ETL/ELT pipelines, federated query workloads, and metadata-centric orchestration. In addition, semantic enrichment, data provenance, and schema evolution were highlighted as key enablers of fault-tolerant and traceable data pipelines. Major challenges related to query optimisation, caching mechanisms, data freshness, and consistency were also discussed, together with real-world applications in enterprise, scientific, and public sector scenarios.

To build future-ready analytical infrastructure, practitioners should adopt modular, metadata-driven architectures; leverage open standards; and invest in governance-aware integration. Schema flexibility must be balanced against lineage tracking and reproducibility, and performance tuning must consider storage configuration together with federation overhead. Semantic interoperability and AI-powered tooling will play an increasingly important role in reducing integration cost and streamlining self-adaptive pipelines. Data teams can ensure scalability, agility, and fault tolerance within complex analytical environments by aligning technical design with organisational needs and applicable regulatory constraints.

Finally, this work offers a broad, multifaceted overview that connects integration methods to storage solutions. It combines a comparative view that highlights the trade-offs among ETL, ELT, virtualisation, and federated pushdown with a metadata- and lineage-focused view that turns performance and consistency controls into practical design tools across domains. This synthesis is intended as a decision aid for constructing regulated hybrid pipelines that can sustain performance and reproducibility in the face of rapidly changing schemas and workloads. In contrast to prior surveys that concentrated separately on lakes, federation, or lakehouses, this contribution unifies integration mechanisms with storage and governance under actionable, reproducible workflows, augmented by the latest AI-assisted methods.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The original contributions presented in this study are included in the

article. Further inquiries can be directed to the corresponding author.

Acknowledgments: The author would like to thank the anonymous reviewers for their constructive

feedback, which improved the paper substantially.

Conflicts of Interest: The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ACID Atomicity, Consistency, Isolation, and Durability

AI Artificial Intelligence

API Application Programming Interface

AWS Amazon Web Services

CIDOC CRM International Committee for Documentation Conceptual Reference Model

CPU Central Processing Unit

ELT Extract–Load–Transform

ER Entity Resolution

ETL Extract–Transform–Load

FIBO Financial Industry Business Ontology

GDPR General Data Protection Regulation

HDFS Hadoop Distributed File System

HL7 Health Level Seven

JSON JavaScript Object Notation

MDM Master Data Management

NoSQL Not Only SQL

OLAP Online Analytical Processing

OLTP Online Transaction Processing

ORC Optimised Row Columnar

OWL Web Ontology Language

RBAC Role-Based Access Control

RDF Resource Description Framework

REST Representational State Transfer

S3 Simple Storage Service

SDMX Statistical Data and Metadata Exchange

SIGMOD ACM Special Interest Group on Management of Data

VLDB Very Large Data Bases (Conference)

ICDE IEEE International Conference on Data Engineering

TKDE IEEE Transactions on Knowledge and Data Engineering

VLDBJ The VLDB Journal

SLA Service Level Agreement

SPARQL SPARQL Protocol and RDF Query Language

SQL Structured Query Language

SWEET Semantic Web for Earth and Environmental Terminology

XML Extensible Markup Language

References

1. Liu, C.; Pavlenko, A.; Interlandi, M.; Haynes, B. A Deep Dive into Common Open Formats for Analytical DBMSs. Proc. VLDB Endow. 2023, 16, 3044–3056. [CrossRef]

2. Zeng, X.; Hui, Y.; Shen, J.; Pavlo, A.; McKinney, W.; Zhang, H. An Empirical Evaluation of Columnar Storage Formats. Proc. VLDB Endow. 2023, 17, 148–161. [CrossRef]

3. Armbrust, M.; Das, T.; Sun, L.; Yavuz, B.; Zhu, S.; Murthy, M.; Torres, J.; van Hovell, H.; Ionescu, A.; Łuszczak, A.; et al. Delta Lake: High-performance ACID table storage over cloud object stores. Proc. VLDB Endow. 2020, 13, 3411–3424. [CrossRef]

4. Abadi, D.J.; Madden, S.R.; Hachem, N. Column-stores vs. row-stores: How different are they really? In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 967–980. [CrossRef]

5. Hai, R.; Koutras, C.; Quix, C.; Jarke, M. Data Lakes: A Survey of Functions and Systems. IEEE Trans. Knowl. Data Eng. 2023, 35, 12571–12590. [CrossRef]

6. Gu, Z.; Corcoglioniti, F.; Lanti, D.; Mosca, A.; Xiao, G.; Xiong, J.; Calvanese, D. A systematic overview of data federation systems. Semant. Web 2024, 15, 107–165. [CrossRef]

7. Sun, Y.; Meehan, T.; Schlussel, R.; Xie, W.; Basmanova, M.; Erling, O.; Rosa, A.; Fan, S.; Zhong, R.; Thirupathi, A.; et al. Presto: A Decade of SQL Analytics at Meta. Proc. ACM Manag. Data 2023, 1, 1–25. [CrossRef]

8. Potharaju, R.; Kim, T.; Song, E.; Wu, W.; Novik, L.; Dave, A.; Acharya, V.; Dhody, G.; Li, J.; Ramanujam, S.; et al. Hyperspace: The Indexing Subsystem of Azure Synapse. Proc. VLDB Endow. 2021, 14, 3043–3055. [CrossRef]

9. Dong, X.L.; Srivastava, D. Big Data Integration; Synthesis Lectures on Data Management; Springer Nature Switzerland AG: Cham, Switzerland, 2015. [CrossRef]

10. Okolnychyi, A.; Sun, C.; Tanimura, K.; Spitzer, R.; Blue, R.; Ho, S.; Gu, Y.; Lakkundi, V.; Tsai, D. Petabyte-Scale Row-Level Operations in Data Lakehouses. Proc. VLDB Endow. 2024, 17, 4159–4172. [CrossRef]

11. Alma’aitah, W.Z.; Quraan, A.; AL-Aswadi, F.N.; Alkhawaldeh, R.S.; Alazab, M.; Awajan, A. Integration Approaches for Heterogeneous Big Data: A Survey. Cybern. Inf. Technol. 2024, 24, 3–20. [CrossRef]

12. Pedreira, P.; Erling, O.; Basmanova, M.; Wilfong, K.; Sakka, L.; Pai, K.; He, W.; Chattopadhyay, B. Velox: Meta’s unified execution engine. Proc. VLDB Endow. 2022, 15, 3372–3384. [CrossRef]

13. Schneider, J.; Gröger, C.; Lutsch, A.; Schwarz, H.; Mitschang, B. The Lakehouse: State of the Art on Concepts and Technologies. SN Comput. Sci. 2024, 5, 449. [CrossRef]

14. Kaoudi, Z.; Quiané-Ruiz, J.A. Unified Data Analytics: State-of-the-Art and Open Problems. Proc. VLDB Endow. 2022, 15, 3778–3781. [CrossRef]

15. Fan, M.; Han, X.; Fan, J.; Chai, C.; Tang, N.; Li, G.; Du, X. Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 3696–3709. [CrossRef]

16. Zhang, Z.; Zeng, W.; Tang, J.; Huang, H.; Zhao, X. Active in-context learning for cross-domain entity resolution. Inf. Fusion 2025, 117, 102816. [CrossRef]

17. Taboada, M.; Martinez, D.; Arideh, M.; Mosquera, R. Ontology matching with Large Language Models and prioritized depth-first search. Inf. Fusion 2025, 123, 103254. [CrossRef]

18. Babaei Giglou, H.; D’Souza, J.; Engel, F.; Auer, S. LLMs4OM: Matching Ontologies with Large Language Models. In Proceedings of the Semantic Web: ESWC 2024 Satellite Events, Hersonissos, Greece, 26–30 May 2024; Meroño Peñuela, A., Corcho, O., Groth, P., Simperl, E., Tamma, V., Nuzzolese, A.G., Poveda-Villalón, M., Sabou, M., Presutti, V., Celino, I., et al., Eds.; Springer: Cham, Switzerland, 2025; pp. 25–35.

19. Barbon Junior, S.; Ceravolo, P.; Groppe, S.; Jarrar, M.; Maghool, S.; Sèdes, F.; Sahri, S.; Van Keulen, M. Are Large Language Models the New Interface for Data Pipelines? In Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, Santiago, Chile, 9–15 June 2024. [CrossRef]

20. Alidu, A.; Ciavotta, M.; Paoli, F.D. LLM-Based DAG Creation for Data Enrichment Pipelines in SemT Framework. In Proceedings of the Service-Oriented Computing—ICSOC 2024 Workshops: ASOCA, AI-PA, WESOACS, GAISS, LAIS, AI on Edge, RTSEMS, SQS, SOCAISA, SOC4AI and Satellite Events, Tunis, Tunisia, 3–6 December 2024; Springer Nature: Singapore, 2025; pp. 131–143. [CrossRef]

21. Rahm, E.; Bernstein, P.A. A Survey of Approaches to Automatic Schema Matching. VLDB J. 2001, 10, 334–350. [CrossRef]

22. Bleiholder, J.; Naumann, F. Data Fusion. ACM Comput. Surv. 2008, 41, 1–41. [CrossRef]

23. Cheney, J.; Chiticariu, L.; Tan, W. Provenance in Databases: Why, How, and Where. Found. Trends Databases 2009, 1, 379–474. [CrossRef]

24. Euzenat, J.; Shvaiko, P. Ontology Matching, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2013. [CrossRef]

25. Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer: Berlin/Heidelberg, Germany, 2012. [CrossRef]

26. Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 2020, 53, 1–42. [CrossRef]

27. Buneman, P.; Khanna, S.; Tan, W. Why and Where: A Characterization of Data Provenance. In Proceedings of the 8th International Conference on Database Theory (ICDT), London, UK, 4–6 January 2001; Springer: Berlin/Heidelberg, Germany, 2001; Volume 1973, pp. 316–330. [CrossRef]

28. ISO 8601-2:2019; Date and Time—Representations for Information Interchange—Part 2: Extensions. International Organization for Standardization: Geneva, Switzerland, 2019; Confirmed 2024; Amendment 1:2025.

29. Bellahsene, Z.; Bonifati, A.; Rahm, E. Schema Matching and Mapping. In Schema Matching and Mapping; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–20. [CrossRef]

30. Parciak, M.; Vandevoort, B.; Neven, F.; Peeters, L.M.; Vansummeren, S. LLM-Matcher: A Name-Based Schema Matching Tool using Large Language Models. In Proceedings of the Companion of the 2025 International Conference on Management of Data, Berlin, Germany, 22–27 June 2025; pp. 203–206. [CrossRef]

31. Shi, D.; Meyer, O.; Oberle, M.; Bauernhansl, T. Dual data mapping with fine-tuned large language models and asset administration shells toward interoperable knowledge representation. Robot. Comput. Integr. Manuf. 2025, 91, 102837. [CrossRef]

32. Wagner, R.A.; Fischer, M.J. The String-to-String Correction Problem. J. ACM 1974, 21, 168–173. [CrossRef]

33. Rodrigues, D.; da Silva, A. A Study on Machine Learning Techniques for the Schema Matching Network Problem. J. Braz. Comput. Soc. 2021, 27, 1–22. [CrossRef]

34. Popa, L.; Velegrakis, Y.; Miller, R.J.; Hernández, M.A.; Fagin, R. Chapter 52—Translating Web Data. In VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, China, 20–23 August 2002; Bernstein, P.A., Ioannidis, Y.E., Ramakrishnan, R., Papadias, D., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2002; pp. 598–609. [CrossRef]

35. Binette, O.; Steorts, R.C. (Almost) All of Entity Resolution. Sci. Adv. 2022, 8, eabi8021. [CrossRef]

36. Kemper, A.; Neumann, T. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 11–16 April 2011; pp. 195–206. [CrossRef]

37. Lamb, A.; Fuller, M.; Varadarajan, R.; Tran, N.; Vandiver, B.; Doshi, L.; Bear, C. The Vertica Analytic Database: C-Store 7 Years Later. Proc. VLDB Endow. 2012, 5, 1790–1801. [CrossRef]

38. Schulze, R.; Schreiber, T.; Yatsishin, I.; Dahimene, R.; Milovidov, A. ClickHouse—Lightning Fast Analytics for Everyone. Proc. VLDB Endow. 2024, 17, 3731–3744. [CrossRef]

39. Wang, J.; Lin, C.; Papakonstantinou, Y.; Swanson, S. An Experimental Study of Bitmap Compression vs. Inverted List Compression. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 993–1008. [CrossRef]

40. Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. Better bitmap performance with Roaring bitmaps. Softw. Pract. Exp. 2016, 46, 709–719. [CrossRef]

41. Ivanov, T.; Pergolesi, M. The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet. Concurr. Comput. Pract. Exp. 2020, 32, e5523. [CrossRef]

42. Abadi, D.; Madden, S.; Ferreira, M. Integrating Compression and Execution in Column-Oriented Database Systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 27–29 June 2006; pp. 671–682. [CrossRef]

43. Sikka, V.; Färber, F.; Lehner, W.; Cha, S.K.; Peh, T.; Bornhövd, C. Efficient transaction processing in SAP HANA database: The end of a column store myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 731–742. [CrossRef]

44. DeCandia, G.; Hastorun, D.; Jampani, M.; Kakulapati, G.; Lakshman, A.; Pilchin, A.; Sivasubramanian, S.; Vosshall, P.; Vogels, W. Dynamo: Amazon’s highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, WA, USA, 14–17 October 2007; pp. 205–220. [CrossRef]

45. O’Neil, P.; Cheng, E.; Gawlick, D.; O’Neil, E. The Log-Structured Merge-Tree (LSM-Tree). Acta Inform. 1996, 33, 351–385. [CrossRef]

46. Idreos, S.; Callaghan, M. Key-Value Storage Engines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 2667–2672. [CrossRef]

47. Alsubaiee, S.; Altowim, Y.; Altwaijry, H.; Behm, A.; Borkar, V.; Bu, Y.; Carey, M.; Cetindil, I.; Cheelangi, M.; Faraaz, K.; et al. AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow. 2014, 7, 1905–1916. [CrossRef]

48. Carvalho, I.; Sá, F.; Bernardino, J. Performance Evaluation of NoSQL Document Databases: Couchbase, CouchDB, and MongoDB. Algorithms 2023, 16, 78. [CrossRef]

49. Besta, M.; Gerstenberger, R.; Peter, E.; Fischer, M.; Podstawski, M.; Barthels, C.; Alonso, G.; Hoefler, T. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Comput. Surv. 2023, 56, 1–40. [CrossRef]

50. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 1433–1445. [CrossRef]

51. Melnik, S.; Gubarev, A.; Long, J.J.; Romer, G.; Shivakumar, S.; Tolton, M.; Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Commun. ACM 2011, 54, 114–123. [CrossRef]

52. Rey, A.; Rieger, M.; Neumann, T. Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key Generation and Joins. Proc. ACM Manag. Data 2025, 3, 1–24. [CrossRef]

53. Ghemawat, S.; Gobioff, H.; Leung, S. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, USA, 19–22 October 2003; pp. 29–43. [CrossRef]

54. Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010; pp. 1–10. [CrossRef]

55. Sethi, R.; Traverso, M.; Sundstrom, D.; Phillips, D.; Xie, W.; Sun, Y.; Yegitbasi, N.; Jin, H.; Hwang, E.; Shingte, N.; et al. Presto: SQL on Everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1802–1813. [CrossRef]

56. Vassiliadis, P. A Survey of Extract–Transform–Load Technology. Int. J. Data Warehous. Min. 2009, 5, 1–27. [CrossRef]

57. Almeida, J.R.; Coelho, L.; Oliveira, J.L. BIcenter: A Collaborative Web ETL Solution Based on a Reflective Software Approach. SoftwareX 2021, 16, 100892. [CrossRef]

58. Kolev, B.; Valduriez, P.; Bondiombouy, C.; Jiménez, R.; Pau, R.; Pereira, J. CloudMdsQL: Querying heterogeneous cloud data stores with a common language. Distrib. Parallel Databases 2015, 34, 463–503. [CrossRef]

59. Behm, A.; Palkar, S.; Agarwal, U.; Armstrong, T.; Cashman, D.; Dave, A.; Greenstein, T.; Hovsepian, S.; Johnson, R.; Sai Krishnan, A.; et al. Photon: A Fast Query Engine for Lakehouse Systems. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 2326–2339. [CrossRef]

60. Hausenblas, M.; Nadeau, J. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data 2013, 1, 100–104. [CrossRef]

61. Eichler, R.; Berti-Equille, L.; Darmont, J. Modeling metadata in data lakes—A generic model. Data Knowl. Eng. 2021, 134, 101931. [CrossRef]

62. Herschel, M.; Diestelkämper, R.; Ben Lahmar, H. A survey on provenance: What for? What form? What from? VLDB J. 2017, 26, 881–906. [CrossRef]

63. Jahnke, N.; Otto, B. Data Catalogs in the Enterprise: Applications and Integration. Datenbank-Spektrum 2023, 23, 89–96. [CrossRef]

64. The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2020, 49, D325–D334. [CrossRef]

65. Raskin, R.G.; Pan, M.J. Knowledge representation in the Semantic Web for Earth and Environmental Terminology (SWEET). Comput. Geosci. 2005, 31, 1119–1125. [CrossRef]

66. Niccolucci, F.; Doerr, M. Extending, mapping, and focusing the CIDOC CRM. Int. J. Digit. Libr. 2017, 18, 251–252. [CrossRef]

67. Petrova, G.G.; Tuzovsky, A.F.; Aksenova, N.V. Application of the Financial Industry Business Ontology (FIBO) for development of a financial organization ontology. J. Phys. Conf. Ser. 2017, 803, 012116. [CrossRef]

68. Mandel, J.C.; Kreda, D.A.; Mandl, K.D.; Kohane, I.S.; Ramoni, R.B. SMART on FHIR: A standards-based, interoperable apps platform for electronic health records. J. Am. Med. Inform. Assoc. 2016, 23, 899–908. [CrossRef]

69. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [CrossRef]

70. Bizer, C.; Lehmann, J.; Kobilarov, G.; Auer, S.; Becker, C.; Cyganiak, R.; Hellmann, S. DBpedia—A crystallization point for the Web of Data. J. Web Semant. 2009, 7, 154–165. [CrossRef]

71. Moreau, L.; Groth, P.; Cheney, J.; Lebo, T.; Miles, S. The rationale of PROV. J. Web Semant. 2015, 35, 235–257. [CrossRef]

72. Kim, B.; Niu, S.; Ding, B.; Kraska, T.; Luo, J.; Luo, W.; Tang, C.; Wang, Z.; Zhang, C.; Zhou, J. Learned Cardinality Estimation: An In-depth Study. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD), Philadelphia, PA, USA, 12–17 June 2022; pp. 1214–1227. [CrossRef]

73. Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the 1st International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM@SIGMOD), Houston, TX, USA, 10 June 2018; pp. 3:1–3:4. [CrossRef]

74. Marcus, R.; Negi, P.; Mao, H.; Tatbul, N.; Alizadeh, M.; Kraska, T. Bao: Making Learned Query Optimization Practical. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), Xi’an, China, 20–25 June 2021; pp. 1275–1288. [CrossRef]

75. Ahmad, Y.; Kennedy, O.; Koch, C.; Nikolic, M. DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. Proc. VLDB Endow. 2012, 5, 968–979. [CrossRef]

76. Budiu, M.; Chajed, T.; McSherry, F.; Ryzhyk, L.; Tannen, V. DBSP: Automatic Incremental View Maintenance for Rich Query Languages. Proc. VLDB Endow. 2023, 16, 1601–1614. [CrossRef]

77. Elghandour, I.; Aboulnaga, A. ReStore: Reusing Results of MapReduce Jobs. Proc. VLDB Endow. 2012, 5, 586–597. [CrossRef]

78.

Armbrust, M.; Das, T.; Torres, J.; Yavuz, B.; Zhu, S.; Xin, R.; Ghodsi, A.; Stoica, I.; Zaharia, M. Structured Streaming: A Declarative

API for Real-Time Applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data,

Houston, TX, USA, 10–15 June 2018; pp. 601–613. [CrossRef]

79.

Akidau, T.; Begoli, E.; Chernyak, S.; Hueske, F.; Knight, K.; Knowles, K.; Mills, D.; Sotolongo, D. Watermarks in stream processing

systems: Semantics and comparative analysis of Apache Flink and Google cloud dataﬂow. Proc. VLDB Endow. 2021,14, 3135–3147.

[CrossRef]

80. Vogels, W. Eventually Consistent. Commun. ACM 2009,52, 40–44. [CrossRef]

81.

Schelter, S.; Biessmann, F.; Januschowski, T.; Salinas, D.; Seufert, S.; Krettek, A. Automating Large-Scale Data Quality Veriﬁcation.

Proc. Vldb Endow. 2018,11, 1781–1794. [CrossRef]

82.

Janssen, N.; Ilayperuma, T.; Arachchige, J.J.; Bukhsh, F.A.; Daneva, M. The evolution of data storage architectures: Examining the

secure value of the data lakehouse. J. Data, Inf. Manag. 2024,6, 309–334. [CrossRef]

83. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018. [CrossRef]

84. Callahan, T.J.; Tripodi, I.J.; Stefanski, A.L.; Cappelletti, L.; Taneja, S.B.; Wyrwa, J.M.; Casiraghi, E.; Matentzoglu, N.A.; Reese, J.; Silverstein, J.C.; et al. An open source knowledge graph ecosystem for the life sciences. Sci. Data 2024, 11, 363. [CrossRef]

85. Gowan, T.A.; Horel, J.D.; Jacques, A.A.; Kovac, A. Using Cloud Computing to Analyze Model Output Archived in Zarr Format. J. Atmos. Ocean. Technol. 2022, 39, 449–462. [CrossRef]

86. Moore, J.; Basurto-Lozada, D.; Besson, S.; Bogovic, J.; Bragantini, J.; Brown, E.M.; Burel, J.; Moreno, X.C.; Medeiros, G.d.; Diel, E.E.; et al. OME-Zarr: A cloud-optimized bioimaging file format with international community support. Histochem. Cell Biol. 2023, 160, 223–251. [CrossRef]

87. Joyce, A.; Javidroozi, V. Smart City Development: Data Sharing vs. Data Protection Legislations. Cities 2024, 148, 104859. [CrossRef]

88. Ahmi, A. OpenRefine: An Approachable Tool for Cleaning and Harmonizing Bibliographical Data. AIP Conf. Proc. 2023, 2827, 030006. [CrossRef]

89. Willekens, F. Programmatic Access to Open Statistical Data for Population Studies: The SDMX Standard. Demogr. Res. 2023, 49, 1117–1162. [CrossRef]

90. Kayali, M.; Lykov, A.; Fountalis, I.; Vasiloglou, N.; Olteanu, D.; Suciu, D. Chorus: Foundation Models for Unified Data Discovery and Exploration. Proc. VLDB Endow. 2024, 17, 2104–2114. [CrossRef]

91. Leipzig, J.; Nüst, D.; Hoyt, C.T.; Ram, K.; Greenberg, J. The role of metadata in reproducible computational research. Patterns 2021, 2, 100322. [CrossRef]

92. Ahmad, T. Benchmarking Apache Arrow Flight - A wire-speed protocol for data transfer, querying and microservices. In Proceedings of the Benchmarking in the Data Center: Expanding to the Cloud, Seoul, Republic of Korea, 2–6 April 2022; [CrossRef]

93. Shraga, R.; Gal, A. PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching. J. Data Inf. Qual. 2022, 14, 1–27. [CrossRef]

94. Zhang, J.; Shin, B.; Choi, J.D.; Ho, J.C. SMAT: An Attention-Based Deep Learning Solution to the Automation of Schema Matching. In Advances in Databases and Information Systems (ADBIS 2021); Springer: Berlin/Heidelberg, Germany, 2021; Volume 12843, Lecture Notes in Computer Science; pp. 260–274. [CrossRef]

95. European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data. Off. J. Eur. Union 2016, 679, 10–13.

96. European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 on harmonised rules on artificial intelligence. Off. J. Eur. Union 2024. Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng (accessed on 21 October 2025).

97. Álvarez, J.M.; Colmenarejo, A.B.; Elobaid, A.; Fabbrizzi, S.; Fahimi, M.; Ferrara, A.; Ghodsi, S.; Mougan, C.; Papageorgiou, I.; Lobo, P.R.; et al. Policy advice and best practices on bias and fairness in AI. Ethics Inf. Technol. 2024, 26, 31. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


