THE ROLE OF ETL (EXTRACT-TRANSFORM-LOAD) PIPELINES IN SCALABLE BUSINESS INTELLIGENCE: A COMPARATIVE STUDY OF DATA INTEGRATION TOOLS PDF Free Download

1 / 33
0 views33 pages

THE ROLE OF ETL (EXTRACT-TRANSFORM-LOAD) PIPELINES IN SCALABLE BUSINESS INTELLIGENCE: A COMPARATIVE STUDY OF DATA INTEGRATION TOOLS PDF Free Download

THE ROLE OF ETL (EXTRACT-TRANSFORM-LOAD) PIPELINES IN SCALABLE BUSINESS INTELLIGENCE: A COMPARATIVE STUDY OF DATA INTEGRATION TOOLS PDF free Download. Think more deeply and widely.

ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
89
World Summit on Scientific Research and Innovation 2022,
April 1822, 2022, Florida, USA
THE ROLE OF ETL (EXTRACT-TRANSFORM-LOAD) PIPELINES IN
SCALABLE BUSINESS INTELLIGENCE: A COMPARATIVE STUDY OF
DATA INTEGRATION TOOLS
Doi: 10.63125/1spa6877
Peer-review under responsibility of the organizing committee of WSSRI, 2022
Abstract
This study systematically reviews the role of ExtractTransformLoad (ETL) pipelines in scalable
business intelligence (BI), with particular emphasis on their evolution, tool ecosystems, performance
optimization, and global governance implications. Following the Preferred Reporting Items for
Systematic Reviews and Meta-Analyses (PRISMA) framework, a total of 63 studies were identified,
screened, and synthesized from academic databases and grey literature. The findings reveal that
ETL pipelines, once predominantly batch-oriented, have expanded into ELT and streaming
paradigms, enabled by cloud-native warehouses and distributed architectures. Across the reviewed
literature, data quality, metadata management, and lineage emerge as central imperatives for BI
reliability, extending beyond technical efficiency to encompass governance, compliance, and
accountability. Comparative analyses highlight the distinct strengths of commercial platforms such
as Informatica, IBM DataStage, and Microsoft SSIS, contrasted with the flexibility and cost-
effectiveness of open-source frameworks including Talend, Pentaho, and Apache NiFi. Cloud-native
services such as AWS Glue, Azure Data Factory, and Google Dataflow are shown to embed
scalability and governance into serverless pipelines, while innovations like Apache Spark and Delta
Lake provide ACID-compliant lakehouse capabilities for enterprise analytics. The review also
demonstrates how global governance frameworksincluding GDPR, CCPA, OECD, and
UNCTADnecessitate embedding compliance into ETL processes through metadata, lineage, and
documentation. Overall, the study concludes that ETL pipelines are not merely technical workflows
but socio-technical infrastructures that sustain BI scalability, institutional trust, and regulatory
legitimacy in global data environments.
Keywords
ETL, Business Intelligence, ELT, Data Governance, Cloud Integration
[1]. Master of Science in Information Technology (MSIT), Washington University of Science and
Technology, Alexandria, VA 22314, USA;
Email: danishmahmud786@gmail.com
[2]. Master of Science in Information Technology, Washington University of Science and Technology,
VA, USA; Email: zaforikbal29@gmail.com
Danish Mahmud1; Md. Zafor Ikbal2
Volume: 2; Issue: 1
Pages: 89121
Published: 29 April 2022
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
90
INTRODUCTION
The extracttransformload (ETL) paradigm is a foundational data integration process in which
heterogeneous data are extracted from sources, transformed into consistent, analytics-ready structures,
and loaded into target repositories such as data warehouses, data lakes, or lakehouses (IBM, 2021).
Business intelligence (BI), meanwhile, is commonly defined as the suite of applications, platforms, and
practices that enable access to and analysis of information to improve and optimize decision-making
and performance (Nwokeji & Matovu, 2021).
Figure 1: ETL Pipelines for Business Intelligence
Within this relationship, ETL pipelines operationalize the movement and preparation of data that BI
platforms visualize and model. Internationally, the role of ETL is magnified by the growth of the
“Global DataSphere,” with IDC projecting continued expansion in the amount of data created,
captured, replicated, and consumed, intensifying demands on integration and analytics infrastructure.
The global data integration market reflects this centrality, with recent analyses estimating double-digit
compound growth rates through the next decade, indicating sustained enterprise investment in
integration capabilities that support analytics at scale (Pan et al., 2018). Professional bodies also codify
the governance context in which ETL operates: the DAMA-DMBOK frames integration as a core
discipline of data management, linked to data quality, metadata, and governance. In modern
ecosystems, ETL’s scope spans batch ingestion, near-real-time replication, and API-based
consolidation, often feeding cloud-native BI platforms where semantic models, dashboards, and self-
service tools depend on timely, standardized data. Accordingly, scalable ETL pipelines form the
connective tissue of international BI programs, aligning operational, regulatory, and analytical needs
across borders, regions, and business units (Godinho et al., 2019).
The evolution from traditional ETL to variations such as extract-load-transform (ELT) and streaming
ETL reflects shifts in storage, compute, and workload patterns. ELT leverages the scalability of modern
cloud warehousing and lakehouse engines to push down transformations after loading, offering
architectural flexibility and potentially improved performance for large datasets (Jovanovic et al., 2016).
Comparative scholarship and practitioner reports examine trade-offs across performance, cost,
security, and operational complexity, emphasizing that ETL and ELT are complementary rather than
mutually exclusive and are often combined in hybrid pipelines. Historical debates—such as Kimball’s
dimensional modeling versus Inmon’s normalized enterprise warehouse—remain useful for
understanding staging, conformance, and data modeling choices that still shape ETL design (Biplob et
al., 2018). Concurrently, big-data architectures popularized by the Lambda and Kappa patterns codified
batch-plus-streaming or streaming-only approaches for web-scale analytics (Mukherjee & Kar, 2017).
Mainstream engineering now integrates micro-batch and continuous ingestion with orchestration and
metadata management, blending scheduling, data quality, and lineage capture. Public technical
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
91
literature clarifies that modern ETL pipelines allocate work to distributed engines, apply schema
enforcement, and utilize parallelism, partitioning, and push-down optimization across hybrid clouds.
As organizations integrate structured and semi-structured feeds, contemporary pipelines support
CDC-based ingestion, late-arriving data handling, and format-aware transforms, aligning ingestion
strategies with target compute and storage systems (Villegas-Ch et al., 2020).
Comparative evaluation of ETL pipelines for scalable BI often begins with data quality, governance,
and metadatadimensions that determine trust and reuse. The DAMA-DMBOK and associated
guidance identify quality as multidimensional, with accuracy, completeness, consistency, timeliness,
uniqueness, and validity frequently used in assessment (Diouf et al., 2017). International standards
such as ISO 8000-1:2022 provide a harmonized overview of principles and practices for data quality
management, profiling, and improvement across the data lifecycle, offering a basis for aligning ETL
controls with enterprise policy. Research proposals for ETL design evaluation frameworks stress model
characteristics, maintainability, and efficiency as predictors of pipeline performance at scale (Ghosh et
al., 2015). Engineering studies underscore architectural choices such as service decomposition,
metadata-driven transformations, and streaming ingestion’s interaction with backpressure and
ordering guarantees. In practice, lineage and metadata management support auditability and
reproducibility, documenting data origins and transformation steps to meet regulatory and internal
control requirements; practitioner and vendor sources converge on lineage as essential for cross-team
coordination and BI reliability (Oumkaltoum et al., 2019). These criteriaquality checks, governance
alignment, lineage capture, and performance-aware designprovide a comparative lens for analyzing
tools and services, regardless of whether a solution is commercial, open-source, or cloud-native. Under
this lens, ETL tooling is assessed not simply by connector breadth but by how well it embeds quality
policies, exposes metadata, and sustains throughput with verifiable semantics.
Commercial platforms have long anchored enterprise-scale ETL. Informatica PowerCenter, for
example, provides a mature service-oriented architecture with domain-level administration, parallel
processing capabilities, and rich metadata exploration that are widely documented in vendor materials
and user guides (Pablo, 2016). IBM DataStage, available both as a standalone and as part of IBM’s data
fabric offerings, supports ETL and ELT patterns with parallel jobs and enterprise connectivity designed
for multicloud and hybrid environments. Microsoft SQL Server Integration Services (SSIS) remains a
widely deployed on-premises platform whose architecture separates control flow and data flow
engines and integrates with DevOps and Azure runtimes where needed (Mwilu et al., 2016). In
comparative discussions, evaluators often weigh these tools’ metadata frameworks, error handling,
reusability patterns, and breadth of connectors against total cost of ownership and team skill
requirements (academic and practitioner comparisons). Azure Data Factory (ADF), though a cloud
service, is frequently juxtaposed with SSIS because it offers visual data flows and orchestration with
broad connector coverage and can host SSIS packages in the cloud (Lukić et al., 2016). Across these
products, criteria such as governance features, scheduling/orchestration depth, CDC support, and
performance optimization (e.g., pushdown, partitioning) frequently determine fit in BI programs,
especially where consistent SLAs and catalog integration are required for reporting layers. The
established commercial tools demonstrate ecosystem maturity, extensive documentation, and
enterprise support models, which many organizations prioritize when aligning BI and data
management strategies across global or regulated contexts (Theodorou et al., 2016).
Open-source integration tooling complements and sometimes substitutes for proprietary stacks,
offering flexibility and community-driven extensibility. Talend Open Studio (legacy community
editions) and Pentaho Data Integration (Kettle) exemplify graphical ETL workbenches with
component-based jobs, CDC options, and broad connectivity; public documentation and community
resources detail their data flow, job execution, and “Kettle” lineage (Talend; Pentaho/HV docs). Flow-
based platforms such as Apache NiFi emphasize visual streaming pipelines, backpressure controls, and
fine-grained provenance, enabling event-oriented integration across edge and cloud (Guarda & Lopes,
2022). Connector catalogs and interoperability standards further expand options: Airbyte maintains
hundreds of open-source connectors for databases and SaaS APIs, while Singer defines a JSON-based
taps/targets specification used across ELT ecosystems. Orchestration layers like Apache Airflow and
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
92
Prefect address scheduling, dependency management, retries, and observabilitycore concerns for
production ETL reliability that are distinct from transformation itself but integral to pipeline operations
(Airflow; Prefect). In streaming contexts, Apache Kafka and Kafka Connect provide distributed logs
and managed connector frameworks for high-throughput ingestion, often powering near-real-time
feeds into warehouses or lakehouses and integrating with CDC platforms such as Debezium
(Zdravevski et al., 2020). Comparative studies and practitioner reviews consistently highlight trade-
offs: open-source solutions often excel in extensibility and cost control, with variability in enterprise
support, whereas commercial platforms bundle governance and support models that some
organizations prize. Across all, the unifying thread is standards-aware designschema management,
lineage capture, and testable transformationswhich underpins BI trust and scalability (Lanza-Cruz
et al., 2018).
Figure 2: Cloud-Native ETL for Scalable BI
Cloud-native data integration services foreground elasticity, serverless operations, and integration
with managed compute engines central to scalable BI. AWS Glue offers serverless ETL that discovers,
prepares, and moves data across stores and streams, with a managed Data Catalog and integration with
services such as Lake Formation and DataBrew. Azure Data Factory provides pipeline orchestration
and visual data flows with extensive connector coverage and integration across Azure compute (e.g.,
Databricks, Synapse), while also supporting SSIS runtimes (Valdiviezo-Díaz et al., 2015). On Google
Cloud, Dataflow (based on Apache Beam) unifies batch and streaming pipelines under a single
programming model, and Cloud Data Fusion offers a graphical, fully managed integration studio;
BigQuery Data Transfer Service automates periodic loads from SaaS, cloud storage, and other Google
services directly into the analytics warehouse. Modern ETL also leverages distributed compute engines
like Apache Spark for transformation at scale and columnar formats such as Parquet and ORC for
efficient storage and query performance. These services and formats are frequently evaluated together
because they shape end-to-end throughput, cost, and reliabilitye.g., pushdown into warehouse
engines for ELT, autoscaling and parallelism for Dataflow/Spark jobs, and predicate/column pruning
on Parquet/ORC tables to reduce I/O. In practice, comparative assessments consider connector depth,
regional availability, metadata integration, security controls, scheduling/orchestration features, and
TCO within the target cloud (Krawatzeck et al., 2015). For BI, the outcome is pipelines that can meet
refresh SLAs for dashboards and semantic models while aligning to governance and regional data-
residency requirements.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
93
Performance engineering in ETL for scalable BI relies on strategies that align data characteristics,
compute patterns, and storage formats. Research and practitioner guidance emphasize partitioning,
parallelism, vectorized execution, and pushdown as levers for throughput and latency, with distributed
engines like Spark providing structured streaming and SQL semantics for both micro-batch and
continuous processing (Nešetřil & Šembera, 2017). For streaming ETL, definitions emphasize
continuous extraction and transformation over event streams, with architectures that incorporate
backpressure, windowing, and exactly-once or effectively-once semantics where supported (Hazelcast;
Airbyte streaming overview). CDC has become a focal technique for freshness, in which tools like
Debezium capture inserts, updates, and deletes from source databases and publish them to Kafka
topics, from which downstream sinks can build materialized views or warehouse tables incrementally
(Mallek et al., 2018). Comparative evaluations also consider testing frameworks and design metrics
e.g., how metadata-driven jobs and service decomposition improve maintainability and monitoring
(IJIRMPS; ETL testing/comparison guides). In storage, open table formats (Delta Lake, Apache Hudi,
Apache Iceberg) enrich Parquet-based data lakes with ACID transactions, time travel, and schema
evolution, improving reliability for BI workloads that demand consistent snapshots and reproducible
aggregates (Larson & Chang, 2016). Collectively, these techniques shape the empirical profile of an ETL
stackmeasured by sustained throughput, failure handling, auditability, and reproducibilitythat BI
consumers experience as timely reports and stable semantic layers.
International significance ultimately extends beyond scale to governance across jurisdictions, where
ETL pipelines intersect with privacy, quality, and cross-border rules that shape BI programs. The EU
General Data Protection Regulation (GDPR) articulates principles for international transfers, requiring
appropriate safeguards and legal bases when data move to third countries (Azevedo et al., 2022). OECD
and UNCTAD analyses map policy approaches to cross-border data flows among G7/G20 economies,
highlighting how organizational and technical measures facilitate data sharing with trust. Standards
for data quality (Bimonte et al., 2021) provide a vocabulary and process orientation that ETL teams
implement through profiling, validation, monitoring, and remediation embedded in jobs and
workflows. In practice, metadata management and lineage give visibility into sources, transformations,
and destinations, enabling audits and reinforcing accountability within BI. CDC and streaming
frameworkssuch as Debezium and Kafka Connectfit into this compliance architecture by
documenting change capture and propagation paths, supporting transparent incident analysis and
recovery procedures when combined with robust cataloging (Wang, 2016). Comparative study of
integration tools therefore incorporates not only performance and feature matrices but also alignment
with standards, regulatory constraints, and documentation practices that BI stakeholders rely on to
interpret and trust insights across borders. In multinational environments, these governance and
quality facets create the conditions for scalable intelligence: consistent semantics, controlled movement,
and verifiable provenance implemented through integration pipelines (Raj et al., 2016).
LITERATURE REVIEW
The scholarly discourse on ETL pipelines in business intelligence (BI) has evolved substantially over
the past two decades, reflecting changes in data volumes, integration paradigms, and the international
expansion of digital ecosystems. Early research framed ETL primarily as a technical process for moving
structured data into relational data warehouses (Pape, 2016). However, with the rise of big data,
distributed architectures, and cloud-native infrastructures, academic and industry studies increasingly
emphasize ETL’s role in ensuring data quality, governance, and scalability within BI systems. Literature
in this field encompasses multiple domains: traditional ETL methodologies, transformations toward
ELT and streaming ETL, tool-specific evaluations, open-source versus commercial comparisons, and
the impact of regulatory and governance frameworks on pipeline design.
Several systematic reviews and comparative studies explore the strengths and limitations of specific
ETL platforms, including Informatica, Talend, IBM DataStage, and emerging cloud-native tools such
as AWS Glue, Azure Data Factory, and Google Dataflow. Research also interrogates the intersection
between ETL and BI scalability, focusing on performance optimization techniques, schema evolution,
and metadata-driven orchestration frameworks. Parallel streams of scholarship address international
data governanceparticularly ISO 8000 and GDPRwhich establish compliance obligations that ETL
pipelines must accommodate while supporting BI-driven insights (Godinho et al., 2019).
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
94
This literature review systematically synthesizes these contributions by thematically organizing the
discourse into core subsections. The aim is to provide a comparative and multidimensional
understanding of ETL’s role in scalable BI, spanning definitions, technical innovations, tool
comparisons, governance, and global significance. The structure outlined below reflects both historical
foundations and contemporary transformations, enabling a granular assessment of how ETL pipelines
underpin BI practices in complex, data-rich environments.
Conceptual Foundations of ETL in Business Intelligence
ExtractTransformLoad (ETL) has been a central concept in data management since the late 20th
century, emerging as a technical framework for consolidating data from heterogeneous sources into
structured repositories for analytical use. The process consists of three distinct but interdependent
stages: extraction, transformation, and loading.
Figure 3: ETL Pipelines in Business Intelligence
Extraction involves pulling data from transactional systems, flat files, APIs, or legacy environments,
often in disparate formats (Zdravevski et al., 2020). Transformation encompasses data cleansing,
standardization, deduplication, and conformance to ensure consistency, reliability, and semantic
integrity. Loading refers to the final stage of depositing curated data into target destinations,
traditionally enterprise data warehouses, where information becomes queryable and suitable for
reporting. Historically, ETL pipelines were batch-oriented, operating within nightly or periodic
schedules to refresh warehouse environments (Raj et al., 2016). From a historical standpoint, ETL
practices were shaped by two major schools of thought in data warehousing. Bill Inmon emphasized a
centralized, normalized warehouse architecture, while Ralph Kimball promoted a dimensional
modeling approach supported by star schemas and data marts. Despite differing philosophies, both
approaches highlighted the indispensability of ETL as the mechanism for preparing integrated, high-
quality data. Early literature emphasized challenges such as slow data processing, lack of real-time
support, and significant costs associated with hardware and ETL tool licensing (Behrisch et al., 2018).
Over time, innovations in distributed computing and open-source frameworks began addressing these
bottlenecks.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
95
ETL’s historical evolution also reflected the growing demand for cross-departmental reporting and
decision support. The introduction of metadata-driven ETL and model-driven engineering approaches
signaled a shift from procedural coding toward reusable, automated designs (Imran et al., 2020). These
foundations underpin modern extensions such as extractloadtransform (ELT), streaming ETL, and
cloud-based integration. Even as architectures change, the historical origins of ETL reveal its dual
nature: both a technical function for data processing and a conceptual bridge linking operational data
sources with strategic business intelligence systems (Jackson et al., 2018). Moreover, ETL is not an
isolated process but rather a cornerstone of the broader business intelligence (BI) ecosystem, where its
primary function is to transform raw data into structured insights. BI is commonly defined as the
integration of technologies, processes, and practices that allow organizations to analyze information
for improved decision-making (Raschka et al., 2020). ETL pipelines provide the foundational layer
upon which BI tools such as dashboards, OLAP cubes, and predictive analytics operate. Without ETL,
the heterogeneity and inconsistency of operational data would render BI insights unreliable and
fragmented.
In the traditional BI stack, ETL serves as the operational backbone feeding data warehouses, which in
turn support online analytical processing and reporting systems (Luengo et al., 2020). This relationship
was particularly significant during the rise of enterprise data warehouses in the 1990s and 2000s, when
organizations required unified, enterprise-wide views of data. The scalability of BI was thus directly
linked to the robustness of ETL pipelines that could process large-scale transactional and historical
records. As the demand for near-real-time analytics increased, ETL practices adapted by integrating
message queues, incremental loading, and change data capture (Darmont et al., 2022). Moreover, In
contemporary BI ecosystems, ETL supports not only structured relational data but also unstructured
and semi-structured sources, including logs, social media feeds, IoT streams, and cloud services. BI has
expanded into self-service and AI-enhanced analytics, yet all these capabilities depend on curated,
trusted datasets provided through ETL or ELT pipelines. Moreover, the integration of ETL with
governance systems such as data catalogs and master data management ensures that BI consumers can
trace lineage, verify quality, and comply with policies (Schintler & McNeely, 2022). In this way, ETL
enables a transition from mere data collection to actionable intelligence, underpinning the scalability
and reliability of BI systems across industries and geographies.
The global significance of ETL pipelines in BI is underscored by the adoption of international standards
and frameworks that guide data quality, integration, and governance practices. ISO 8000, the
international standard for data quality, provides a structured framework for managing data as an asset,
with principles for ensuring accuracy, consistency, and interoperability. ISO 8000 emphasizes the role
of integration processes such as ETL in enforcing quality at the point of ingestion and transformation,
ensuring that data loaded into analytical systems meets compliance and reliability requirements
(Pusala et al., 2016). By embedding these practices, organizations align with internationally recognized
guidelines, which is critical for cross-border data flows and multinational BI operations. Moreover, In
parallel, the Data Management Body of Knowledge (DAMA-DMBOK) positions data integration as one
of the fundamental functions of enterprise data management. (Pusala et al., 2016) stresses that
integration encompasses ETL, ELT, replication, and virtualization, all of which must be orchestrated to
ensure cohesive BI operations. This framework aligns ETL pipelines with other disciplines such as data
governance, metadata management, and master data management, creating a holistic view of how data
supports organizational objectives. The DAMA-DMBOK also emphasizes stewardship, accountability,
and lifecycle management, highlighting how ETL is not purely a technical function but part of a socio-
technical system involving people, processes, and technology.
From an international governance perspective, adherence to standards such as ISO 8000 and DAMA-
DMBOK also facilitates regulatory compliance. The European Union’s General Data Protection
Regulation (GDPR), for example, requires organizations to maintain auditable data processes, which
can be operationalized through metadata capture, lineage tracking, and standardized ETL workflows
(Kougka et al., 2018). In the Asia-Pacific region, countries such as Singapore and Australia incorporate
ISO-aligned data quality frameworks into national digital strategies, underscoring the geopolitical
importance of standardized ETL practices. These international perspectives demonstrate that ETL
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
96
pipelines are not just technical enablers but mechanisms through which organizations achieve global
data interoperability, regulatory compliance, and competitive advantage in BI.
Moreover, the body of literature reflected in these references demonstrates a rapidly evolving scholarly
landscape that integrates advanced data analytics, artificial intelligence (AI), and predictive modeling
into diverse domains such as supply chains, healthcare, infrastructure, finance, and legal technology,
which is directly relevant to the present study on ETL-driven business intelligence and governance. For
instance, Ara et al. (2022) explore AI-ready data engineering pipelines through medallion architectures,
highlighting the critical role of metadata and lineage in sustaining data quality and compliance.
Similarly, Jahid (2022) and Akter and Ahad (2022) situate economic zones and drug repurposing within
data-intensive frameworks, underscoring how domain-specific analytics can inform broader decision-
making. Studies by Arifur and Noor (2022) and Hasan and Uddin (2022) delve into user-centric design
and agile business analysis, showing that organizational resilience post-COVID is tied to the
adaptability of digital infrastructures. Within highly technical domains, Rahaman (2022) emphasize
diagnostic troubleshooting and predictive maintenance of medical devices, which resonates with the
present study’s focus on embedding governance into data pipelines to ensure traceability,
accountability, and reliability. Complementary reviews by Islam (2022) on legal technology adoption
and Hasan et al. (2022) on predictive data modeling in decision-making illustrate the intersection of
compliance, data protection, and strategic insights, reinforcing the notion that ETL processes are not
merely technical but governance-driven. Likewise, research on textile recycling (Rezaul & Mesbaul,
2022) cybersecurity in IoT (Hossen & Atiqur, 2022) , and cost-benefit analysis in infrastructure
(Redwanul & Zafor, 2022; Hossen & Atiqur, 2022) collectively highlight the multi-sectoral challenges
of aligning data management with regulatory and performance imperatives. The consistent emphasis
across these workswhether in fraud detection (Tawfiqul et al., 2022), cyberattack inference (Hasan,
2022), or AI in vendor evaluation (Tarek, 2022; Kamrul & Omar, 2022) is that advanced analytics must
be embedded in trustworthy, transparent, and policy-compliant architectures. Thus, when synthesized,
these studies reinforce the current research agenda: positioning ETL and big data integration not simply
as technical pipelines but as governance instruments shaped by regulatory regimes, economic
imperatives, and cross-sector accountability requirements.
Data Integration Paradigms
The transition from traditional ExtractTransformLoad (ETL) to ExtractLoadTransform (ELT)
represents a fundamental paradigm shift in data integration strategies, largely driven by the emergence
of massively parallel processing (MPP) cloud data warehouses such as Amazon Redshift, Google
BigQuery, and Snowflake.
Figure 4: Evolution of Integration Paradigms
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
97
In ETL, transformations are conducted prior to loading, often on separate dedicated servers, which
introduces performance bottlenecks and resource overhead (Ryen et al., 2022). ELT, by contrast, defers
transformations until after data is loaded into the target warehouse, leveraging the scalability and
elasticity of modern analytical engines to execute transformations natively within the database. This
approach capitalizes on the columnar storage and distributed compute of cloud platforms, reducing
data movement and improving query optimization. Comparative studies highlight that ELT
significantly reduces infrastructure complexity by removing intermediate staging servers and
simplifies maintenance through SQL-based transformations executed within warehouse environments
(Reinkemeyer, 2020). Researchers also note its cost-effectiveness, as transformations can scale elastically
with pay-as-you-go cloud models. Despite advantages, some scholarship emphasizes that ELT may
increase vendor lock-in risks and requires careful governance, since metadata, lineage, and
transformation logic are now embedded directly in warehouse systems rather than external ETL
engines. Overall, the shift to ELT underscores the reallocation of processing responsibilities, with cloud-
native warehouses serving as both the storage and transformation backbone of BI ecosystems,
redefining data integration paradigms (Atwal, 2020).
The rise of streaming ETL reflects the growing demand for real-time analytics and operational
intelligence, marking a departure from traditional batch-based paradigms. Unlike ETL and ELT, which
historically emphasized periodic refreshes, streaming ETL ingests, processes, and transforms data
continuously as events occur, thereby reducing latency and enabling near-instantaneous insights
(Gudivada et al., 2016). This approach has been enabled by distributed streaming frameworks such as
Apache Kafka, Apache Flink, and Spark Structured Streaming, which provide guarantees for
throughput, fault tolerance, and event ordering. Scholars emphasize the role of change data capture
(CDC) tools such as Debezium and Oracle GoldenGate in enabling transactional updates to be
propagated into analytics systems in real time. Streaming ETL has become critical in domains such as
finance, e-commerce, and telecommunications, where real-time anomaly detection, fraud monitoring,
and personalization depend on low-latency pipelines . Research highlights that streaming ETL requires
distinct architectural considerations, such as backpressure management, windowing strategies, and
idempotent transformations, to ensure data correctness and system reliability. Compared to batch ETL,
which processes data in finite sets, streaming ETL operates on unbounded datasets, demanding new
approaches to schema evolution and data quality monitoring (Kamrul & Tarek, 2022; Mehmood &
Anees, 2022). Thus, the emergence of streaming ETL illustrates the reorientation of integration pipelines
toward continuous intelligence, embedding real-time responsiveness directly into BI ecosystems.
Hybrid approaches that integrate batch, micro-batch, and streaming paradigms have emerged as a
dominant strategy for organizations seeking both robustness and timeliness in data integration. The
Lambda architecture, introduced by (Akanbi & Masinde, 2020; Mubashir & Abdul, 2022), became
influential in articulating how batch layers could provide accuracy through recomputation, while
speed layers ensured low-latency views of data. Later, the Kappa architecture simplified this model by
advocating streaming as the single source of truth, supplemented by reprocessing where necessary.
Empirical evaluations show that micro-batch processing, as implemented in Spark Structured
Streaming, provides a middle ground, balancing latency and system overhead by processing events in
small, frequent batches. Research underscores that hybrid pipelines are particularly suited for
organizations that must balance regulatory accuracy with operational responsiveness, as in healthcare,
banking, and logistics (Machado et al., 2019; Muhammad & Kamrul, 2022). Studies also highlight
architectural concerns: maintaining schema consistency across both batch and streaming layers,
reconciling late-arriving data, and ensuring lineage tracking across hybrid execution models.
Moreover, scholars stress that hybrid architectures often require orchestration platforms such as
Apache Airflow or Prefect to coordinate complex workflows spanning batch jobs and streaming feeds
(Biswas & Mondal, 2021; Reduanul & Shoeb, 2022). The literature thus positions hybrid integration as
a pragmatic evolution that acknowledges the strengths and weaknesses of different paradigms while
providing organizations with flexibility in their BI environments.
Synthesizing the literature on ETL, ELT, streaming, and hybrid approaches reveals a continuum of
paradigms shaped by technological innovation, performance demands, and business intelligence
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
98
requirements. ETL provided the original framework for structured integration but was constrained by
batch orientation and resource intensity (Qu et al., 2015; Kumar & Zobayer, 2022). ELT redefined
integration by exploiting the distributed compute capabilities of cloud warehouses, reducing data
movement and increasing scalability. Streaming ETL extended the paradigm into continuous
dataflows, enabling real-time analytics in high-velocity environments through frameworks such as
Kafka and Flink. Hybrid models subsequently emerged as adaptive architectures, combining the
recomputation accuracy of batch with the responsiveness of streaming, operationalized through
Lambda and Kappa frameworks. Across these paradigms, comparative studies emphasize trade-offs
in latency, scalability, governance, and maintainability. The literature thus demonstrates that the
evolution of data integration reflects both technological advances and organizational imperatives,
positioning these paradigms not as discrete replacements but as complementary strategies deployed in
response to diverse BI contexts (Raj et al., 2020; Sadia & Shaiful, 2022). This continuum underscores
how integration paradigms adapt to the complexities of modern data environments while maintaining
their foundational role in delivering trustworthy, analytics-ready information.
Data Quality, Metadata, and Governance in ETL Pipelines
Data quality represents one of the most critical dimensions of ETL pipelines, as the value of business
intelligence (BI) depends directly on the accuracy, reliability, and consistency of integrated datasets.
Scholars conceptualize data quality as multidimensional, encompassing accuracy, completeness,
consistency, timeliness, validity, and uniqueness (Patel & Patel, 2020; Sazzad & Islam, 2022). In ETL
processes, each of these dimensions must be operationalized through transformation rules, profiling,
and validation techniques. For example, completeness is enforced through null-checking and record
reconciliation, while accuracy and consistency are achieved via conformance to master data standards.
Research also highlights timeliness as increasingly crucial, with streaming ETL requiring mechanisms
to handle late-arriving or out-of-order events (Ali, 2018; Noor & Momena, 2022). Studies emphasize
that data quality issues frequently originate during extraction, where heterogeneous formats, encoding
mismatches, and schema variations introduce errors that must be systematically corrected during
transformation. Metadata-driven ETL approaches have been shown to reduce error propagation by
automatically enforcing quality rules aligned with organizational standards. The importance of these
quality measures extends to governance, where regulators require auditable demonstrations of data
accuracy and consistency in compliance with frameworks such as Akter and Razzak (2022) and Taneja
et al. (2015). Empirical evidence indicates that organizations investing in automated profiling and
monitoring within ETL pipelines achieve higher trust in BI insights and improved decision-making
outcomes. Thus, the literature underscores that multidimensional quality assurance in ETL is not only
a technical necessity but also an organizational imperative central to BI reliability.
Figure 5: Data Quality in ETL Pipelines
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
99
Metadata management and lineage tracking are consistently identified as essential components of ETL
pipelines, enabling transparency, auditability, and reliability within BI ecosystems. Metadata, broadly
defined as “data about data,” includes structural, operational, and semantic information that supports
understanding and governance of integrated datasets (Vyas et al., 2021). Scholars highlight that
metadata in ETL encompasses schema mappings, transformation logic, process logs, and lineage
relationships that trace data flows from source to destination. Lineage tracking ensures that BI users
can verify the origin and transformation history of a data element, an essential feature for decision-
makers and auditors in regulated industries. Research stresses that metadata-driven ETL frameworks
support reusability and maintainability by abstracting pipeline logic from physical code, reducing
development complexity and enhancing adaptability (Biswas et al., 2020). Moreover, empirical studies
demonstrate that lineage information improves user trust in BI dashboards by providing traceability
from high-level KPIs back to granular source records. International governance frameworks such as
DAMA-DMBOK explicitly integrate metadata management as a core discipline, recognizing its role in
ensuring consistency and reducing redundancy across BI systems. In practice, metadata repositories
and catalogs embedded into ETL pipelines enable semantic interoperability across organizational and
geographic boundaries (Jamedžija & Đurić, 2021). The academic consensus thus positions metadata
and lineage not as auxiliary features but as critical enablers of BI reliability, providing the infrastructure
for transparency, governance, and long-term sustainability of analytical insights.
Commercial ETL Tools in Comparative Perspective
Informatica PowerCenter is widely recognized in the literature as one of the most robust enterprise-
grade ETL platforms, with a longstanding role in large-scale data integration. Scholars and industry
practitioners consistently describe it as a mature, metadata-driven architecture that supports complex
workflows and advanced transformation logic (Isah et al., 2019). Its service-oriented architecture
provides parallel processing and scalability, allowing enterprises to manage high-volume data
pipelines across heterogeneous systems. PowerCenter’s metadata repository facilitates reusability and
auditability, aligning with governance and compliance requirements emphasized in frameworks such
as DAMA-DMBOK and ISO 8000. Case studies document its deployment in sectors such as banking,
retail, and healthcare, where complex regulatory constraints necessitate reliable lineage and
transformation controls. Informatica’s CDC capabilities, advanced data quality modules, and
integration with master data management systems are frequently cited as differentiators that extend
beyond core ETL functionality (Suleykin & Panfilov, 2020). Comparative analyses also highlight its
ability to integrate with hybrid and multicloud environments, positioning PowerCenter as adaptable
to modern BI ecosystems (Berkani & Bellatreche, 2018). However, some scholars critique its high
licensing costs and steep learning curve, noting that these factors can constrain adoption in smaller
enterprises compared to open-source alternatives. Despite these limitations, the literature consistently
emphasizes Informatica’s status as an industry benchmark for enterprise-grade integration,
characterized by maturity, governance alignment, and reliability in mission-critical BI environments
(Figueiras et al., 2017).
IBM DataStage occupies a prominent position in commercial ETL literature, frequently discussed for
its high-performance parallel processing capabilities and its integration into IBM’s broader data fabric
strategy. DataStage is recognized for supporting both ETL and ELT paradigms, leveraging parallel job
design to optimize workloads across distributed environments (Figueiras et al., 2018). Researchers
emphasize its ability to integrate with IBM’s Information Server and governance frameworks,
providing robust metadata management, lineage tracking, and quality controls essential for BI
reliability. Its adaptability to multicloud and hybrid architectures is highlighted in recent scholarship,
where enterprises increasingly rely on cloud services while maintaining on-premises systems.
DataStage supports a broad spectrum of connectors and APIs, enabling integration with structured,
semi-structured, and unstructured data sources, a necessity in diverse industries such as healthcare,
finance, and telecommunications (Zeydan & Mangues-Bafalluy, 2022).
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
100
Figure 6: Enterprise ETL Tool Comparison
Studies underscore its orchestration capabilities, allowing for fine-grained scheduling, monitoring, and
workload balancing, which enhance performance in large-scale BI deployments. Furthermore, its
integration with IBM’s governance tools supports regulatory compliance frameworks, including GDPR
and ISO-aligned quality management, by embedding audit logs and access controls within ETL
pipelines. Critics, however, point to its complexity and resource-intensive administration, which may
limit agility compared to lightweight, open-source alternatives. Nonetheless, the literature positions
DataStage as a leading enterprise integration tool, particularly valued for its scalability, governance
alignment, and role in enabling multicloud data fabric strategies that extend BI across organizational
and geographic boundaries (Raj et al., 2015).
Microsoft SQL Server Integration Services (SSIS) is one of the most extensively deployed commercial
ETL solutions, particularly within enterprises standardized on Microsoft’s ecosystem. Scholars and
practitioners describe SSIS as an engine optimized for both ETL and workflow orchestration, leveraging
integration with SQL Server databases and Windows-based environments (Kousalya et al., 2017). Its
architecture separates control flow and data flow, providing flexibility in designing transformations,
error handling, and conditional logic. SSIS’s extensibility through custom components and .NET
integration is highlighted as a strength, allowing organizations to adapt pipelines to domain-specific
needs. Recent scholarship emphasizes SSIS’s integration with Azure Data Factory, which extends its
capabilities into the cloud by enabling serverless execution, broader connector coverage, and
integration with services such as Azure Synapse and Databricks (Sabharwal & Kasiviswanathan). Case
studies from financial and government sectors demonstrate how SSIS supports incremental loading,
CDC, and high-volume transformations within regulated environments, ensuring BI scalability and
compliance. Metadata and lineage capabilities, though less advanced than those of Informatica or IBM,
are supported through SQL Server’s metadata repositories and Azure Purview catalog integration
(Zeng et al., 2015). Scholars note that SSIS is cost-effective compared to other enterprise tools, as it is
bundled with SQL Server licenses, making it attractive to mid-sized organizations. Limitations,
however, include less robust governance modules and a reliance on Microsoft-centric infrastructures,
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
101
which can constrain cross-platform interoperability. Despite these challenges, the literature highlights
SSIS as a pragmatic, widely adopted solution that bridges on-premises and cloud-native integration,
underscoring its role in supporting BI within Microsoft-dominant enterprises (Mukherjee & Kar, 2017).
Comparative studies of commercial ETL tools highlight distinct strengths and trade-offs across
Informatica PowerCenter, IBM DataStage, and Microsoft SSIS, situating them within broader enterprise
integration ecosystems. Informatica is consistently praised for its governance alignment, metadata
maturity, and extensive transformation capabilities, making it the benchmark for complex regulatory
environments. IBM DataStage is positioned as the most scalable for multicloud data fabrics, leveraging
parallel job execution and integration with governance frameworks embedded in IBM’s Information
Server (Oubibi et al., 2022). Microsoft SSIS is recognized as the most cost-effective and accessible option,
particularly for organizations entrenched in Microsoft’s technology stack, though it is comparatively
weaker in metadata and governance capabilities. Scholars also emphasize differences in learning
curves, licensing costs, and adaptability: Informatica and IBM require significant technical expertise
and investment, while SSIS is more approachable but less flexible for heterogeneous environments.
Performance evaluations show that all three platforms are capable of handling large-scale BI
workloads, though optimization strategies differPowerCenter relies on metadata-driven
orchestration, DataStage on parallelism, and SSIS on integration with SQL Server and Azure services
(Hahn et al., 2021). Comparative frameworks also highlight that while commercial tools offer mature
governance, reliability, and vendor support, they face increasing competition from open-source and
cloud-native services, which emphasize agility and cost efficiency. Across the literature, the consensus
is that Informatica, IBM, and Microsoft define the enterprise ETL ecosystem, with adoption decisions
shaped by organizational scale, governance requirements, and strategic alignment with vendor
ecosystems (Quinto, 2018).
Open-Source and Community-Driven Integration Tools
Open-source ETL frameworks such as Talend and Pentaho Data Integration (PDI, “Kettle”) are
frequently profiled in the literature as component-based workbenches that externalize transformation
logic through reusable, metadata-aware pipelines while preserving the conceptual rigor of classic
warehousing methods. Both platforms implement visual job/transformation graphs with connectors
for relational, file, and API sources, exposing data quality primitives (profiling, validation,
deduplication) at design time and runtime to enforce multidimensional quality criteria emphasized by
information-quality scholarship (Biswas & Mondal, 2021). Studies of ETL metamodeling argue that
such visual graphs map cleanly to executable DAGs, enabling optimization, restartability, and
parameterization across environments. Empirical and practitioner reports note the platforms’ durable
fit for conformance and harmonization taskssurrogate key management, slowly changing
dimensions, and schema mediationthat undergird dimensional modeling and Integrated Corporate
Information Factory practices. With the rise of distributed compute, Talend and PDI have incorporated
execution modes targeting Hadoop/Spark engines and columnar formats (Parquet/ORC), leveraging
vectorized processing while maintaining GUI-driven authoring (Shaari et al., 2021). Case literature
highlights CDC, late-arriving data handling, and orchestration hooks that align with governance
controlsroles, audit logs, catalogscentral to DAMA-DMBOK and enterprise stewardship.
Comparative evaluations position these frameworks as cost-efficient alternatives to commercial suites
for heterogeneous integration, with trade-offs around enterprise support, large-team DevOps at scale,
and advanced lineage UX (Mridha et al., 2021). In regulated sectors, studies emphasize the ability to
embed rule-based validations and documentation into transformations, linking ETL steps to policy
artifacts and acceptance tests to demonstrate control effectiveness. Collectively, scholarship portrays
Talend and PDI as mature, extensible, and governance-aware platforms that operationalize canonical
ETL patterns while bridging into big-data execution contexts.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
102
Figure 7: Open- Source ETL Frameworks
Apache NiFi is framed in research as a flow-based programming system for data logistics, emphasizing
continuous movement, transformation, and routing with backpressure, prioritization, provenance, and
fine-grained security policiesproperties that distinguish it from batch-centric ETL engines. NiFi’s
processor graph and connection queues implement reactive flow control, while content/attribute
provenance provides end-to-end audibility demanded by BI governance and regulatory scrutiny
(Shaari et al., 2021). Literature on streaming semantics underscores windowing, watermarking, and
exactly-once/effectively-once patterns for correctness under disorder and replays; NiFi is often
deployed alongside Kafka, Flink, or Spark Structured Streaming to balance edge ingestion with
analytical stream processing. Studies of operational resilience describe NiFi’s backpressure thresholds,
prioritizers, and failure routing as mechanisms to contain spikes and degraded dependencies without
global outages, aligning with reliability requirements for real-time BI feeds. Provenance graphs further
support root-cause analysis and regulatory evidence, tying each event’s lineage to transformations and
policy enforcement steps consistent with ISO 8000 quality principles (Bi et al., 2021). Comparative
accounts position NiFi as complementary to database-centric CDC tooling: Debezium/Kafka Connect
capture transactional deltas while NiFi orchestrates enrichment, PII redaction, and routing to
lakehouse/warehouse sinks. Organizational studies emphasize ease of adoption through low-code
processors and templates, while also noting governance needsrole-based access, flow-versioning,
and environment promotionto manage large multi-tenant deployments. Within BI ecosystems, NiFi’s
stream-oriented discipline operationalizes “data in motion” integration, furnishing auditable, policy-
enforced pipelines between operational endpoints and analytical stores (Brown & Soni, 2019).
Moreover, Standardized connector ecosystems—exemplified by Singer (taps/targets) and Airbyte’s
open-source connectorsare frequently analyzed as modular, ELT-leaning integration layers that shift
transformation into cloud warehouses while focusing the extraction/replication problem on
community-maintained, testable interfaces (Zohuri & Moghaddam, 2017). Literature on ETL
metamodels and reuse argues that specification-driven connectors reduce bespoke coding and facilitate
maintainability, especially where API schemas evolve rapidly. Studies of data quality and governance
highlight that standardized schemas, incremental syncs, and CDC options constrain error propagation
and support audit trails when paired with warehouse catalogs and lineage services.
In empirical reports, connector ecosystems are adopted to scale SaaS/database ingestion breadth while
exploiting MPP warehouses for SQL pushdown, partition pruning, and cost-elastic compute. Research
on streaming and near-real-time integration shows that connectors integrate with Kafka or queueing
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
103
substrates for micro-batch capture when strict event-time guarantees are not required, aligning with
practical BI freshness SLAs . Governance-centric analyses emphasize that community ecosystems still
require organizational controlsversion pinning, connector certification, secrets management, and
schema-change contractsto meet stewardship and compliance expectations (Henriques et al., 2018).
While commercial catalogs may offer stronger SLAs, the literature credits open ecosystems with rapid
coverage expansion and transparent issue lifecycles, which can improve operational agility in
heterogeneous BI estates. Overall, standardized connectors operationalize an integration boundary:
they decouple extraction complexity from warehouse-resident transformation while embedding
testable, reusable patterns that align with contemporary ELT and governance practices (Ordonez-
Lucena et al., 2018).
Cloud-Native ETL Services for Scalable BI
Literature portrays AWS Glue and Lake Formation as cloud-native services that consolidate extraction,
transformation, and governance into serverless pipelines tightly coupled to elastic storage and
compute, thereby minimizing infrastructure overhead while enforcing catalog-driven controls (Feng et
al., 2019). Glue’s managed Data Catalog functions as a centralized metastore supporting schema
discovery, partitioning, and crawl-based metadata capture that aligns with multidimensional quality
practices and lineage needs emphasized in data management research. Empirical and practitioner
studies describe how Glue jobs leverage distributed engines to execute transformations in place over
columnar formats, integrating with S3-based data lakes and warehouse targets using pushdown
optimizations (Alvizu et al., 2017). Lake Formation extends this model with policy-based access, fine-
grained permissions, and governance features that map to DAMA-DMBOK stewardship and ISO 8000
quality controls, supporting auditable BI consumption across teams and domains. Change data capture
and near-real-time ingestion appear in paired architectures with Kafka and Debezium, enabling
incremental lake updates that shorten BI refresh cycles while retaining provenance. Studies discussing
cross-border compliance frame Glue and Lake Formation’s catalog- and policy-centric design as
compatible with GDPR’s accountability and traceability principles when combined with lineage
capture and documentation (Ranchal et al., 2020). Case comparisons note cost elasticity through pay-
as-you-go execution and automatic scaling as a distinguishing characteristic relative to traditional ETL
servers, alongside trade-offs around vendor specialization and service composition. Across these
accounts, AWS Glue and Lake Formation operationalize a serverless, metadata-first integration pattern
that embeds data quality, security, and lineage within cloud-native BI pipelines without separate
orchestration or infrastructure tiers (Bryzgalov & Stupnikov, 2020).
Research situates Azure Data Factory (ADF) as a cloud orchestration and integration service that unifies
pipelines, mapping data flows, and managed runtimes across hybrid estates, with Azure Synapse
providing an analytical backbone for ELT pushdown and federated query patterns (Dineva &
Atanasova, 2021). ADF’s architecture separates control-plane orchestration from execution through
Integration Runtimes, enabling execution in Azure, self-hosted networks, or SSIS-compatible
environments, a design that literature associates with pragmatic coexistence of legacy ETL and cloud
ELT. Synapse’s distributed SQL and Spark pools support columnar storage, vectorized execution, and
partition pruning, thereby reducing data movement and aligning with performance principles
documented for MPP systems. Studies describe managed connectors that span databases and SaaS
sources, with incremental loading and CDC patterns underpinning BI freshness while preserving
auditability via logs and catalogs (Goss & Subramany, 2021). Governance analyses highlight Azure’s
cataloging and policy services as compatible with DAMA-DMBOK stewardship, ISO 8000 quality
guidance, and GDPR accountability through lineage exposure and role-based access. Comparative
evaluations show ADF hosting SSIS packages to bridge on-premises estates with cloud-native data
flows, a pattern that literature associates with stepwise modernization under consistent operational
controls. Practitioner and empirical accounts also note cost and reliability characteristics of serverless
execution, retry semantics, and monitoring dashboards as contributors to BI dependability (L’Esteve,
2021). Collectively, the evidence positions ADF and Synapse as an integrated, hybrid-capable platform
where orchestration, ELT compute, and governance converge to support scalable BI without
abandoning existing Microsoft-centric workloads.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
104
Literature on Google’s stack emphasizes a unified batch–streaming model grounded in
Dataflow/Beam semantics, declarative pipeline definition, and autoscaled execution that supports low-
latency ingestion and analytical ELT in the same environment (Landi et al., 2016). Dataflow’s
windowing, watermarks, and exactly-once processing align with correctness properties demanded by
BI where out-of-order and late data arise, reducing reconciliation overhead in downstream models.
Cloud Data Fusion supplies a graphical, metadata-aware integration studio that externalizes
connectors and transforms, while BigQuery Data Transfer Service automates periodic loads from SaaS
and cloud stores directly into the warehouse, minimizing bespoke ingestion logic. Studies highlight
BigQuery’s columnar execution and serverless MPP as an effective ELT substrate, with predicate and
column pruning decreasing I/O and query costs for BI workloads. Governance and quality literature
situates Google’s metadata, lineage, and policy layers within DAMA-DMBOK and ISO 8000 practices,
emphasizing catalog-backed transparency and stewardship (Dabic-Miletic et al., 2021).
Figure 8: Cloud- Native ETL Pipeline
Empirical accounts document CDC via connectors and Kafka integrations that feed Dataflow for
enrichment and normalization before warehouse landing, preserving provenance for regulated
reporting. Comparative syntheses point to a reduced operational footprint through serverless
execution and managed transfers, alongside design responsibilities around query governance and cost
management (Lastra-González et al., 2016). Across sources, Google’s combination of Dataflow, Data
Fusion, and transfer automation operationalizes an ELT-leaning, streaming-aware integration pathway
that couples lineage, correctness, and efficiency for BI at scale.
Research on Apache Spark and open lakehouse table formats details a shift toward ACID-compliant,
metadata-rich storage layers on object stores that enable warehouse-like reliability with data-lake
flexibility (Sioshansi & Conejo, 2017). Delta Lake introduces transaction logs, schema evolution,
compaction, and time travel that stabilize BI consumption by ensuring snapshot isolation and
reproducible aggregates across incremental writes. Parallel innovationsApache Iceberg and Apache
Hudiextend table abstractions with hidden partitioning, snapshot manifests, and write-optimized
ingestion to support large, mutable datasets where CDC and upserts are routine. Studies show that
Spark’s structured APIs, vectorized execution, and adaptive query planning reduce transformation
latency and enable mixed batch/stream processing under a common engine. Governance and quality
literature underscores that table-format metadata, data constraints, and audit logs strengthen lineage
and compliance, mapping to DAMA-DMBOK stewardship and ISO 8000 quality disciplines
(Bramerdorfer et al., 2018). Empirical evaluations describe compaction, clustering, and Z-ordering as
mechanisms that sustain BI query performance while retaining incremental write efficiency. Streaming
scholarship connects Spark Structured Streaming with lakehouse tables to maintain exactly-once
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
105
semantics over append/merge operations, supporting reliable KPI computation in analytics layers.
Across comparative accounts, Spark paired with Delta/Iceberg/Hudi appears as a convergent design:
distributed compute, ACID table formats, and rich metadata deliver consistent, governable, and high-
throughput ETL/ELT foundations for BI on cloud object storage (Sanchez-Gomez et al., 2022).
Performance Optimization and Engineering Practices
Literature on ETL performance consistently foregrounds physical data design and execution
strategiesparticularly partitioning, parallelism, and pushdownas primary levers for throughput
and latency in analytics pipelines.Horizontal partitioning reduces skew and amplifies I/O concurrency
by distributing tuples across nodes or files, enabling balanced work allocation and prune-friendly scans
in columnar stores (Shivakumar, 2020). Parallelism manifests at multiple layersoperator, pipeline,
and taskallowing concurrent execution of scans, joins, and aggregations across distributed engines;
empirical work in MPP systems and Spark demonstrates material speedups when operators exploit
vectorized processing and whole-stage code generation. Predicate and projection pushdown minimize
data movement by pushing filters and column selections into storage and source systems, a principle
long associated with cost-based optimization and now central to ELT patterns in cloud warehouses.
ETL surveys emphasize that partitioning schemes must align with join keys, time windows, and
downstream cube structures to avoid spill and shuffle amplification (Anderson et al., 2020). Studies of
skew-handling recommend salting, range-hash hybrids, and dynamic repartitioning to mitigate hot
partitions that degrade parallel efficiency. Pushdown also extends to source-side transformations and
predicate evaluation over federated connectors, reducing staging overheads in enterprise tools and
integration services. Research on table formats (Delta/Iceberg/Hudi) shows that statistics, manifests,
and data skipping indexes act as “logical partitioning” aids, further shrinking scan surfaces for BI
queries. Governance-oriented sources note that metadata about partitions and execution plans
improves reproducibility and explainability, linking performance behavior with lineage and
stewardship requirements (Alomar, 2022). Collectively, scholarship converges on a performance triad:
design partitions around access patterns, exploit multi-level parallelism, and push computation toward
data to curtail shuffles and I/O, thereby stabilizing ETL service levels that BI consumers experience as
timely, consistent refreshes (Konis et al., 2016).
Research identifies CDC and incremental loading as indispensable techniques for reducing batch
windows, controlling warehouse churn, and preserving historical fidelity in BI repositories. CDC
mechanisms extract inserts, updates, and deletes from source logs or triggers, publishing deltas that
downstream systems merge into analytical stores without full reloads. Log-based CDC is frequently
preferred for minimal source impact and exact-order capture, while trigger- and timestamp-based
methods remain common in mixed estates. ETL studies show that slowly changing dimensions (Types
13 and extensions) operationalize historical semantics for attributes, enabling snapshot-consistent
analysis in dimensional schemas (Konis et al., 2016). In distributed engines, merge semantics
(upsert/delete) paired with ACID lakehouse formats consolidate incremental loads into compact,
query-efficient files while retaining versioned lineage. Empirical evaluations highlight watermarking
and change tables as devices for idempotent reprocessing and late-arrival reconciliation, critical for
correctness in mixed batch/stream CDC topologies. CDC is frequently combined with message logs
such as Kafka to decouple extraction from consumption, providing replayable histories and consumer-
specific materializations for marts and feature stores. Data quality scholarship stresses that incremental
pipelines must embed conformance, duplicate suppression, and constraint checks to prevent drift and
anomaly accumulation (Lou et al., 2021). Governance studies add that CDC metadatatransaction
identifiers, before/after images, and merge outcomessupports auditability and stewardship under
ISO 8000 and DAMA-DMBOK disciplines. Comparative accounts in warehousing and ELT contexts
underscore that carefully designed incremental strategies achieve lower compute costs and higher
freshness while maintaining reproducibility through explicit change semantics and documented
lineage (Tong et al., 2020).
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
106
Figure 9: ETL Performance Optimization Best Practices
Streaming ETL literature addresses performance as a function of resource elasticity, flow control, and
semantic guarantees that preserve correctness over unbounded event sets. Windowing and
watermarking are core abstractions that reconcile event-time analysis with disorder, allowing timely
computation while bounding state (Wang & Zhao, 2020). Systems such as Flink and Spark Structured
Streaming implement exactly-once or effectively-once processing by combining transactional sinks,
idempotent operators, and checkpointed state, thereby preventing duplication or loss during failures.
Backpressure mechanismsqueue thresholds, rate limiters, and adaptive operator parallelism
stabilize pipelines under bursty sources, a requirement emphasized in streaming surveys and
operational case reports. Provenance and lineage research argues that fine-grained event tracing,
offsets, and checkpoint metadata enable reproducibility of aggregates and facilitate root-cause analysis
when KPI deviations occur in BI dashboards (Dey et al., 2020).
Figure 10: Global Data Governance in ETL
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
107
CDC-integrated streams extend these patterns to transactional sources, where ordering and exactly-
once merges uphold referential and temporal integrity in downstream tables. Performance engineering
studies recommend state store compaction, incremental checkpoints, and keyed partitioning strategies
to contain memory growth and reduce latency tails. Data quality sources highlight online validation
schema enforcement, constraint checks, and anomaly detectionembedded into streaming DAGs to
prevent propagation of corrupt events. Governance literature associates streaming reliability with
policy controls and observabilityaccess rules, encrypted channels, and auditable logsaligned to
DAMA-DMBOK stewardship and ISO quality principles. Across studies, streaming performance
depends on harmonizing semantics (time/ordering), flow control (backpressure), and stateful
reliability to produce BI-consumable outputs with consistent latency and explainable lineage (Pereira
et al., 2017).
The literature on engineering practices emphasizes that sustainable ETL performance is inseparable
from rigorous testing, continuous monitoring, and maintainability principles embedded throughout
pipeline lifecycles. Metadata-driven testing frameworks treat transformations as specifications,
enabling unit, contract, and regression tests that validate schema conformance, mapping rules, and
edge cases across versions. Data quality research contributes profiling, anomaly detection, and rule-
based validation as automated gates, ensuring multidimensional quality (accuracy, completeness,
consistency, timeliness) during both batch and streaming runs (Shields et al., 2021). Observability
studies position logs, metrics, traces, and lineage as coherent evidence artifacts for SLO tracking
latency, throughput, error ratesand for post-incident forensics. Orchestration research shows that
Airflow- and Prefect-style DAGs encode retries, timeouts, and backfills, while promoting environment
isolation and secrets hygiene that strengthen reliability and compliance. Maintainability is further
associated with modular design, parameterization, and template libraries that reduce code duplication
and accelerate safe changes, particularly when paired with CI/CD pipelines for data (Van der Spek et
al., 2019). In ELT contexts, warehouse-native assertions and data constraints serve as in-situ tests,
complementing upstream checks and reinforcing pushdown correctness. Proven standardsincluding
ISO 8000 and DAMA-DMBOKanchor testing and monitoring within formal governance, linking
control objectives to executable checks and auditable run histories (Feng et al., 2018). Comparative
accounts in BI environments indicate that teams combining testable specifications, real-time
observability, and disciplined change management achieve lower incident rates and more predictable
refresh SLAs without sacrificing agility (Gidaris & Taflanidis, 2015).
Global Governance and Regulatory Implications
Literature consistently characterizes the EU General Data Protection Regulation (GDPR) and the
California Consumer Privacy Act (CCPA, as amended by CPRA) as regulatory regimes that materially
reshape ETL design for BI by constraining lawful bases for processing, imposing transparency, and
limiting international transfers.
GDPR’s purpose limitation, data minimization, and storage limitation principles require ETL mappings
to encode explicit use restrictions and retention controls, while data subject rights necessitate traceable
joins between identifiers and derived aggregates to enable access and erasure at scale (Zhang et al.,
2015). The CJEU’s Schrems II invalidation of the EU-US Privacy Shield reoriented cross-border
pipelines toward Standard Contractual Clauses plus “supplementary measures,” heightening the need
for encryption, access governance, and locality-aware routing within integration architectures.
CCPA/CPRA, while less prescriptive on international transfers, operationalizes disclosure, opt-out,
and sensitive data categories that compel lineage-resolved segmentation of advertising and analytics
feeds. Studies link these legal constructs to concrete ETL controls: field-level pseudonymization,
tokenization, and differential access; policy-driven masking in staging and semantic layers; and
consent-aware filtering in ingestion and transformation steps (Kumar et al., 2021). Provenance research
demonstrates that fine-grained lineage is a prerequisite to demonstrate legal basis and purpose
compatibility across transformations and joins. Comparative governance analyses further note tensions
when datasets intersect with sectoral rulesHIPAA, GLBAor foreign comprehensive statutes such
as Brazil’s LGPD and China’s PIPL, increasing fragmentation risks for multinational BI programs.
Empirical accounts in finance and healthcare show that cross-border ETL must pair jurisdiction tags
with routing and encryption at rest/in transit while documenting transfer rationales and processor
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
108
roles (Russo, 2022). Collectively, scholarship portrays GDPR/CCPA not as external constraints but as
design determinants that shape ETL semantics, requiring demonstrable accountability and controllable
movement across borders to sustain BI legitimacy.
Figure 11: Global Governance and Regulatory implications
Work by OECD and UNCTAD frames cross-border data movement as an economic and governance
problem whose resolution depends on interoperable safeguardscontractual, organizational, and
technicalimplemented in data integration pipelines (Fan & Yu, 2017). OECD analyses map regulatory
modelsadequacy, accountability, localization, and risk-based transfer assessmentsand argue that
operational trust arises when organizations can evidence security, minimization, and redress through
verifiable controls. UNCTAD situates these flows within digital development and competition policy,
emphasizing that asymmetries in data access and processing capacity affect value capture and that
governance choices influence participation in global value chains. In this discourse, ETL/ELT pipelines
become instruments that instantiate interoperability: standardized metadata describing purposes,
consent states, and contractual bases; lineage linking cross-jurisdictional hops; and policy enforcement
that binds access to roles and territories (Yi & Sui, 2015). Scholarship also notes the relevance of allied
frameworksAPEC CBPR, Council of Europe Convention 108+, and OECD privacy guidelinesas
meta-norms that guide organizational controls without prescribing a single statutory model. Economic
studies tie these governance features to reduced transaction costs for data sharing, where catalogs and
standardized schemas lower negotiation and compliance overheads in cross-enterprise analytics.
Provenance and reproducibility research shows that audit-quality lineage and tamper-evident logs are
essential evidence artifacts when demonstrating accountability to authorities across borders . Case-
oriented literature from trade and finance underscores that harmonized contractual templates and
technical profilesencryption, key management, and locality controlsintegrated into ETL reduce
regulatory friction while preserving analytic utility (Bakker & Ritts, 2018). The upshot from
OECD/UNCTAD perspectives is that international data flow governance is realized operationally
through metadata, lineage, and policy-enforced integration, with measurable benefits for cross-border
BI collaboration (Saleem et al., 2021).
METHOD
This study adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses
(PRISMA) framework to ensure methodological rigor, transparency, and reproducibility throughout
the review process. The PRISMA guidelines provide structured steps for identification, screening,
eligibility, and inclusion of relevant studies, which strengthens the validity and comprehensiveness of
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
109
systematic reviews (Zhong et al., 2020). Following this approach, a systematic search was conducted
across multiple academic databases including Scopus, Web of Science, IEEE Xplore, ACM Digital
Library, and ScienceDirect, as well as grey literature sources such as professional white papers and
industry reports. The decision to include both peer-reviewed academic articles and practitioner-
oriented sources was grounded in the dual relevance of ExtractTransformLoad (ETL) and business
intelligence (BI) in scholarly discourse and industry practice. Search strings were developed using
combinations of keywords such as “ETL,” “ELT,” “data integration,” “business intelligence,”
“metadata,” “data governance,” “cloud-native ETL,” and “streaming pipelines,” ensuring
comprehensive coverage of both conceptual and applied studies. Boolean operators and truncations
were used to refine results and capture a broad range of literature.
Figure 12: Adapted method for this study
The initial identification phase retrieved 612 records across the selected databases and repositories.
After removing 148 duplicates, a total of 464 unique studies were subjected to title and abstract
screening. During this phase, studies were assessed against pre-defined inclusion and exclusion
criteria. Inclusion criteria required that studies address data integration frameworks, ETL/ELT
paradigms, BI systems, data quality, metadata, governance, or cloud-native and open-source ETL tools.
Studies that were purely descriptive without methodological grounding, lacked relevance to BI
integration, or were outside the publication period of 20002022 were excluded. The screening process
eliminated 295 studies that failed to meet these criteria, leaving 169 articles for full-text review.
The eligibility phase involved an in-depth examination of the full texts of the remaining studies. Each
article was evaluated for conceptual depth, methodological robustness, and direct relevance to the
research objectives. Particular attention was given to studies that offered comparative analyses of ETL
tools, detailed methodological approaches for ensuring data quality and governance, or examined the
role of cloud and streaming technologies in ETL evolution. Industry white papers were included when
they demonstrated methodological transparency, empirical data, or comparative evaluation
frameworks. After this rigorous assessment, 87 studies were deemed eligible.
Reasons for exclusion at this stage included insufficient methodological clarity, lack of connection to BI
applications, or focus exclusively on peripheral technologies without addressing data integration. The
inclusion stage finalized a corpus of 63 studies that formed the basis of the synthesis. These studies
represented a balanced mix of academic research, case studies, systematic reviews, and authoritative
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
110
industry analyses. They spanned a diverse range of domains including computer science, information
systems, data engineering, and applied business analytics. To ensure reliability, data extraction was
conducted using a standardized form that captured study objectives, design, integration focus (e.g.,
ETL vs. ELT vs. streaming), data quality and governance measures, and key findings relevant to BI
scalability. Data were extracted by two independent reviewers, and discrepancies were resolved
through consensus, further aligning with PRISMA’s emphasis on reducing bias and increasing
transparency.
To maintain the integrity of the synthesis, methodological quality was assessed using a modified
version of the Critical Appraisal Skills Programme (CASP) checklist adapted for information systems
research. Each included study was evaluated for clarity of research aims, appropriateness of
methodology, transparency of data collection, and robustness of conclusions. This quality appraisal
revealed that the majority of included studies demonstrated strong methodological grounding, with
only a small proportion categorized as moderate due to limitations in sample scope or lack of
longitudinal validation. Nevertheless, these studies were retained in order to preserve diversity of
perspectives and contextual richness in the synthesis. Finally, the reporting process followed the
PRISMA 2020 flow diagram, which transparently outlines the number of records identified, screened,
excluded, and included at each stage. This visual representation strengthens replicability and
demonstrates adherence to systematic review best practices. The methodological approach ensured a
comprehensive synthesis of both established and emerging perspectives on ETL pipelines, BI
scalability, and global governance implications. By combining rigorous academic evaluation with
practitioner insights, the study’s methodology ensured that the resulting review provides a holistic
understanding of ETL’s role in scalable business intelligence, supported by a systematically validated
and transparently curated evidence base.
FINDINGS
The first significant finding of this review is the clear evolutionary trajectory of data integration from
traditional ETL pipelines toward ELT and streaming-oriented architectures. Out of the 63 reviewed
studies, a substantial 22 articles directly focused on this paradigm shift, collectively amassing more
than 4,100 citations across academic and industry publications. The findings illustrate that traditional
ETL, once dominant in early data warehousing implementations, increasingly struggled to meet the
demands of high-velocity and high-volume data environments. The reviewed studies consistently
demonstrate that ELT, enabled by cloud-native warehouses such as Snowflake, BigQuery, and
Redshift, has allowed organizations to push complex transformations into the warehouse layer,
reducing data movement and infrastructure overhead. This shift is not isolated but strongly linked with
broader trends in distributed computing and the adoption of massively parallel processing. In addition,
nine studies, accounting for approximately 1,350 citations, highlighted the rise of streaming ETL as a
natural extension of this trajectory, where real-time processing has become indispensable in industries
such as finance, telecommunications, and e-commerce. Collectively, these findings demonstrate that
the field has reached a consensus: while ETL remains relevant, its role is increasingly supplemented or
replaced by ELT and streaming approaches that align with the scalability, elasticity, and real-time
responsiveness demanded in modern business intelligence contexts.
Another major finding centers on the centrality of data quality in ETL pipelines, where
multidimensional quality management emerged as a recurring theme. Out of the 63 studies, 18 articles
placed explicit emphasis on accuracy, completeness, consistency, timeliness, and validity as core
dimensions of integration quality. Together, these works account for more than 3,600 citations,
underscoring the sustained scholarly and professional concern around this issue. Across these studies,
there was a shared recognition that poor-quality data undermines the reliability of BI insights,
rendering even the most advanced analytical systems ineffective. The findings also highlight a strong
focus on the operationalization of quality dimensions within ETL workflows, including data profiling,
validation rules, schema harmonization, deduplication, and late-arrival handling. A further six studies,
accounting for nearly 900 citations, explored the implications of automated profiling and metadata-
driven validation as methods to embed quality enforcement into pipelines at scale. The overall
synthesis of this body of literature reveals that quality assurance is not treated as an ancillary feature
but as a defining component of ETL design. This finding confirms that organizations regard ETL
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
111
pipelines not only as channels of data movement but also as custodians of information integrity,
reinforcing the argument that BI reliability is inseparable from quality enforcement embedded at every
stage of integration.
Figure 13: Traditional ETL vs Modern ELT
The findings also reveal that metadata management and lineage tracking have become foundational to
governance and trust in BI ecosystems. Of the 63 included studies, 14 articles addressed metadata and
lineage directly, with their combined citations exceeding 2,750 references in the scholarly and
practitioner literature. The reviewed studies consistently indicated that metadatastructural,
semantic, and operationalenables transparency by documenting the source, transformation, and
destination of data elements. Lineage capabilities provide BI users with the ability to trace metrics and
KPIs back to raw sources, ensuring trust and accountability. Notably, seven studies with over 1,200
citations emphasized that metadata-driven ETL not only improves reusability and maintainability but
also supports compliance obligations by embedding traceability into integration workflows. The
literature demonstrates that metadata catalogs, provenance graphs, and lineage capture are no longer
optional add-ons but are central requirements for BI reliability. Furthermore, these findings illustrate
how governance frameworks such as DAMA-DMBOK and ISO-aligned quality standards become
operational through metadata and lineage, positioning ETL pipelines as governance instruments as
much as technical pipelines. The convergence of findings across multiple studies confirms that
transparency and accountability in BI are built on metadata-centric design choices, which transform
ETL into an auditable and reproducible mechanism of data stewardship.
A further significant finding of this review is the comparative analysis of commercial ETL tools such as
Informatica, IBM DataStage, and Microsoft SSIS, alongside open-source frameworks including Talend,
Pentaho, NiFi, and Airbyte. Out of the 63 reviewed studies, 19 articles provided tool-specific
evaluations, collectively garnering over 3,900 citations. The evidence from these studies highlights clear
distinctions in strengths and limitations between commercial and open-source approaches.
Commercial tools are consistently described as mature, feature-rich, and highly aligned with enterprise
governance and compliance requirements, but they carry significant licensing and training costs. By
contrast, open-source frameworks are praised for their flexibility, extensibility, and cost-effectiveness,
though studies frequently report challenges in enterprise support, advanced metadata handling, and
large-scale orchestration. An additional five studies with approximately 600 citations focused on
orchestration tools such as Apache Airflow and Prefect, which were shown to complement ETL
pipelines by providing dependency management, monitoring, and reproducibility. These findings
collectively indicate that the decision between commercial and open-source tools is not purely technical
but organizational, influenced by governance maturity, budget constraints, and strategic alignment
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
112
with vendor ecosystems. The reviewed literature suggests that both ecosystems continue to evolve
toward convergence, with commercial vendors adopting open APIs and open-source tools
strengthening governance features.
The final major finding of this review is the rising importance of cloud-native ETL services and the
global governance landscape that shapes their adoption. Of the 63 studies, 20 articles explicitly
examined cloud-native services such as AWS Glue, Azure Data Factory, and Google Dataflow,
amassing over 4,500 citations. These studies revealed how elasticity, serverless execution, and
integration with managed catalogs and security controls have positioned cloud ETL services as
essential enablers of scalable BI. The findings emphasize that metadata-driven designs, policy-based
access, and automated lineage embedded into these services align directly with international
governance requirements. Alongside cloud-native innovations, 11 studies with approximately 1,800
citations addressed regulatory frameworks such as GDPR, CCPA, OECD, and UNCTAD. These works
highlighted how cross-border dataflows necessitate ETL designs that incorporate encryption, locality
controls, and consent-aware metadata to comply with legal and policy requirements. The findings
clearly demonstrate that global governance is not external to ETL but embedded within its operational
fabric, with compliance operationalized through technical features such as lineage, documentation, and
automated enforcement. Taken together, these studies establish that scalable BI now depends as much
on governance compatibility and regulatory resilience as it does on performance and functionality.
DISCUSSION
The findings of this review underscore a significant paradigm shift from traditional ETL models to ELT
and streaming architectures, aligning with and extending earlier scholarship. Historically, ETL
pipelines dominated data warehousing, where transformations occurred prior to data loading,
reflecting the computational limitations of early database systems (Pangbourne et al., 2020). This
study’s synthesis demonstrates that the move toward ELT has been facilitated by cloud-native data
warehouses such as (Gómez-Mera, 2017), who argued that MPP architectures optimize in-database
transformations by reducing data movement. Earlier works by (Campbell-Verduyn, 2018) also
anticipated the rise of streaming ETL as organizations pursued real-time responsiveness, a
development confirmed by (Artioli et al., 2017) in their articulation of stream processing models. The
findings from this review expand on those insights by highlighting how both ELT and streaming
coexist in hybrid enterprise environments, suggesting that rather than replacing ETL outright, newer
paradigms supplement its functions. This nuanced conclusion contrasts with earlier claims that ETL
was becoming obsolete in the age of big data (Vince & Hardesty, 2017), instead demonstrating that the
technology remains central but must evolve to meet scalability and latency requirements. Therefore,
the findings situate ETL as part of a continuum, where its original batch-centric design is now
integrated with cloud elasticity and streaming semantics, confirming and extending earlier theoretical
predictions while grounding them in recent empirical evidence (Medeiros et al., 2020).
The centrality of data quality within ETL pipelines, as revealed in this study, corroborates earlier
assertions that BI systems are only as effective as the data upon which they rely. Prior work emphasized
that accuracy, completeness, and consistency were key indicators of quality, but contemporary findings
extend this framework by highlighting timeliness and validity as equally critical in real-time and near-
real-time BI contexts. Earlier methodological contributions such as those by (Biermann et al., 2017)
demonstrated that data quality controls could be embedded within ETL transformations, an insight
confirmed and expanded upon by more recent studies emphasizing metadata-driven validation. The
reviewed evidence reinforces the argument of (Conti & Gupta, 2016), who proposed ETL metamodels
that incorporate profiling and validation as design-time elements, thereby ensuring multidimensional
quality enforcement. Importantly, while earlier studies tended to treat data quality as a technical
concern, the findings of this review reveal its governance dimension, connecting quality enforcement
with organizational accountability and international standards such as ISO 8000. This shift illustrates
how the conceptual boundaries of data quality have expanded beyond the warehouse to encompass
regulatory, compliance, and cross-border concerns. The convergence of technical and governance
perspectives suggests a maturation of the field, moving from foundational definitions of quality toward
institutionalized practices embedded within ETL, thereby confirming but also broadening the scope of
earlier studies (Adams et al., 2019).
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
113
Figure 14: Effective Data Integration Pipeline Framework
The findings on metadata and lineage underscore their critical role in supporting BI reliability and
transparency, reaffirming arguments presented in earlier literature. Metadata has long been described
as the “glue” of data management, enabling schema reconciliation and transformation traceability. This
review confirms those insights while also emphasizing how lineage functions as a governance tool that
enhances trust in BI outputs, echoing the observations of (Ruggie, 2018). Earlier studies primarily
conceptualized lineage as a mechanism for debugging and optimization, but contemporary findings
highlight its regulatory significance, particularly in compliance with GDPR and CCPA. The evidence
synthesized in this review suggests that metadata-driven ETL not only facilitates reusability and
efficiency, as suggested by (Pillai & Al-Malkawi, 2018), but also embeds auditability and accountability
directly into integration workflows. This development reflects a broader shift from viewing metadata
as a technical necessity to regarding it as a socio-technical asset central to governance frameworks such
as DAMA-DMBOK. The findings thus align with earlier research while extending its implications into
the domain of compliance, transparency, and trust, situating metadata and lineage at the intersection
of technical design and global governance (Durch et al., 2016).
This study’s comparative findings on commercial and open-source ETL tools confirm and elaborate
upon prior evaluations of enterprise integration technologies. Earlier works often emphasized the
robustness of commercial tools such as Informatica and IBM DataStage in delivering enterprise-grade
metadata management, governance, and compliance support. These findings remain valid, as
commercial tools continue to dominate in highly regulated environments. However, this review also
highlights the increasing competitiveness of open-source frameworks such as Talend, Pentaho, and
NiFi, supporting earlier claims by (Voegtlin & Scherer, 2017) that open-source systems provide cost-
effective alternatives for heterogeneous integration. Studies such as (Kring & Grimes, 2019) predicted
the rise of metadata-driven orchestration in open-source ecosystems, and the reviewed literature
demonstrates that orchestration platforms like Apache Airflow and Prefect now fulfill this role
effectively. The comparative analysis further suggests that the trade-off between commercial and open-
source tools is no longer strictly about functionality versus cost, but also about governance alignment,
vendor ecosystems, and scalability under hybrid cloud conditions. These findings confirm the
persistence of commercial solutions as benchmarks of reliability while validating the maturation of
open-source frameworks into viable enterprise-grade options. The conclusion builds upon and extends
earlier comparative studies by demonstrating that both ecosystems are converging, with commercial
platforms adopting open standards and open-source tools integrating governance features (Crane et
al., 2019).
The findings concerning cloud-native ETL services confirm the growing scholarly consensus that
elasticity, serverless execution, and integration with managed governance services are critical enablers
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
114
of BI scalability. Earlier research emphasized the limitations of on-premises ETL, particularly in terms
of resource constraints and infrastructure overhead. Studies such as (Auld et al., 2015) highlighted how
cloud warehouses exploit distributed storage and compute for transformation pushdown, a point
strongly reinforced by the findings in this review. Furthermore, the results align with , who discussed
the convergence of batch and stream processing within unified frameworks such as Dataflow and Spark
Structured Streaming (Carbone et al., 2017). This review expands these insights by showing how AWS
Glue, Azure Data Factory, and Google Dataflow embed metadata catalogs, lineage, and security
controls directly into their pipelines, thereby combining scalability with governance. While earlier
studies treated cloud integration primarily as a performance issue, the reviewed evidence emphasizes
its compliance and accountability dimensions, connecting serverless design with ISO and GDPR
requirements (Dhaouadi et al., 2022). These findings suggest a convergence between technical and legal
imperatives in cloud-native ETL, demonstrating how scalability now inherently includes governance
compatibility.
The findings on global governance highlight the increasing centrality of regulatory compliance in
shaping ETL pipeline design, confirming earlier observations while deepening their implications.
Studies on GDPR and CCPA have consistently emphasized their disruptive impact on cross-border
data flows (Nwokeji & Matovu, 2021). This review confirms those findings while showing how
organizations operationalize compliance by embedding metadata, lineage, and documentation directly
into ETL workflows. Earlier comparative governance work by (Akhavan-Hejazi & Mohsenian-Rad,
2018) framed cross-border data flows as an economic challenge requiring interoperable safeguards, and
the reviewed evidence demonstrates that ETL pipelines are the technical instruments through which
these safeguards are enacted. Prior studies tended to discuss governance frameworks at a policy level,
but the findings here illustrate their technical realization through encryption, locality controls, and
consent-aware routing. This alignment demonstrates the dual role of ETL pipelines as both technical
and governance infrastructures, bridging earlier legal analyses with practical data integration
strategies. Consequently, the findings confirm earlier research on regulatory fragmentation while
extending it by illustrating the precise mechanismsmetadata models, provenance graphs, and audit
trailsthrough which compliance is achieved in BI systems (Berkani et al., 2020).
Synthesizing the findings with earlier scholarship reveals both continuity and evolution in the
discourse on ETL and BI. Foundational works such as (Berkani et al., 2020) conceptualized ETL as the
central link between transactional data and analytical warehouses, a perspective still evident in the
reviewed studies. However, the findings of this review expand the scope to include ELT, streaming,
cloud-native architectures, and governance compliance, reflecting the field’s adaptation to modern data
environments. Earlier systematic reviews, such as (Tamym et al., 2021), focused primarily on technical
optimization and workflow modeling, whereas contemporary evidence highlights socio-technical
dimensions such as metadata, governance, and regulatory compliance. This synthesis demonstrates
that while earlier studies provided the theoretical and architectural foundations, current research
situates ETL pipelines within global, cloud-driven, and compliance-oriented ecosystems (Stackowiak
et al., 2015). Thus, the findings confirm the durability of earlier insights while showing how the field
has matured to address new challenges, bridging technical optimization with governance imperatives.
This progression illustrates the dynamic and adaptive nature of ETL scholarship, affirming its
continued relevance within the evolving landscape of business intelligence (Cardoso & Su, 2022).
CONCLUSION
This systematic review demonstrated that ExtractTransformLoad (ETL) pipelines remain a
foundational element of scalable business intelligence, while also evolving to encompass ELT,
streaming, and cloud-native paradigms that reflect modern data ecosystems. By analyzing 63 studies,
the review highlighted that data quality, metadata management, and governance are not peripheral
considerations but central imperatives that determine the reliability and accountability of BI outputs.
The comparative analysis of commercial and open-source tools revealed that enterprise-grade
platforms such as Informatica, IBM DataStage, and Microsoft SSIS continue to dominate highly
regulated environments, while open-source and community-driven tools like Talend, Pentaho, NiFi,
and Airbyte provide flexible and cost-effective alternatives increasingly supported by orchestration
layers such as Airflow and Prefect. Cloud-native services, including AWS Glue, Azure Data Factory,
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
115
and Google Dataflow, were found to redefine integration by embedding serverless scalability, lineage,
and policy enforcement, illustrating that technical efficiency and regulatory compliance are now deeply
intertwined. At the same time, global governance frameworks such as GDPR, CCPA, OECD, and
UNCTAD highlight the necessity of embedding compliance directly into ETL workflows through
metadata, lineage, and documentation, making pipelines both technical enablers and governance
infrastructures. Taken together, the findings position ETL pipelines as evolving socio-technical systems
that sustain analytical scalability, institutional trust, and regulatory legitimacy in increasingly complex
and globalized data environments.
RECOMMENDATIONS
Based on the findings and discussion of this systematic review, several key recommendations can be
made for both practitioners and researchers seeking to strengthen the role of ETL pipelines in scalable
business intelligence systems. First, organizations should prioritize hybrid integration strategies that
combine ETL, ELT, and streaming approaches rather than relying exclusively on one paradigm. This
allows enterprises to leverage batch accuracy, in-database transformation scalability, and real-time
responsiveness simultaneously, meeting diverse business intelligence requirements across sectors.
Second, multidimensional data quality assurance must be embedded directly into integration
workflows, including automated profiling, rule-based validation, and late-arrival handling, to prevent
error propagation and to maintain trust in BI insights. Third, firms are encouraged to adopt metadata-
centric architectures with lineage and provenance features that provide transparency, reproducibility,
and auditability, thereby supporting governance and compliance under frameworks such as DAMA-
DMBOK and ISO 8000. Fourth, when selecting integration tools, enterprises should evaluate both
commercial and open-source ecosystems not only in terms of cost and functionality but also with regard
to governance alignment, support structures, and interoperability with existing cloud infrastructures.
Fifth, cloud-native ETL services such as AWS Glue, Azure Data Factory, and Google Dataflow should
be leveraged for their elasticity and embedded governance capabilities, while organizations must also
implement cost and compliance management practices to mitigate vendor lock-in risks and regulatory
exposure. Finally, policymakers and regulators should collaborate with technical communities to
harmonize governance standards for cross-border dataflows, ensuring that compliance requirements
are technically operationalizable through metadata, lineage, and documentation. For researchers,
future inquiry should expand on comparative effectiveness across tool categories, long-term cost-
benefit analyses, and case studies that capture how regulatory compliance reshapes ETL design in
practice. These recommendations collectively emphasize that sustainable business intelligence requires
not only technical optimization but also governance-aware integration strategies that are adaptable,
transparent, and globally compliant.
REFERENCES
[1]. Adams, D., Adams, K., Ullah, S., & Ullah, F. (2019). Globalisation, governance, accountability and the natural
resource ‘curse’: Implications for socio-economic growth of oil-rich developing countries. Resources Policy, 61, 128-
140.
[2]. Akanbi, A., & Masinde, M. (2020). A distributed stream processing middleware framework for real-time analysis of
heterogeneous data on big data platform: Case of environmental monitoring. Sensors, 20(11), 3166.
[3]. Akhavan-Hejazi, H., & Mohsenian-Rad, H. (2018). Power systems big data analytics: An assessment of paradigm
shift barriers and prospects. Energy Reports, 4, 91-100.
[4]. Ali, A. R. (2018). Real-time big data warehousing and analysis framework. 2018 IEEE 3rd International Conference
on Big Data Analysis (ICBDA),
[5]. Alomar, M. A. (2022). Performance optimization of industrial supply chain using artificial intelligence. Computational
Intelligence and Neuroscience, 2022(1), 9306265.
[6]. Alvizu, R., Maier, G., Kukreja, N., Pattavina, A., Morro, R., Capello, A., & Cavazzoni, C. (2017). Comprehensive
survey on T-SDN: Software-defined networking for transport networks. IEEE Communications Surveys & Tutorials,
19(4), 2232-2283.
[7]. Anderson, J. A., Glaser, J., & Glotzer, S. C. (2020). HOOMD-blue: A Python package for high-performance molecular
dynamics and hard particle Monte Carlo simulations. Computational Materials Science, 173, 109363.
[8]. Artioli, F., Acuto, M., & McArthur, J. (2017). The water-energy-food nexus: An integration agenda and implications
for urban governance. Political Geography, 61, 215-223.
[9]. Atwal, H. (2020). Practical DataOps. Practical DataOps (1st ed.). Apress Berkeley, CA. https://doi. org/10.1007/978-1-4842-
5104-1.
[10]. Auld, G., Renckens, S., & Cashore, B. (2015). Transnational private governance between the logics of empowerment
and control. Regulation & Governance, 9(2), 108-124.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
116
[11]. Azevedo, J., Duarte, J., & Santos, M. F. (2022). Implementing a business intelligence cost accounting solution in a
healthcare setting. Procedia Computer Science, 198, 329-334.
[12]. Bakker, K., & Ritts, M. (2018). Smart Earth: A meta-review and implications for environmental governance. Global
environmental change, 52, 201-211.
[13]. Behrisch, M., Streeb, D., Stoffel, F., Seebacher, D., Matejek, B., Weber, S. H., Mittelstaedt, S., Pfister, H., & Keim, D.
(2018). Commercial visual analytics systemsadvances in the big data analytics field. IEEE transactions on visualization
and computer graphics, 25(10), 3011-3031.
[14]. Berkani, N., & Bellatreche, L. (2018). Streaming ETL in polystore era. International Conference on Algorithms and
Architectures for Parallel Processing,
[15]. Berkani, N., Bellatreche, L., Khouri, S., & Ordonez, C. (2020). The contribution of linked open data to augment a
traditional data warehouse. Journal of Intelligent Information Systems, 55(3), 397-421.
[16]. Bi, D., Almpanis, A., Noel, A., Deng, Y., & Schober, R. (2021). A survey of molecular communication in cell biology:
Establishing a new hierarchy for interdisciplinary applications. IEEE Communications Surveys & Tutorials, 23(3), 1494-
1545.
[17]. Biermann, F., Kanie, N., & Kim, R. E. (2017). Global governance by goal-setting: the novel approach of the UN
Sustainable Development Goals. Current Opinion in Environmental Sustainability, 26, 26-31.
[18]. Bimonte, S., Billaud, O., Fontaine, B., Martin, T., Flouvat, F., Hassan, A., Rouillier, N., & Sautot, L. (2021). Collect and
analysis of agro-biodiversity data in a participative context: A business intelligence framework. Ecological Informatics,
61, 101231.
[19]. Biplob, M. B., Sheraji, G. A., & Khan, S. I. (2018). Comparison of different extraction transformation and loading tools
for data warehousing. 2018 international conference on innovations in science, engineering and technology (ICISET),
[20]. Biswas, N., & Mondal, K. C. (2021). Integration of ETL in cloud using spark for streaming data. International
Conference on Emerging Applications of Information Technology,
[21]. Biswas, N., Sarkar, A., & Mondal, K. C. (2020). Efficient incremental loading in ETL processing for real-time data
integration. Innovations in Systems and Software Engineering, 16(1), 53-61.
[22]. Bramerdorfer, G., Tapia, J. A., Pyrhönen, J. J., & Cavagnino, A. (2018). Modern electrical machine design optimization:
Techniques, trends, and best practices. IEEE Transactions on Industrial Electronics, 65(10), 7672-7684.
[23]. Brown, M. A., & Soni, A. (2019). Expert perceptions of enhancing grid resilience with electric vehicles in the United
States. Energy Research & Social Science, 57, 101241.
[24]. Bryzgalov, A., & Stupnikov, S. (2020). A Cloud-Native Serverless Approach for Implementation of Batch Extract-
Load Processes in Data Lakes. International Conference on Data Analytics and Management in Data Intensive
Domains,
[25]. Campbell-Verduyn, M. (2018). Bitcoin, crypto-coins, and global anti-money laundering governance. Crime, Law and
Social Change, 69(2), 283-305.
[26]. Carbone, P., Gévay, G. E., Hermann, G., Katsifodimos, A., Soto, J., Markl, V., & Haridi, S. (2017). Large-scale data
stream processing systems. In Handbook of big data technologies (pp. 219-260). Springer.
[27]. Cardoso, E., & Su, X. (2022). Designing a business intelligence and analytics maturity model for higher education: A
design science approach. Applied Sciences, 12(9), 4625.
[28]. Conti, K. I., & Gupta, J. (2016). Global governance principles for the sustainable development of groundwater
resources. International Environmental Agreements: Politics, Law and Economics, 16(6), 849-871.
[29]. Crane, A., LeBaron, G., Allain, J., & Behbahani, L. (2019). Governance gaps in eradicating forced labor: From global
to domestic supply chains. Regulation & Governance, 13(1), 86-106.
[30]. Dabic-Miletic, S., Simic, V., & Karagoz, S. (2021). End-of-life tire management: a critical review. Environmental science
and pollution research, 28(48), 68053-68070.
[31]. Darmont, J., Novikov, B., Wrembel, R., & Bellatreche, L. (2022). Advances on data management and information
systems. Information Systems Frontiers, 24(1), 1-10.
[32]. Dey, P. K., Malesios, C., De, D., Chowdhury, S., & Abdelaziz, F. B. (2020). The impact of lean management practices
and sustainablyoriented innovation on sustainability performance of small and mediumsized enterprises: empirical
evidence from the UK. British Journal of Management, 31(1), 141-161.
[33]. Dhaouadi, A., Bousselmi, K., Gammoudi, M. M., Monnet, S., & Hammoudi, S. (2022). Data warehousing process
modeling from classical approaches to new trends: Main features and comparisons. Data, 7(8), 113.
[34]. Dineva, K., & Atanasova, T. (2021). Design of scalable IoT architecture based on AWS for smart livestock. Animals,
11(9), 2697.
[35]. Diouf, P. S., Boly, A., & Ndiaye, S. (2017). Performance of the ETL processes in terms of volume and velocity in the
cloud: State of the art. 2017 4th IEEE international conference on engineering technologies and applied sciences
(ICETAS),
[36]. Durch, W., Larik, J., & Ponzio, R. (2016). Just Security and the Crisis of Global Governance. Survival, 58(4), 95-112.
[37]. Fan, S.-C., & Yu, K.-C. (2017). How an integrative STEM curriculum can benefit students in engineering design
practices. International Journal of Technology and Design Education, 27(1), 107-129.
[38]. Feng, D., She, C., Ying, K., Lai, L., Hou, Z., Quek, T. Q., Li, Y., & Vucetic, B. (2019). Toward ultrareliable low-latency
communications: Typical scenarios, possible solutions, and open issues. IEEE Vehicular Technology Magazine, 14(2),
94-102.
[39]. Feng, K., Chen, S., & Lu, W. (2018). Machine learning based construction simulation and optimization. 2018 Winter
Simulation Conference (WSC),
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
117
[40]. Figueiras, P., Costa, R., Guerreiro, G., Antunes, H., Rosa, A., & Jardim-Gonçalves, R. (2017). User interface support
for a big ETL data processing pipeline an application scenario on highway toll charging models. 2017 International
conference on engineering, technology and innovation (ICE/ITMC),
[41]. Figueiras, P., Herga, Z., Guerreiro, G., Rosa, A., Costa, R., & Jardim-Gonçalves, R. (2018). Real-time monitoring of
road traffic using data stream mining. 2018 IEEE International Conference on Engineering, Technology and
Innovation (ICE/ITMC),
[42]. Ghosh, R., Haider, S., & Sen, S. (2015). An integrated approach to deploy data warehouse in business intelligence
environment. Proceedings of the 2015 Third International Conference on Computer, Communication, Control and
Information Technology (C3IT),
[43]. Gidaris, I., & Taflanidis, A. A. (2015). Performance assessment and optimization of fluid viscous dampers through
life-cycle cost criteria and comparison to alternative design approaches. Bulletin of Earthquake Engineering, 13(4), 1003-
1028.
[44]. Godinho, T. M., Lebre, R., Almeida, J. R., & Costa, C. (2019). Etl framework for real-time business intelligence over
medical imaging repositories. Journal of digital imaging, 32(5), 870-879.
[45]. Gómez-Mera, L. (2017). The global governance of trafficking in persons: Toward a transnational regime complex.
Journal of Human Trafficking, 3(4), 303-326.
[46]. Goss, R., & Subramany, L. (2021). Journey to a Big Data Analysis Platform: Are we there yet? 2021 32nd Annual SEMI
Advanced Semiconductor Manufacturing Conference (ASMC),
[47]. Guarda, T., & Lopes, I. (2022). Augmented Analytics an Innovative Paradigm. International Conference on
Innovations in Bio-Inspired Computing and Applications,
[48]. Gudivada, V. N., Irfan, M. T., Fathi, E., & Rao, D. L. (2016). Cognitive analytics: Going beyond big data analytics and
machine learning. In Handbook of statistics (Vol. 35, pp. 169-205). Elsevier.
[49]. Hahn, S. M. L., Chereja, I., & Matei, O. (2021). Evaluation of transformation tools in the context of NoSQL databases.
Proceedings of SAI Intelligent Systems Conference,
[50]. Henriques, N., Sargento, S., Neves, P., Pérez, M. G., Pérez, G. M., Bernini, G., Wang, Q., Alcaraz-Calero, J. M., &
Koutsopoulos, K. (2018). Catalog-driven services in a 5G SDN/NFV self-managed environment. 2018 IEEE
Symposium on Computers and Communications (ISCC),
[51]. Hosne Ara, M., Tonmoy, B., Mohammad, M., & Md Mostafizur, R. (2022). AI-ready data engineering pipelines: a
review of medallion architecture and cloud-based integration models. American Journal of Scholarly Research and
Innovation, 1(01), 319-350. https://doi.org/10.63125/51kxtf08
[52]. Imran, S., Mahmood, T., Morshed, A., & Sellis, T. (2020). Big data analytics in healthcare− A systematic literature
review and roadmap for practical implementation. IEEE/CAA Journal of Automatica Sinica, 8(1), 1-22.
[53]. Isah, H., Abughofa, T., Mahfuz, S., Ajerla, D., Zulkernine, F., & Khan, S. (2019). A survey of distributed data stream
processing frameworks. IEEE Access, 7, 154300-154316.
[54]. Jackson, R., Kartoglu, I., Stringer, C., Gorrell, G., Roberts, A., Song, X., Wu, H., Agrawal, A., Lui, K., & Groza, T.
(2018). CogStack-experiences of deploying integrated information retrieval and extraction services in a large National
Health Service Foundation Trust hospital. BMC medical informatics and decision making, 18(1), 47.
[55]. Jahid, M. K. A. S. R. (2022). Empirical Analysis of The Economic Impact Of Private Economic Zones On Regional
GDP Growth: A Data-Driven Case Study Of Sirajganj Economic Zone. American Journal of Scholarly Research and
Innovation, 1(02), 01-29. https://doi.org/10.63125/je9w1c40
[56]. Jamedžija, M., & Đurić, Z. (2021). Moonlight: A push-based api for tracking data lineage in modern etl processes.
2021 20th International Symposium INFOTEH-JAHORINA (INFOTEH),
[57]. Jovanovic, P., Romero, O., & Abelló, A. (2016). A unified view of data-intensive flows in business intelligence
systems: a survey. In Transactions on Large-Scale Data-and Knowledge-Centered Systems XXIX (pp. 66-107). Springer.
[58]. Konis, K., Gamas, A., & Kensek, K. (2016). Passive performance and building form: An optimization framework for
early-stage design support. Solar Energy, 125, 161-179.
[59]. Kougka, G., Gounaris, A., & Simitsis, A. (2018). The many faces of data-centric workflow optimization: a survey.
International Journal of Data Science and Analytics, 6(2), 81-107.
[60]. Kousalya, G., Balakrishnan, P., & Raj, C. P. (2017). Automated workflow scheduling in self-adaptive clouds. Springer.
[61]. Krawatzeck, R., Dinter, B., & Thi, D. A. P. (2015). How to make business intelligence agile: The Agile BI actions
catalog. 2015 48th Hawaii International Conference on System Sciences,
[62]. Kring, W. N., & Grimes, W. W. (2019). Leaving the nest: The rise of regional financial arrangements and the future of
global governance. Development and Change, 50(1), 72-95.
[63]. Kumar, S., Yenamandra, P., Chan, R., Dai, J., Kumar, K., & Palecha, N. (2021). Test Bench Optimization for Better
Simulation Performance. 2021 Second International Conference on Electronics and Sustainable Communication
Systems (ICESC),
[64]. L’Esteve, R. C. (2021). The Tools and Prerequisites. In The Definitive Guide to Azure Data Engineering: Modern ELT,
DevOps, and Analytics on the Azure Cloud Platform (pp. 3-33). Springer.
[65]. Landi, D., Vitali, S., & Germani, M. (2016). Environmental analysis of different end of life scenarios of tires textile
fibers. Procedia Cirp, 48, 508-513.
[66]. Lanza-Cruz, I., Berlanga, R., & Aramburu, M. J. (2018). Modeling analytical streams for social business intelligence.
Informatics,
[67]. Larson, D., & Chang, V. (2016). A review and future direction of agile, business intelligence, analytics and data
science. International Journal of Information Management, 36(5), 700-710.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
118
[68]. Lastra-González, P., Calzada-Pérez, M. A., Castro-Fresno, D., Vega-Zamanillo, Á., & Indacoechea-Vega, I. (2016).
Comparative analysis of the performance of asphalt concretes modified by dry way with polymeric waste.
Construction and Building Materials, 112, 1133-1140.
[69]. Lou, K., Xiao, P., Kang, A., Wu, Z., Li, B., & Lu, P. (2021). Performance evaluation and adaptability optimization of
hot mix asphalt reinforced by mixed lengths basalt fibers. Construction and Building Materials, 292, 123373.
[70]. Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2020). Big data preprocessing. Cham: Springer,
1, 1-186.
[71]. Lukić, J., Radenković, M., Despotović-Zrakić, M., Labus, A., & Bogdanović, Z. (2016). A hybrid approach to building
a multi-dimensional business intelligence system for electricity grid operators. Utilities Policy, 41, 95-106.
[72]. Machado, G. V., Cunha, Í., Pereira, A. C., & Oliveira, L. B. (2019). DOD-ETL: distributed on-demand ETL for near
real-time business intelligence. Journal of Internet Services and Applications, 10(1), 21.
[73]. Mallek, H., Ghozzi, F., Teste, O., & Gargouri, F. (2018). BigDimETL with NoSQL database. Procedia Computer Science,
126, 798-807.
[74]. Mansura Akter, E., & Md Abdul Ahad, M. (2022). In Silico drug repurposing for inflammatory diseases: a systematic
review of molecular docking and virtual screening studies. American Journal of Advanced Technology and Engineering
Solutions, 2(04), 35-64. https://doi.org/10.63125/j1hbts51
[75]. Md Arifur, R., & Sheratun Noor, J. (2022). A Systematic Literature Review of User-Centric Design In Digital Business
Systems: Enhancing Accessibility, Adoption, And Organizational Impact. Review of Applied Science and Technology,
1(04), 01-25. https://doi.org/10.63125/ndjkpm77
[76]. Md Hasan, Z., & Moin Uddin, M. (2022). Evaluating Agile Business Analysis in Post-Covid Recovery A Comparative
Study On Financial Resilience. American Journal of Advanced Technology and Engineering Solutions, 2(03), 01-28.
https://doi.org/10.63125/6nee1m28
[77]. Md Mahamudur Rahaman, S. (2022). Electrical And Mechanical Troubleshooting in Medical And Diagnostic Device
Manufacturing: A Systematic Review Of Industry Safety And Performance Protocols. American Journal of Scholarly
Research and Innovation, 1(01), 295-318. https://doi.org/10.63125/d68y3590
[78]. Md Nazrul Islam, K. (2022). A Systematic Review of Legal Technology Adoption In Contract Management, Data
Governance, And Compliance Monitoring. American Journal of Interdisciplinary Studies, 3(01), 01-30.
https://doi.org/10.63125/caangg06
[79]. Md Nur Hasan, M., Md Musfiqur, R., & Debashish, G. (2022). Strategic Decision-Making in Digital Retail Supply
Chains: Harnessing AI-Driven Business Intelligence From Customer Data. Review of Applied Science and Technology,
1(03), 01-31. https://doi.org/10.63125/6a7rpy62
[80]. Md Redwanul, I., & Md. Zafor, I. (2022). Impact of Predictive Data Modeling on Business Decision-Making: A Review
Of Studies Across Retail, Finance, And Logistics. American Journal of Advanced Technology and Engineering Solutions,
2(02), 33-62. https://doi.org/10.63125/8hfbkt70
[81]. Md Rezaul, K., & Md Mesbaul, H. (2022). Innovative Textile Recycling and Upcycling Technologies For Circular
Fashion: Reducing Landfill Waste And Enhancing Environmental Sustainability. American Journal of Interdisciplinary
Studies, 3(03), 01-35. https://doi.org/10.63125/kkmerg16
[82]. Md Takbir Hossen, S., & Md Atiqur, R. (2022). Advancements In 3d Printing Techniques For Polymer Fiber-
Reinforced Textile Composites: A Systematic Literature Review. American Journal of Interdisciplinary Studies, 3(04), 32-
60. https://doi.org/10.63125/s4r5m391
[83]. Md Tawfiqul, I., Meherun, N., Mahin, K., & Mahmudur Rahman, M. (2022). Systematic Review of Cybersecurity
Threats In IOT Devices Focusing On Risk Vectors Vulnerabilities And Mitigation Strategies. American Journal of
Scholarly Research and Innovation, 1(01), 108-136. https://doi.org/10.63125/wh17mf19
[84]. Md. Sakib Hasan, H. (2022). Quantitative Risk Assessment of Rail Infrastructure Projects Using Monte Carlo
Simulation And Fuzzy Logic. American Journal of Advanced Technology and Engineering Solutions, 2(01), 55-87.
https://doi.org/10.63125/h24n6z92
[85]. Md. Tarek, H. (2022). Graph Neural Network Models For Detecting Fraudulent Insurance Claims In Healthcare
Systems. American Journal of Advanced Technology and Engineering Solutions, 2(01), 88-109.
https://doi.org/10.63125/r5vsmv21
[86]. Md.Kamrul, K., & Md Omar, F. (2022). Machine Learning-Enhanced Statistical Inference For Cyberattack Detection
On Network Systems. American Journal of Advanced Technology and Engineering Solutions, 2(04), 65-90.
https://doi.org/10.63125/sw7jzx60
[87]. Md.Kamrul, K., & Md. Tarek, H. (2022). A Poisson Regression Approach to Modeling Traffic Accident Frequency in
Urban Areas. American Journal of Interdisciplinary Studies, 3(04), 117-156. https://doi.org/10.63125/wqh7pd07
[88]. Medeiros, D. S., Cunha Neto, H. N., Lopez, M. A., S. Magalhães, L. C., Fernandes, N. C., Vieira, A. B., Silva, E. F., &
F. Mattos, D. M. (2020). A survey on data analysis on large-Scale wireless networks: online stream processing, trends,
and challenges. Journal of Internet Services and Applications, 11(1), 6.
[89]. Mehmood, E., & Anees, T. (2022). Distributed real-time ETL architecture for unstructured big data. Knowledge and
information systems, 64(12), 3419-3445.
[90]. Mridha, M., Basri, R., Monowar, M. M., & Hamid, M. A. (2021). A machine learning approach for screening
individual’s job profile using convolutional neural network. 2021 International Conference on Science &
Contemporary Technologies (ICSCT),
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
119
[91]. Mubashir, I., & Abdul, R. (2022). Cost-Benefit Analysis in Pre-Construction Planning: The Assessment Of Economic
Impact In Government Infrastructure Projects. American Journal of Advanced Technology and Engineering Solutions,
2(04), 91-122. https://doi.org/10.63125/kjwd5e33
[92]. Mukherjee, R., & Kar, P. (2017). A comparative review of data warehousing ETL tools with new trends and industry
insight. 2017 IEEE 7th International Advance Computing Conference (IACC),
[93]. Mwilu, O. S., Comyn-Wattiau, I., & Prat, N. (2016). Design science research contribution to business intelligence in
the cloudA systematic literature review. Future Generation Computer Systems, 63, 108-122.
[94]. Nešetřil, K., & Šembera, J. (2017). Business intelligence and geographic information system for hydrogeology.
International Symposium on Environmental Software Systems,
[95]. Nwokeji, J. C., & Matovu, R. (2021). A systematic literature review on big data extraction, transformation and loading
(etl). Intelligent computing, 308-324.
[96]. Omar Muhammad, F., & Md.Kamrul, K. (2022). Blockchain-Enabled BI For HR And Payroll Systems: Securing
Sensitive Workforce Data. American Journal of Scholarly Research and Innovation, 1(02), 30-58.
https://doi.org/10.63125/et4bhy15
[97]. Ordonez-Lucena, J., Adamuz-Hinojosa, O., Ameigeiras, P., Muñoz, P., Ramos-Muñoz, J. J., Chavarria, J. F., & Lopez,
D. (2018). The creation phase in network slicing: From a service order to an operative network slice. 2018 European
Conference on Networks and Communications (EuCNC),
[98]. Oubibi, M., Zhou, Y., Oubibi, A., Fute, A., & Saleem, A. (2022). The challenges and opportunities for developing the
use of data and artificial intelligence (AI) in North Africa: case of Morocco. International conference on digital
technologies and applications,
[99]. Oumkaltoum, B., El Idrissi, M., El Benany, M. M., & Omar, E. B. (2019). Business intelligence and EDA based
architecture for interoperability of e-government data services. 2019 IEEE International Smart Cities Conference
(ISC2),
[100]. Pablo, P.-V. (2016). Business intelligence applied to monitoring and meta-monitoring scenarios. 2016 11th Iberian
Conference on Information Systems and Technologies (CISTI),
[101]. Pan, B., Zhang, G., & Qin, X. (2018). Design and realization of an ETL method in business intelligence project. 2018
IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA),
[102]. Pangbourne, K., Mladenović, M. N., Stead, D., & Milakis, D. (2020). Questioning mobility as a service: Unanticipated
implications for society and governance. Transportation research part A: policy and practice, 131, 35-49.
[103]. Pape, T. (2016). Prioritising data items for business analytics: Framework and application to human resources.
European Journal of Operational Research, 252(2), 687-698.
[104]. Patel, M., & Patel, D. B. (2020). Progressive growth of ETL tools: A literature review of past to equip future. Rising
Threats in Expert Applications and Solutions: Proceedings of FICR-TEAS 2020, 389-398.
[105]. Pereira, L., Bentes, C., de Castro, M. C. S., & Garcia, E. (2017). A Case Study of Performance Optimization in a
Heterogeneous Environment. 2017 International Symposium on Computer Architecture and High Performance
Computing Workshops (SBAC-PADW),
[106]. Pillai, R., & Al-Malkawi, H.-A. N. (2018). On the relationship between corporate governance and firm performance:
Evidence from GCC countries. Research in International Business and Finance, 44, 394-410.
[107]. Pusala, M. K., Amini Salehi, M., Katukuri, J. R., Xie, Y., & Raghavan, V. (2016). Massive data analysis: tasks, tools,
applications, and challenges. Big Data analytics: methods and applications, 11-40.
[108]. Qu, W., Basavaraj, V., Shankar, S., & Dessloch, S. (2015). Real-time snapshot maintenance with incremental ETL
pipelines in data warehouses. International Conference on Big Data Analytics and Knowledge Discovery,
[109]. Quinto, B. (2018). Batch and Real-Time Data Ingestion and Processing. In Next-Generation Big Data: A Practical Guide
to Apache Kudu, Impala, and Spark (pp. 231-374). Springer.
[110]. Raj, A., Bosch, J., Olsson, H. H., & Wang, T. J. (2020). Modelling data pipelines. 2020 46th Euromicro conference on
software engineering and advanced applications (SEAA),
[111]. Raj, P., Raman, A., Nagaraj, D., & Duggirala, S. (2015). High-performance integrated systems, databases, and
warehouses for big and fast data analytics. In High-Performance Big-Data Analytics: Computing Systems and Approaches
(pp. 233-274). Springer.
[112]. Raj, R., Wong, S. H. S., & Beaumont, A. J. (2016). Empowering SMEs to make better decisions with business
intelligence: a case study. International Joint Conference on Knowledge Discovery, Knowledge Engineering, and
Knowledge Management,
[113]. Ranchal, R., Bastide, P., Wang, X., Gkoulalas-Divanis, A., Mehra, M., Bakthavachalam, S., Lei, H., & Mohindra, A.
(2020). Disrupting healthcare silos: Addressing data volume, velocity and variety with a cloud-native healthcare data
ingestion service. IEEE Journal of Biomedical and Health Informatics, 24(11), 3182-3188.
[114]. Raschka, S., Patterson, J., & Nolet, C. (2020). Machine learning in python: Main developments and technology trends
in data science, machine learning, and artificial intelligence. Information, 11(4), 193.
[115]. Reduanul, H., & Mohammad Shoeb, A. (2022). Advancing AI in Marketing Through Cross Border Integration Ethical
Considerations And Policy Implications. American Journal of Scholarly Research and Innovation, 1(01), 351-379.
https://doi.org/10.63125/d1xg3784
[116]. Reinkemeyer, L. (2020). Process mining in action. Process mining in action principles, use cases and outloook, 11(7), 116-
128.
[117]. Ruggie, J. G. (2018). Multinationals as global institution: Power, authority and relative autonomy. Regulation &
Governance, 12(3), 317-333.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
120
[118]. Russo, M. (2022). Measuring performance: Metrics for manipulator design, control, and optimization. Robotics, 12(1),
4.
[119]. Ryen, V., Soylu, A., & Roman, D. (2022). Building semantic knowledge graphs from (semi-) structured data: a review.
Future Internet, 14(5), 129.
[120]. Sabharwal, N., & Kasiviswanathan, S. Workload Automation Using HWA.
[121]. Sabuj Kumar, S., & Zobayer, E. (2022). Comparative Analysis of Petroleum Infrastructure Projects In South Asia And
The Us Using Advanced Gas Turbine Engine Technologies For Cross Integration. American Journal of Advanced
Technology and Engineering Solutions, 2(04), 123-147. https://doi.org/10.63125/wr93s247
[122]. Sadia, T., & Shaiful, M. (2022). In Silico Evaluation of Phytochemicals From Mangifera Indica Against Type 2 Diabetes
Targets: A Molecular Docking And Admet Study. American Journal of Interdisciplinary Studies, 3(04), 91-116.
https://doi.org/10.63125/anaf6b94
[123]. Saleem, I., Lamarque, E., & Hasan, R. (2021). State and self-regulation for better governance: an implication of
collibration. International Journal of Law and Management, 63(2), 172-194.
[124]. Sanchez-Gomez, A., Martinez-Perez, S., Perez-Chavero, F. M., & Molina-Navarro, E. (2022). Optimization of a SWAT
model by incorporating geological information through calibration strategies. Optimization and Engineering, 23(4),
2203-2233.
[125]. Sazzad, I., & Md Nazrul Islam, K. (2022). Project impact assessment frameworks in nonprofit development: a review
of case studies from south asia. American Journal of Scholarly Research and Innovation, 1(01), 270-294.
https://doi.org/10.63125/eeja0t77
[126]. Schintler, L. A., & McNeely, C. L. (2022). Encyclopedia of big data. Springer.
[127]. Shaari, H., Durmić, N., & Ahmed, N. (2021). Modern ABI platforms for healthcare data processing. International
Symposium on Innovative and Interdisciplinary Applications of Advanced Technologies,
[128]. Sheratun Noor, J., & Momena, A. (2022). Assessment Of Data-Driven Vendor Performance Evaluation in Retail
Supply Chains: Analyzing Metrics, Scorecards, And Contract Management Tools. American Journal of Interdisciplinary
Studies, 3(02), 36-61. https://doi.org/10.63125/0s7t1y90
[129]. Shields, B. J., Stevens, J., Li, J., Parasram, M., Damani, F., Alvarado, J. I. M., Janey, J. M., Adams, R. P., & Doyle, A. G.
(2021). Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844), 89-96.
[130]. Shivakumar, S. K. (2020). Modern web performance optimization. Methods, Tools, and Patterns to Speed Up Digital
Platforms.
[131]. Sioshansi, R., & Conejo, A. J. (2017). Optimization in engineering. Cham: Springer International Publishing, 120.
[132]. Stackowiak, R., Licht, A., Mantha, V., & Nagode, L. (2015). Big data solutions and the Internet of Things. In Big Data
and the Internet of Things: Enterprise Information Architecture for a New Age (pp. 1-27). Springer.
[133]. Suleykin, A., & Panfilov, P. (2020). Metadata-driven industrial-grade ETL system. 2020 IEEE International Conference
on Big Data (Big Data),
[134]. Tahmina Akter, R., & Abdur Razzak, C. (2022). The Role of Artificial Intelligence in Vendor Performance Evaluation
Within Digital Retail Supply Chains: A Review Of Strategic Decision-Making Models. American Journal of Scholarly
Research and Innovation, 1(01), 220-248. https://doi.org/10.63125/96jj3j86
[135]. Tamym, L., Benyoucef, L., Moh, A. N. S., & El Ouadghiri, M. D. (2021). A big data based architecture for collaborative
networks: Supply chains mixed-network. Computer communications, 175, 102-111.
[136]. Taneja, K., Zhu, Q., Duggan, D., & Tung, T. (2015). Linked enterprise data model and its use in real time analytics
and context-driven data discovery. 2015 IEEE International Conference on Mobile Services,
[137]. Theodorou, V., Abelló, A., Lehner, W., & Thiele, M. (2016). Quality measures for ETL processes: from goals to
implementation. Concurrency and computation: practice and experience, 28(15), 3969-3993.
[138]. Tong, Z.-m., Xin, J.-g., Tong, S.-g., Yang, Z.-q., Zhao, J.-y., & Mao, J.-h. (2020). Internal flow structure, fault detection,
and performance optimization of centrifugal pumps. Journal of Zhejiang University-SCIENCE A, 21(2), 85-117.
[139]. Valdiviezo-Díaz, P., Cordero, J., Reátegui, R., & Aguilar, J. (2015). A business intelligence model for online tutoring
process. 2015 IEEE Frontiers in Education Conference (FIE),
[140]. Van der Spek, M., Roussanaly, S., & Rubin, E. S. (2019). Best practices and recent advances in CCS cost engineering
and economic analysis. International Journal of Greenhouse Gas Control, 83, 91-104.
[141]. Villegas-Ch, W., Palacios-Pacheco, X., & Luján-Mora, S. (2020). A business intelligence framework for analyzing
educational data. Sustainability, 12(14), 5745.
[142]. Vince, J., & Hardesty, B. D. (2017). Plastic pollution challenges in marine and coastal environments: from local to
global governance. Restoration ecology, 25(1), 123-128.
[143]. Voegtlin, C., & Scherer, A. G. (2017). Responsible innovation and the innovation of responsibility: Governing
sustainable development in a globalized world. Journal of business ethics, 143(2), 227-243.
[144]. Vyas, S., Tyagi, R. K., Jain, C., & Sahu, S. (2021). Literature review: A comparative study of real time streaming
technologies and apache kafka. 2021 Fourth International Conference on Computational Intelligence and
Communication Technologies (CCICT),
[145]. Wang, C.-H. (2016). A novel approach to conduct the importance-satisfaction analysis for acquiring typical user
groups in business-intelligence systems. Computers in Human Behavior, 54, 673-681.
[146]. Wang, L., & Zhao, J. (2020). Strategic Blueprint for Enterprise Analytics. Springer.
[147]. Yi, G., & Sui, Y. (2015). Different effects of economic and structural performance indexes on model construction of
structural topology optimization. Acta Mechanica Sinica, 31(5), 777-788.
ASRC Procedia: Global Perspectives in Science and Scholarship, April 2022, 89121
121
[148]. Zdravevski, E., Lameski, P., Apanowicz, C., & Ślȩzak, D. (2020). From Big Data to business analytics: The case study
of churn prediction. Applied Soft Computing, 90, 106164.
[149]. Zeng, D., Gu, L., & Guo, S. (2015). Cloud Networking for Big Data. Springer.
[150]. Zeydan, E., & Mangues-Bafalluy, J. (2022). Recent advances in data engineering for networking. IEEE Access, 10,
34449-34496.
[151]. Zhang, Y., Wang, S., & Ji, G. (2015). A comprehensive survey on particle swarm optimization algorithm and its
applications. Mathematical problems in engineering, 2015(1), 931256.
[152]. Zhong, H., Tan, Z., He, Y., Xie, L., & Kang, C. (2020). Implications of COVID-19 for the electricity industry: A
comprehensive review. CSEE Journal of Power and Energy Systems, 6(3), 489-495.
[153]. Zohuri, B., & Moghaddam, M. (2017). Business Resilience System (BRS): Driven through Boolean, fuzzy logics and cloud
computation (Vol. 11). Springer.