A Lakehouse to support mobile networks management
Diogo Miguel Pimpão Vilela
diogo.pimpao.vilela@tecnico.ulisboa.pt
Instituto Superior Técnico, Lisboa, Portugal
November 2024
Abstract
Currently, over 4 billion people are connected to the internet, with a large portion of these users
accessing it through mobile devices. Every day, these users generate a vast amount of data collected
by companies. With advances in Data Science, it has become possible to develop methods and models
to extract valuable insights about customers and enhance strategic decision-making. However, due to
the diversity of applications, devices, and technologies, this data is dispersed across various sources and
exists in multiple formats, both structured and unstructured. Integrating and interpreting this variety of
data represents a significant and complex challenge.
In this context, a data lakehouse was developed to support mobile networks management. This system
provides a scalable and efficient solution for data ingestion, processing, and storage, optimizing
access to critical information. It can strengthen the company’s processes, especially in its main application:
a platform that automates and enhances the workflows of telecommunications engineering teams,
ensuring accessible and reliable data for the analysis of Key Performance Indicators (KPIs).
To demonstrate the data architecture in action, an application was developed to forecast downlink
traffic on a telecommunications network. Traffic forecasting allows network managers to identify areas
requiring more antennas, as well as to develop strategies to optimize network resource usage, improve
energy efficiency, and reduce costs. The ARIMA and Holt-Winters methods were used for forecasting,
treating traffic as a time series.
Keywords: Telecommunications Management, Data Lakehouse, Time Series Forecasting
1. Introduction
The rapid expansion of the telecommunications
sector has led to substantial data generation, in-
cluding performance metrics, cell configurations,
and user activity. This data, essential for network
performance insights, is stored in diverse formats
across multiple sources, creating significant chal-
lenges in efficient ingestion, processing, and anal-
ysis.
Traditional data architectures, such as data
warehouses, have been instrumental but often
lack the scalability, adaptability, and cost-effectiveness
that today’s telecommunications workloads
demand [1]. The data lakehouse architecture
emerges as a solution, combining the strengths of
both data lakes and warehouses to manage structured
and unstructured data while enabling advanced
analytics and machine learning.
Managing and handling large volumes of
telecommunications data is essential for maintain-
ing network performance, providing optimal user
experience, and minimizing operational costs. Existing
systems frequently lack the speed, flexibility,
and ability to scale effectively in response
to the growing volume and variety of data.
In this thesis, a data lakehouse tailored to the
needs of the telecommunications industry is devel-
oped, specifically designed to enhance data inges-
tion, processing, and accessibility, thereby facilitat-
ing KPI analysis crucial for effective network man-
agement. This work is conducted in collaboration
with Multivision Consulting, a company renowned
for its Software as a Service (SaaS) platform Sim-
plifyd, which empowers telecom operators by op-
timizing network management through automation
and decision support.
To demonstrate the architecture’s capabilities,
the thesis includes a use case focused on predict-
ing downlink traffic, a KPI vital for user experience,
through ARIMA and Holt-Winters models. Accu-
rate traffic predictions are essential for efficient net-
work resource allocation, capacity management,
and cost optimization. By employing these fore-
casting models, the thesis illustrates how the archi-
tecture supports enhanced decision-making and
optimizes network performance.
2. Data Engineering
Modern companies collect vast amounts of struc-
tured and unstructured data from numerous
sources to improve customer understanding and
support strategic decision-making. Integrating and
connecting these diverse data sources is essential
for actionable insights, but it is a complex task.
Traditional databases and data warehouses, the
primary solutions for business intelligence, offer
structured storage and processing but come with
significant drawbacks, including high costs for on-
premises hardware, proprietary data formats, and
reliance on centralized IT departments for analysis [1].
They also struggle to scale to meet the
increasing demand for real-time data analytics and
the massive volume, velocity, and variety of big
data [1]. These limitations set the stage for alternative
solutions.
2.1. The Rise of Big Data and Hadoop
Hadoop, introduced in 2006, enabled parallel processing
and large-scale data handling through the
MapReduce model across clusters of commodity
hardware, making it an attractive option for handling
growing data volumes [2]. The Hadoop
ecosystem helped organizations handle big data
more cost-effectively, but its complex architecture,
high IT resource demands, and focus on data ac-
cumulation often resulted in data swamps, where
valuable data becomes difficult to manage. Fur-
thermore, cloud computing advancements, includ-
ing platforms like Amazon S3 (AWS S3), Azure
Data Lake Storage (ADLS), and Google Cloud
Storage (GCS), offered more flexible, cost-effective
alternatives, highlighting Hadoop’s operational and
economic limitations [3].
2.2. Transition to Open Data Architecture
Although Hadoop did not meet all expectations, it
laid the foundation for the open data ecosystem
based on openness, modularity, and diversity [4].
Today’s ecosystem benefits from cloud scalability,
query acceleration technologies, and open-source
data formats, making data accessible for business
analytics and AI/ML workloads at reduced costs.
Four main trends have simplified the adoption
of open data solutions. First, cloud data lakes,
like Amazon S3, ADLS, and GCS, enable scal-
able storage of structured and unstructured data
without on-premises hardware, supported by
decreasing cloud costs [3]. Second, open-source
data formats like Apache Parquet and Iceberg
improve data compatibility across platforms and
help organizations reduce storage costs [5].
Third, cloud-native vendors now offer managed
solutions, eliminating the need for costly hardware
investments while enabling pay-as-you-go models
for greater flexibility [3]. Finally, modern open data
tools empower users to operate at varying levels
of abstraction, allowing data analysts, scientists,
and business users to focus on insights rather than
database management.
The open data environment offers advantages
like cost-effective storage, scalability, and vendor
choice, mainly due to the separation of compute
and storage resources. Most solutions are open-source
or SaaS, easing integration and management [6].
Additionally, data democratization lets diverse
users access data with their preferred tools,
while processing engines like Spark and Dremio
provide flexibility in storing data in various formats,
even for enterprises with legacy systems [6].
2.3. Data Lakehouse Architecture
Traditional data warehouses excel at structured
data analysis but are costly and lack flexibility,
while data lakes provide low-cost, flexible storage
for raw data but often face data quality issues,
risking transformation into data swamps [7]. Data
lakehouses integrate the benefits of both models,
supporting flexible storage of raw data while
ensuring data governance, quality, and ACID
compliance for analytics and machine learning
workloads [7].
Data lakehouses also facilitate efficient data
management by supporting diverse processing
engines and query tools, allowing seamless ac-
cess across various data science, analytics, and
business intelligence applications. This flexibility
empowers organizations to unify data storage
and processing workflows under one architecture,
reducing the need for complex data movement
and enabling real-time analytics on vast datasets.
Moreover, lakehouses often integrate with cloud-
based and open-source ecosystems, making them
adaptable to evolving business requirements and
technology landscapes.
In addition to unifying data storage and process-
ing, data lakehouse architectures organize data as
it flows through distinct stages, each designed to
enhance data structure, quality, and accessibility.
This layered approach, the medallion architecture,
ensures that data ingestion, processing, and
consumption are efficiently managed from raw
input to enriched, ready-for-analysis datasets [8].
Data ingestion frameworks leverage distributed
processing and open-source tools to scale data
pipelines efficiently, drawing data from diverse
sources and formats into the lakehouse. This
process typically begins by storing data in its
native format within a foundational layer, then
progressively transforming it through structured
stages to enhance organization and quality [9, 10].
Data processing focuses on refining data qual-
ity, categorizing information, and maintaining
consistency across layers. In the Curated layer,
schema evolution and historical tracking enable
structured data management, while the Enriched
layer provides trusted, high-quality data ready for
advanced analytics [11].
A well-designed data architecture ensures
quick, versatile user access via SQL, APIs, and
other interfaces. Modern open data tools let
users in different roles, from data analysts
to data scientists, explore insights at the appropriate
level of detail.
With the foundational layers in place, the next
step is selecting modern tools to handle each data
lifecycle stage efficiently. These tools are crucial
in managing data ingestion, processing, storage,
and consumption, providing the scalability, flexi-
bility, and precision needed for an optimized data
lakehouse architecture.
2.4. Data Ingestion Tools
Apache NiFi provides a user-friendly interface for
creating data flows and is highly scalable, reliable,
and secure, suitable for large-scale data process-
ing [12, 13].
Airbyte is an open-source tool offering a straight-
forward approach to syncing data sources, known
for its connector ecosystem, simplicity, and scala-
bility, though primarily designed for Extract, Load,
and Transform (ELT) processes [14].
NiFi is the preferred choice for this project due to
its extensive community and features.
2.5. Data Processing Tools
Apache Hadoop is an open-source framework de-
signed for scalable, cost-effective data handling
across nodes, using Hadoop Distributed File Sys-
tem (HDFS) for distributed data storage [15].
Apache Spark offers faster processing than
Hadoop due to its in-memory caching, support for
multiple programming languages, and active com-
munity support, making it a more flexible solution
for data pipelines [16].
2.6. Data Object Storage
Object storage plays a crucial role in modern data
infrastructure by providing scalable, reliable, and
cost-effective solutions for handling vast amounts
of data. MinIO is an open-source tool optimized for
large-scale, unstructured data storage with an S3-
compatible interface, making it well-suited for cloud
and hybrid setups [17].
OVH offers cloud-based S3-compatible storage
for enterprise needs, emphasizing high availability
and secure data handling [18].
Wasabi provides a low-cost, cloud-based stor-
age solution without egress fees, which is ideal for
backups and data lakes. However, it has notable
API limitations that can impact high-volume data
operations [19].
While AWS was considered for its superior in-
tegration and performance, MinIO was ultimately
used to avoid additional costs during testing, mak-
ing it a practical choice for this project.
2.7. Data Catalog
Nessie provides version control for the data lake-
house, enhancing data integrity and traceability. It
also manages metadata for efficient data manipu-
lation and analysis, making it valuable in a lake-
house architecture [20].
2.8. Query Engine
Dremio and Trino offer distributed SQL engines
that query data directly from lakes. Dremio was
chosen for its high-performance analytics, caching,
and integration with Apache Iceberg, making it
more suitable than Trino for this architecture [20].
Beyond the technical infrastructure, the real
value emerges through data-driven insights. For
telecommunications, key performance indicators
(KPIs) provide critical measures for network diag-
nostics. These metrics enable a comprehensive
view of network health and performance over time.
2.9. Key Performance Indicators
For effective network diagnostics, Multivision cov-
ers KPIs related to accessibility, availability, in-
tegrity, interference, paging, retainability, traffic,
and utilization.
These KPIs provide a structured view of network
health by tracking critical metrics such as connec-
tion success rates, network availability, dropped
calls, and signal quality. Each KPI category ad-
dresses a specific performance aspect, ensuring
comprehensive monitoring that supports proactive
management. By analyzing these metrics, Mul-
tivision can identify potential issues, optimize re-
source allocation, and maintain high service quality
for its users.
These KPIs are recorded at specific intervals
and form time series data that reveal trends, pat-
terns, and fluctuations over time. Analyzing these
series provides deeper insights into the network’s
operational dynamics and performance.
3. Time Series
Forecasting is a vital task within the business sector.
It guides decisions related to production,
scheduling, and dimensioning, for example, while
offering insights for long-term strategic planning.
Time series, defined as sequences of observations
$x_t$ recorded over time [21], facilitate this
by allowing analysts to estimate future behaviour
based on historical patterns. A critical step in
time series analysis is identifying a probability
model that accurately reflects the data. In this
context, each $x_t$ is viewed as a realization of a
random variable $X_t$, making the time series itself a
sequence of these random variables.
Modelling a time series involves specifying joint
distributions for the sequence of random variables
or, when this is impractical due to the sheer number
of parameters, focusing on the expected values
$E(X_t)$ and the expected products $E(X_{t+h} X_t)$,
which together determine the covariances [21]. This process,
known as model fitting, captures the statistical
characteristics of the data, providing a structured
framework for making reliable forecasts.
3.1. ACF and PACF Functions
Models generate future values in a time series
based on prior observations. Critical concepts
include the mean, the Autocorrelation Function
(ACF), the Partial Autocorrelation Function (PACF),
and stationarity.
Many time series models, particularly ARIMA,
rely on the assumption of stationarity to function
effectively. This assumption ensures that the series
behaves consistently over time. A time series is
considered weakly stationary, or wide-sense stationary,
if its mean does not vary with time and the
covariance between two observations depends only
on the lag between them.
The ACF measures the correlation between
observations separated by a time lag, while the PACF
isolates the correlation between observations at a
specific lag after removing the effect of the
intermediate lags.
For a stationary time series $X_t$, the autocovariance
function (ACVF) at lag $h$ is defined by equation 1.
$$\gamma_X(h) = \mathrm{Cov}(X_{t+h}, X_t) \tag{1}$$
The autocorrelation function (ACF) at lag $h$,
which normalizes the autocovariance, is given
by equation 2.
$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \mathrm{Cor}(X_{t+h}, X_t) \tag{2}$$
Identifying significant lags in ACF and PACF
helps determine model parameters.
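As an illustration of equations 1 and 2, the following minimal sketch estimates the sample ACF of a short series using only the Python standard library. The function name `sample_acf` and the toy series are assumptions made for this example; the sample autocovariance divides by the series length, which is a common convention but not one stated in the text.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations rho(0), ..., rho(max_lag)."""
    n = len(series)
    mean = sum(series) / n

    def acvf(h):
        # Sample autocovariance at lag h (equation 1), dividing by n.
        return sum(
            (series[t + h] - mean) * (series[t] - mean) for t in range(n - h)
        ) / n

    gamma0 = acvf(0)
    # Equation 2: normalize each autocovariance by the lag-0 value.
    return [acvf(h) / gamma0 for h in range(max_lag + 1)]


# Hypothetical toy series that repeats every 4 steps.
toy = [1.0, 3.0, 5.0, 3.0] * 6
acf = sample_acf(toy, 4)
print([round(value, 3) for value in acf])
```

For a series with period 4, the autocorrelation is strongly positive at lag 4 and negative at lag 2, which is the kind of pattern used to spot a seasonal period.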
With a clear understanding of ACF and PACF
functions and the assumptions of stationarity, the
next step is to evaluate how well a chosen model
captures the underlying patterns within a time se-
ries. Model evaluation is essential to ensure that
the model is a good fit for the observed data and
capable of making reliable forecasts. This step in-
volves assessing model performance, which allows
for adjustments and improvements in predictive ac-
curacy.
3.2. Model Evaluation
In model evaluation, the goal is to assess the effec-
tiveness and reliability of a chosen model by com-
paring predicted values with observed values [22].
Forecast errors $e_t$ are a straightforward evaluation
measure, calculated as the difference between actual
and forecasted values. Key error metrics include
the Mean Absolute Error (MAE) and Root Mean
Squared Error (RMSE), defined in equations 3 and 4.
$$\mathrm{MAE} = \mathrm{mean}(|e_t|) \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\mathrm{mean}(e_t^2)} \tag{4}$$
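Equations 3 and 4 translate directly into code. The sketch below is illustrative only: the function names and the four sample values are assumptions, not data from this work.

```python
import math


def mae(actual, forecast):
    """Mean Absolute Error (equation 3)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)


def rmse(actual, forecast):
    """Root Mean Squared Error (equation 4)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical values; the forecast errors are -1, 0, 2, 0.
actual = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 12.0, 16.0]
print(mae(actual, forecast), rmse(actual, forecast))
```

Because RMSE squares the errors before averaging, the single error of 2 weighs more heavily in RMSE than in MAE.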
The Akaike Information Criterion (AIC) and its
corrected version, AICc, allow for comparison
between models by penalizing complexity to avoid
overfitting, with the lowest AICc value indicating
the preferred model [23].
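The small-sample correction can be sketched with the standard formula AICc = AIC + 2k(k+1)/(n-k-1), where k is the number of estimated parameters and n the number of observations. The candidate names, AIC values, and parameter counts below are hypothetical and serve only to show how the penalty can change the ranking.

```python
def aicc(aic, n_params, n_obs):
    """Corrected AIC: AIC + 2k(k+1) / (n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("n_obs must exceed n_params + 1")
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


# Hypothetical candidates: name -> (AIC, number of parameters).
candidates = {"model_a": (100.0, 3), "model_b": (99.0, 8)}
n_obs = 30

scores = {name: aicc(a, k, n_obs) for name, (a, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Here the second candidate has the lower AIC, but its larger parameter count gives it the higher AICc on a short series, so the simpler candidate is preferred.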
Lastly, data is often divided into training and test
sets to validate model generalizability, commonly
with 20% held out for testing. This split helps
reveal issues like overfitting, where a model fits
well on training data but underperforms on unseen
data, thus providing a clearer indication of model
performance in real-world scenarios [24].
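A hold-out split of this kind can be sketched as follows. The text does not state how the split is drawn; this example assumes a chronological split, which is the usual choice for time series because it keeps future observations out of the training set. The function name and the synthetic series are illustrative.

```python
def chronological_split(series, test_fraction=0.2):
    """Keep the most recent observations as the hold-out set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(series) * test_fraction))
    return series[:-n_test], series[-n_test:]


# Hypothetical series of 100 hourly observations.
hourly = list(range(100))
train, holdout = chronological_split(hourly)
print(len(train), len(holdout))
```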
With the model evaluation framework estab-
lished, the next step is to apply specific forecasting
models to capture and predict the patterns within
the data. Among the most widely used methods
are the ARIMA and Holt-Winters models, which of-
fer unique strengths for handling time series data.
These models provide structured approaches for
addressing seasonality, trends, and dependencies
within observations, making them particularly ef-
fective for generating reliable forecasts in dynamic
time series contexts.
3.3. ARIMA
The ARIMA (Auto Regressive Integrated Moving
Average) model is a widely used framework in time
series forecasting, particularly effective for cap-
turing autocorrelations within data. ARIMA builds
on the simpler ARMA model, which combines
two core components: Auto-Regressive (AR),
predicting future values based on past values,
and Moving Average (MA), forecasting based on
previous errors [22].
The AR model, denoted as AR(p), predicts values
as a linear combination of past observations,
as shown in equation 5.
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t \tag{5}$$
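Similarly, the MA model, denoted as MA(q), uses past errors
in its predictions, as given by equation 6.
$$y_t = c + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \cdots + \phi_q \epsilon_{t-q} \tag{6}$$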
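To make equations 5 and 6 concrete, the sketch below computes a one-step-ahead point forecast from given coefficients. The future error term is replaced by its mean of zero. The function name, the separate `ar_coefs` and `ma_coefs` lists, and the numbers are assumptions for illustration; in practice the coefficients are estimated from data.

```python
def arma_one_step(past_values, past_errors, c, ar_coefs, ma_coefs):
    """One-step-ahead point forecast combining equations 5 and 6.

    past_values and past_errors are ordered from oldest to newest.
    """
    # AR part: coefficient i multiplies the observation i + 1 steps back.
    ar_part = sum(
        coef * past_values[-1 - i] for i, coef in enumerate(ar_coefs)
    )
    # MA part: coefficient j multiplies the error j + 1 steps back.
    ma_part = sum(
        coef * past_errors[-1 - j] for j, coef in enumerate(ma_coefs)
    )
    return c + ar_part + ma_part


# AR(2) example: 1 + 0.5 * 6 + 0.2 * 4
ar_forecast = arma_one_step([4.0, 6.0], [], 1.0, [0.5, 0.2], [])
# MA(1) example: 1 + 0.4 * 0.5
ma_forecast = arma_one_step([], [0.5], 1.0, [], [0.4])
print(ar_forecast, ma_forecast)
```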
In ARIMA models, the ”I” (Integrated) compo-
nent manages differencing to achieve stationarity,
as most real-world time series are non-stationary
and may exhibit trends or seasonality.
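Differencing can be sketched in a few lines. With a lag of 1 it removes a linear trend, and with a lag equal to the season length it removes a stable seasonal pattern. The function name and both toy series are illustrative.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is lag items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A linear trend becomes constant after first differencing.
trend = [2 * t + 5 for t in range(8)]
print(difference(trend))            # [2, 2, 2, 2, 2, 2, 2]

# A repeating pattern of length 3 vanishes after seasonal differencing.
seasonal = [10, 20, 30] * 3
print(difference(seasonal, lag=3))  # [0, 0, 0, 0, 0, 0]
```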
For seasonal data, ARIMA models can incorporate
a seasonal component, indicated as
ARIMA$(p, d, q)(P, D, Q)_m$, where $m$ represents the
season length (e.g., $m = 24$ for hourly data with a daily cycle).
Selecting appropriate values for $p$
and $q$ can be achieved using the AICc,
which identifies the preferred fit by evaluating different
combinations of these parameters and choosing the one
that minimizes information loss [22]. This approach ensures
an effective model fit but may involve high computational
costs due to the extensive fitting process.
For stationarity assessment, statistical tests like
the Augmented Dickey-Fuller (ADF) and KPSS tests can
be used [25, 26]. If necessary, transformations
such as logarithms and square roots can stabilize the
variance. This thorough process of ensuring stationarity,
identifying seasonal patterns, and selecting
model parameters enables ARIMA to provide reliable
forecasts across varied time series data.
3.4. Holt-Winters
The Holt-Winters method, also known as triple ex-
ponential smoothing, is a popular technique for
forecasting time series data that displays seasonal-
ity and trends. It decomposes the series into three
core components: level, trend, and seasonal ef-
fects, and applies exponential smoothing to cap-
ture the temporal dynamics in each. Exponen-
tial smoothing relies on an exponentially weighted
moving average (EWMA), providing a smoothed
representation of the time series by giving more
weight to recent observations [27].
The Holt-Winters approach includes specific
smoothing equations for each component, controlled
by parameters $(\alpha, \beta, \gamma)$ representing the
smoothing factors for level, trend, and seasonality.
There are two versions of this method: the additive
and multiplicative models. The additive method is
best suited for series with constant seasonal vari-
ation, while the multiplicative method applies when
seasonal effects scale with the level of the series.
In the additive approach, seasonal fluctuations are
added to or subtracted from the trend and level,
while in the multiplicative approach, they are ex-
pressed as a proportion, adjusting the series by di-
vision [22].
Choosing between these methods depends on
the nature of the data’s seasonality, allowing flexi-
bility in modelling a wide range of time series pat-
terns.
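The additive variant can be sketched with the standard level, trend, and seasonal update equations. This is a minimal illustration, not the implementation used in this work: the initial states are taken naively from the first two seasons (an assumption), the smoothing parameters are fixed by hand rather than optimized, and all names are hypothetical.

```python
def holt_winters_additive(series, m, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters."""
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons")
    # Naive initial states from the first two seasons (assumption).
    first = sum(series[:m]) / m
    second = sum(series[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    seasonals = [series[i] - first for i in range(m)]

    for t in range(m, len(series)):
        y = series[t]
        s_prev = seasonals[t % m]
        prev_level, prev_trend = level, trend
        # Level: deseasonalized observation vs. previous level plus trend.
        level = alpha * (y - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: change in level vs. previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonal: detrended observation vs. the same slot one season ago.
        seasonals[t % m] = (
            gamma * (y - prev_level - prev_trend) + (1 - gamma) * s_prev
        )

    n = len(series)
    return [
        level + h * trend + seasonals[(n - 1 + h) % m]
        for h in range(1, horizon + 1)
    ]


# Hypothetical series: constant level 10 plus a repeating pattern of length 4.
pattern = [-2.0, 0.0, 3.0, -1.0]
series = [10.0 + pattern[t % 4] for t in range(12)]
forecast = holt_winters_additive(series, 4, 0.3, 0.1, 0.2, 4)
print([round(value, 6) for value in forecast])
```

On this noise-free series with constant seasonal variation, the additive form reproduces the next season, which is the situation it is best suited for. A multiplicative version would divide by the seasonal term instead of subtracting it.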
4. Implementation
4.1. Company Data
This project processes two data types within the
pipeline: Performance Management (PM) and
Configuration Management (CM) data. PM data,
composed of KPIs and metrics, monitors network
performance, including signal quality and data
transfer rates, enabling proactive issue detection
and quality maintenance. CM data, encompassing
settings like antenna orientation and power levels,
ensures optimal network configuration. These
data types support effective network monitoring,
traffic handling, and coverage enhancement, fos-
tering a robust and efficient network management
approach.
4.2. Data Organization
Multivision’s data is structured across multiple
hierarchical layers based on vendor, technology,
and aggregation intervals, enhancing data ac-
cessibility and processing efficiency. Vendors
are categorized into Nokia, Ericsson, ZTE, and
Huawei, with 2G, 3G, 4G, and 5G technology
layers. Each vendor-technology combination (e.g.,
Ericsson 4G) is segmented by time intervals of 5,
15, or 60 minutes, with CM data typically stored
at 60-minute intervals due to infrequent changes.
This systematic structure supports performance
monitoring and efficient reporting across various
network configurations.
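The hierarchy described above can be pictured as a storage prefix built from vendor, technology, data type, and aggregation interval. The exact layout used by Multivision is not given in the text, so the path scheme and the names below are purely hypothetical.

```python
VENDORS = {"nokia", "ericsson", "zte", "huawei"}
TECHNOLOGIES = {"2g", "3g", "4g", "5g"}
INTERVALS_MIN = {5, 15, 60}
DATA_TYPES = {"pm", "cm"}


def partition_prefix(vendor, technology, interval_min, data_type):
    """Build a hypothetical storage prefix for one vendor-technology layer."""
    vendor = vendor.lower()
    technology = technology.lower()
    data_type = data_type.lower()
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor}")
    if technology not in TECHNOLOGIES:
        raise ValueError(f"unknown technology: {technology}")
    if interval_min not in INTERVALS_MIN:
        raise ValueError(f"unsupported interval: {interval_min}")
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type}")
    return f"{vendor}/{technology}/{data_type}/{interval_min}min/"


print(partition_prefix("Ericsson", "4G", 15, "PM"))
```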
Customer data is transferred via SFTP (Secure
File Transfer Protocol), ensuring the secure handling
of large files. Collected from antennas, this
data is saved in TSV (Tab-Separated Values) files,
a structured text format in which records are separated
by newlines and values within each record are
separated by tab characters.
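Reading such a file needs nothing beyond Python's standard `csv` module with a tab delimiter. The column names and values in the sample below are invented for illustration and do not reflect the real counter names.

```python
import csv
import io

# Hypothetical TSV content: a header line followed by two records.
sample = (
    "cell_id\ttimestamp\tdl_traffic\n"
    "C001\t2024-03-01T00:00\t12.5\n"
    "C002\t2024-03-01T00:00\t7.25\n"
)


def read_tsv(text):
    """Parse TSV text into a list of dictionaries keyed by the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(sample)
print(rows[0]["cell_id"], float(rows[0]["dl_traffic"]))
```

All values are read as strings, so numeric counters need an explicit conversion, which is one reason a typed table format is used in the later zones.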
4.3. Data Lakehouse Zones
The data lakehouse architecture consists of four
zones, each providing distinct functions designed
to manage data efficiently through various pro-
cessing stages.
Data first enters the Landing Zone, preserved
in its original TSV format, ensuring data integrity.
From there, it moves to the Raw Zone, where it
is converted into Apache Iceberg tables. Iceberg is a
table format whose data files are stored here as
columnar Parquet, which reduces data size and
boosts query performance, while Iceberg itself
adds schema evolution and version control.
In the Curated Zone, data undergoes thorough
cleaning, validation, and transformation to conform
to consistent schema standards, setting it up for re-
liable analysis. When data reaches the Enriched
Zone, it is fully processed and ready for advanced
analytical tasks, with aggregations and transforma-
tions complete for streamlined integration into ap-
plications and reports.
Figure 1 illustrates the data lakehouse zones
used for this work.
Figure 1: Data Lakehouse Zones.
Query engines like Dremio optimize access
across all these zones, while Nessie serves as
a catalog, tracking data versions and schema
changes, making data management seamless and
efficient.
4.4. Proposed Architecture
The proposed data lakehouse architecture is de-
signed to support efficient data management
through clearly defined data zones that transform
raw TSV files into enriched, query-ready datasets
for advanced analytics.
Data ingestion starts by uploading structured
TSV files into the landing zone, preserving raw
data in its original form. From there, data is
stored in Apache Iceberg format within the raw
zone. Iceberg’s advanced table format supports
schema evolution and optimized data querying,
significantly reducing storage size and increasing
processing efficiency. After initial storage, data
undergoes cleaning and validation in the curated
zone, which prepares it for reliable analysis by en-
forcing a schema. Data is further processed and
aggregated in the enriched zone, making it ready
for complex analytics and applications.
The architecture operates within a locally deployed
MinIO environment for this project, given
that the dataset is a fraction of the total data volume.
Alternatives such as OVH and Wasabi
were tested, but their performance did not meet the
project’s specifications; AWS remains the preferred
option for future large-scale deployments. The architecture
also integrates Dremio as a query engine,
optimizing data access across zones, and
Nessie as a data catalog for tracking data versions
and maintaining schema consistency.
The architecture’s flexibility is enhanced by
Docker containerization, which isolates services
from ingestion to query, promoting seamless scal-
ing and ease of deployment. This modular ap-
proach allows each service to be modified without
impacting the entire system.
Figure 2 presents the overall design of the archi-
tecture, which is explained in detail in subsequent
sections on data ingestion, processing, and enrich-
ment steps.
Figure 2: Proposed Data Architecture.
4.5. Data Ingestion
This project employs Apache NiFi to manage data
ingestion from Multivision’s repository into MinIO,
the object storage solution. NiFi retrieves and organizes
data from Ericsson’s 2G and 4G networks,
stored in 15-minute intervals.
Separate NiFi processor groups were configured
to handle performance management (PM)
and configuration management (CM) data.
These processors handle tasks
such as identifying relevant files, filtering data by
attributes (e.g., date or type), retrieving file contents,
and, finally, storing them in MinIO’s landing
zone. This workflow ensures that the data is maintained
in its original TSV format, ready for further
processing and analysis in subsequent stages.
4.6. Data Processing
Landing to Raw
The data processing phase transitions data from
the landing to the raw zone by reading TSV files
and converting them into Iceberg format using
Spark and JupyterHub. This transformation is
essential for storage efficiency and data querying
speed. For instance, the landing zone, containing
data in TSV format, occupies 45.3 GB, whereas the
converted raw data in Iceberg format only requires
4.4 GB. This significant reduction highlights Iceberg’s
advantage: by storing data in compressed,
columnar Parquet files, Iceberg reduces storage
costs and enhances query performance.
Once converted, the data is stored in the raw
zone in MinIO. Additionally, Nessie is employed as
a data catalog to maintain metadata for these tables,
streamlining schema management and supporting
future queries and analysis with enhanced
data governance and reliability.
Raw to Curated
The process from the raw to the curated zone
involves reading Iceberg files, applying essential
cleaning and transformations, and adding schema
information before storing the data in the curated
bucket.
The code first loads each dataset’s table
structure and verifies schema alignment by matching
columns against the schema definitions. If columns are
missing, they are filled with NULL values to maintain
consistency with the expected schema. Once
all transformations are complete, the data is written
to the curated bucket, ready for analysis with
schema compliance and data quality assured.
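The schema-alignment step can be illustrated on plain Python dictionaries. This is a simplified stand-in for the Spark code used in the project, which is not shown in the text; the schema, column names, and rows are hypothetical, and dropping columns that are absent from the schema is an assumption of this sketch.

```python
def align_to_schema(rows, schema):
    """Return rows restricted to the schema columns, in schema order.

    Columns missing from a row are filled with None (NULL).
    Columns not listed in the schema are dropped (assumption).
    """
    return [{column: row.get(column) for column in schema} for row in rows]


# Hypothetical expected schema and two incoming rows.
schema = ["cell_id", "timestamp", "gdt", "edt", "bt"]
rows = [
    {"cell_id": "C001", "timestamp": "2024-03-01T00:00", "gdt": 1.0, "edt": 0.5},
    {"cell_id": "C002", "timestamp": "2024-03-01T00:00", "gdt": 2.0, "extra": "x"},
]

curated = align_to_schema(rows, schema)
print(curated[0])
```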
Curated to Enriched
Transitioning from the curated to enriched phase
is case-specific, and, in this project, downlink traf-
fic forecasting was selected to showcase the ar-
chitecture’s capabilities. The downlink traffic KPI,
essential for analyzing data transferred from net-
work sources to users, was calculated using a set
of counters: general (GDT), enhanced (EDT), and
background traffic (BT). General traffic refers to
standard data, enhanced metrics prioritize high-
demand services, and background traffic relates to
lower-priority tasks like updates. The total downlink
traffic (TDT) is given by equation 7.
$$\mathrm{TDT} = \mathrm{GDT} + \mathrm{EDT} + \mathrm{BT} \tag{7}$$
Data transformations and aggregation were per-
formed in Dremio using SQL queries, consolidating
15-minute interval data into hourly intervals. The
enriched dataset, now accessible for analytical and
operational applications, enables efficient integra-
tion across various system components.
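The calculation in equation 7 and the roll-up from 15-minute to hourly intervals can be sketched in plain Python. In the project this step was done with SQL in Dremio; the code below is only an equivalent illustration. The record layout and values are hypothetical, and summing the four quarter-hour values of each hour is an assumption that suits a traffic volume.

```python
from collections import defaultdict
from datetime import datetime


def hourly_total_downlink(records):
    """Aggregate (timestamp, gdt, edt, bt) records into hourly TDT sums."""
    totals = defaultdict(float)
    for timestamp, gdt, edt, bt in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += gdt + edt + bt  # equation 7
    return dict(sorted(totals.items()))


# Hypothetical 15-minute records for a single cell.
records = [
    (datetime(2024, 3, 1, 0, 0), 1.0, 0.5, 0.25),
    (datetime(2024, 3, 1, 0, 15), 1.0, 0.5, 0.25),
    (datetime(2024, 3, 1, 0, 30), 2.0, 0.0, 0.0),
    (datetime(2024, 3, 1, 0, 45), 0.5, 0.5, 0.0),
    (datetime(2024, 3, 1, 1, 0), 3.0, 1.0, 1.0),
]

hourly = hourly_total_downlink(records)
for hour, tdt in hourly.items():
    print(hour.isoformat(), tdt)
```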
Due to substantial missing data in the 4G dataset
(around 80%), the study pivoted to 2G data for
building models, drawing on data from March and
April to strengthen model reliability.
5. Results and Application
5.1. Results
The new data architecture has significantly improved
processing efficiency and reduced storage costs.
Previously, computing the same results took
approximately 15 minutes. This extended processing
time was largely due to the limitations of
AWS Lambda, which, while allowing for server-
less, event-triggered code execution, imposes
constraints on concurrent executions. These
restrictions reduced the system’s ability to handle
high volumes of data simultaneously. Furthermore,
the previous process loaded all counters in a single
operation, creating a bottleneck that strained the
database and slowed down processing efficiency.
The new system, powered by Iceberg and
Dremio, computes these values in 21.17 seconds,
as shown in Table 1.
Table 1: Dremio Query Report.
Summary
Duration   21.17 s
Input      9.81 GB / 138M rows
Output     113.34 KB / 1.4K rows
This setup not only allows for faster data processing
but also compresses data by a factor of about
10, leading to significantly lower storage costs.
Storing the data as Iceberg tables reduced its size from
45.3 GB in TSV format to 4.4 GB (Raw Stage),
demonstrating high storage efficiency.
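The headline figures can be checked with simple arithmetic. The sketch below uses only the numbers reported in this section and treats the earlier processing time as exactly 15 minutes, although the text gives it as approximate.

```python
tsv_gb = 45.3        # landing zone size in TSV format
iceberg_gb = 4.4     # raw zone size in Iceberg format
old_seconds = 15 * 60    # previous processing time (approximate)
new_seconds = 21.17      # Dremio query duration from Table 1

compression_ratio = tsv_gb / iceberg_gb
space_saved_pct = 100 * (1 - iceberg_gb / tsv_gb)
speedup = old_seconds / new_seconds

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"space saved: {space_saved_pct:.1f}%")
print(f"speedup: {speedup:.1f}x")
```

Under these assumptions the storage footprint shrinks by roughly a factor of 10 and the query runs about 40 times faster than the earlier process.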
Adopting the Iceberg format ensures data in-
tegrity and efficient resource usage, enabling sub-
stantial cost savings while maintaining technical
performance. These improvements justify the tran-
sition to the new architecture, which optimizes pro-
cessing time and financial overhead.
5.2. Application
At the final stage of the architecture, the data
becomes accessible for export to the Simplifyd
platform or for direct analysis by data scientists
and analysts. An example application is network
traffic analysis, which identifies periods where
capacity cells can be deactivated, shifting traffic
to coverage cells. In a large-scale scenario (e.g.,
managing 6578 cells across a country), this
approach has substantial energy-saving and environmental
benefits, potentially reducing energy
consumption by 37%, operational costs by 900 k€,
and CO2 emissions by 2556 tons annually [28].
Data preparation focused on transforming the
data imported from Dremio into a standardized format.
Traffic measurements were converted to
gigabytes (GB), ensuring unit consistency.
After transformations, data visualization high-
lighted key patterns, particularly daily usage peaks
and seasonal trends. Figure 3 shows the evolution
of downlink traffic from 1 March 2024 until 30 April
2024.
The traffic data fluctuated between 5
and 25 GB, with summary statistics indicating moderate
variance. These statistics give a first picture of the
series and inform its preparation for further
modelling.
Figure 3: Downlink Traffic Evolution.
Missing Values
Three methods were implemented to handle miss-
ing data: removing NaNs, linear interpolation, and
imputation using the ARIMA model. Removing
NaNs is a basic approach generally applied when
missing data is minimal, though it risks discarding
potentially useful information. Linear interpolation
estimates missing values using surrounding data
points, providing a smooth approximation. The
most sophisticated method, ARIMA-based imputa-
tion, uses time series patterns to predict missing
values, providing a more accurate fill that respects
traffic seasonality.
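Of the three strategies, linear interpolation is the simplest to illustrate. The following is a minimal sketch in plain Python, not the implementation used in the thesis; it assumes missing values are marked as None and leaves gaps at the very start or end of the series unfilled, since they have no neighbour on one side.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation between neighbours."""
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i - 1          # last known index before the gap (-1 if none)
        end = i
        while end < n and result[end] is None:
            end += 1           # first known index after the gap (n if none)
        if start >= 0 and end < n:
            left, right = result[start], result[end]
            span = end - start
            for j in range(i, end):
                # Point on the straight line from (start, left) to (end, right).
                result[j] = left + (right - left) * (j - start) / span
        i = end
    return result


print(interpolate_linear([1.0, None, None, 4.0]))  # [1.0, 2.0, 3.0, 4.0]
```

Because each filled value lies on a straight line between its neighbours, this method cannot reproduce a daily peak that falls entirely inside a gap, which is one reason a model-based imputation can do better on seasonal traffic.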
Seasonal ARIMA Model
Since ARIMA requires stationary data, the Augmented
Dickey-Fuller (ADF) test was used to check the dataset
for stationarity. Stationarity implies that statistical
properties, such as mean and variance, are
constant over time. With a p-value below 0.05
and an ADF statistic below the critical value,
the test rejected the null hypothesis of a unit root,
indicating that the data met the stationarity
requirement for ARIMA modelling.
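For reference, the ADF test is usually stated in terms of the regression below. This is the general textbook form; the source does not say whether a constant or a trend term was included, nor which lag order p was used, so those choices are assumptions here.

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1}
           + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad
H_0:\ \gamma = 0 \ \text{(unit root)}
\quad\text{vs.}\quad
H_1:\ \gamma < 0 \ \text{(stationary)}
```

A test statistic for \gamma that is more negative than the critical value, or equivalently a p-value below the chosen significance level, leads to rejecting H_0.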
The Seasonal ARIMA model was configured
to capture the seasonality of the hourly traffic (with
a period of 24, i.e., one day). An automated fitting
process selected the ARIMA orders that minimize the
AICc value, balancing goodness of fit against model
complexity. The model effectively accounted for
short-term and seasonal variations in downlink traffic,
enabling reliable forecasting of daily patterns.
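The selection criterion can be illustrated with the general definition of the corrected Akaike information criterion, AICc = -2 log L + 2k + 2k(k+1)/(n-k-1), where k is the number of estimated parameters and n the number of observations. The sketch below is not the automated fitting procedure used in the thesis; the candidate labels, log-likelihoods, and parameter counts are made-up values for illustration only.

```python
def aicc(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Corrected AIC: AIC plus a small-sample penalty on the parameter count."""
    denominator = n_obs - n_params - 1
    if denominator <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / denominator


# Hypothetical candidates: label -> (log-likelihood, number of parameters).
candidates = {
    "candidate A": (-100.0, 3),
    "candidate B": (-99.5, 5),
}
n_obs = 50  # hypothetical number of observations

scores = {name: aicc(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # candidate A
```

In this made-up case, candidate B fits slightly better but uses two more parameters, so the criterion prefers candidate A.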
The model’s performance was evaluated
with three approaches to address missing val-
ues—dropping NaNs, interpolation, and ARIMA
imputation—using RMSE and MAE as error
metrics (Table 2).
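The two error metrics are defined in the standard way: RMSE is the square root of the mean squared error, and MAE is the mean absolute error. A minimal sketch with made-up numbers (not the thesis data):

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error; penalizes large errors more heavily."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error; the average magnitude of the errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a large gap between the two points to a few large misses rather than uniformly small ones.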
Dropping NaNs had the highest errors, as it
discards valuable data points. Interpolation improved
accuracy, while ARIMA-based imputation
produced the lowest RMSE and MAE values. The
model made accurate test predictions using this
optimal ARIMA-based imputation, closely matching
the actual traffic data and demonstrating the
value of time series-based imputation. Figure 4
displays the predictions for the test set using the
ARIMA model to predict the missing values.
Table 2: RMSE and MAE for the different methods.
Method         RMSE   MAE
Drop NaNs      5.32   3.81
Interpolation  1.87   1.08
ARIMA model    1.65   1.01
Figure 4: Forecast Plot.
Holt-Winters
The Holt-Winters method was applied as an al-
ternative forecasting approach, using both additive
and multiplicative seasonal models. The additive
model proved more effective, especially given the
stable seasonal patterns of network traffic. Additive
seasonality worked well because the amplitude of
daily traffic fluctuations remained consistent, align-
ing with telecommunications usage patterns. The
model effectively captured the traffic dynamics, al-
though the ARIMA model showed a slightly better
fit. This method provided an additional tool for vali-
dating the seasonal trends identified by ARIMA.
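The additive variant can be sketched in a few lines of plain Python. This is a simplified illustration rather than the implementation used in the thesis: the initialization (level from the first season's mean, trend from the first two seasons) is one common choice among several, the smoothing parameters are passed in by hand instead of being optimized, and the example series is synthetic.

```python
from typing import List, Sequence


def holt_winters_additive(series: Sequence[float], period: int,
                          alpha: float, beta: float, gamma: float,
                          horizon: int) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full seasons to initialize")

    # Initial level: mean of the first season.
    level = sum(series[:period]) / period
    # Initial trend: average per-step change between the first two seasons.
    trend = (sum(series[period:2 * period]) - sum(series[:period])) / period ** 2
    # Initial seasonal offsets: deviation of each first-season point from the level.
    seasonals = [series[i] - level for i in range(period)]

    for t in range(n):
        value = series[t]
        s = seasonals[t % period]
        previous_level = level
        # Level: deseasonalized observation blended with the previous projection.
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        # Trend: latest level change blended with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
        # Seasonal offset: additive deviation from the current level.
        seasonals[t % period] = gamma * (value - level) + (1 - gamma) * s

    return [level + m * trend + seasonals[(n + m - 1) % period]
            for m in range(1, horizon + 1)]


# Synthetic series: a fixed 4-step pattern repeated three times.
pattern = [10.0, 20.0, 30.0, 20.0]
forecast = holt_winters_additive(pattern * 3, period=4,
                                 alpha=0.5, beta=0.25, gamma=0.5, horizon=4)
print(forecast)
```

On this perfectly periodic input the forecast reproduces the pattern. In the additive form the seasonal term is added to the level, which suits series whose seasonal swing stays roughly constant in absolute size; the multiplicative form scales the level instead.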
Aggregated Results
A comprehensive comparison across methods
showed that Seasonal ARIMA delivered the best
forecasting accuracy, particularly with ARIMA-
based imputation. Holt-Winters’ additive model
and ARIMA with interpolation also performed well.
Table 3 summarizes each method’s RMSE and
MAE results, confirming Seasonal ARIMA with
ARIMA imputation as the most accurate approach,
with promising applications for forecasting future
network traffic patterns.
Table 3: RMSE and MAE with ARIMA and Holt-Winters.
Method                   RMSE   MAE
ARIMA (Drop NaN)         5.32   3.81
ARIMA (Interpolation)    1.87   1.08
ARIMA (Predicting NaN)   1.65   1.01
HW (Additive)            1.93   1.46
HW (Multiplicative)      10.18  8.93
6. Conclusions
This thesis addressed Multivision’s need for a scalable
and efficient data lakehouse to support mobile
network management in the telecommunications
sector. By leveraging contemporary data engineering
tools, a robust system was developed to
overcome challenges in data ingestion, process-
ing, and accessibility. Furthermore, the integra-
tion of time series forecasting models for down-
link traffic demonstrated the lakehouse’s practical
value in optimizing network performance, enhanc-
ing decision-making, and improving operational ef-
ficiency.
6.1. Data Lakehouse Achievements
This thesis developed a data lakehouse to effi-
ciently manage both structured and unstructured
data in the telecommunications sector. The multi-
zone pipeline spanning landing, raw, curated,
and enriched data stages facilitated a stream-
lined data flow and optimized lifecycle manage-
ment. Performance improvements were notable,
with automated data aggregation queries now av-
eraging 21.17 seconds, enhancing access to crit-
ical network data and enabling faster, data-driven
decision-making. Furthermore, converting TSV
files to the Iceberg format achieved a 10:1 com-
pression ratio, significantly reducing storage costs
and showcasing the lakehouse’s efficiency in han-
dling growing data demands in telecommunica-
tions.
6.2. Time Series Achievements
The forecasting models in this thesis focused on
downlink traffic, a key KPI for telecommunications.
The ARIMA model accurately captured daily and
seasonal trends, closely aligning with observed
data, while the Holt-Winters model offered an al-
ternative but slightly less accurate option. To-
gether, these models demonstrated the architec-
ture’s capacity for reliable traffic prediction, sup-
porting proactive resource allocation and network
planning.
6.3. Future Work
Future work should consider automating the
architecture using tools like Airflow to streamline
data pipelines for production environments [29].
This automation would enable a seamless, unin-
terrupted data flow, eliminating manual tasks and
improving efficiency.
Expanding the architecture’s applications is
another potential area, such as developing ma-
chine learning models to predict the causes of
dropped calls (another KPI), which could be cate-
gorized into factors like weak signal and network
congestion. This would further demonstrate the
architecture’s versatility in predictive analytics,
offering telecom operators valuable insights for
network optimization.
Lastly, incorporating additional metrics like cell
coverage area and neighboring cell relationships
could provide a more comprehensive view of net-
work load distribution, complementing traffic pre-
diction for more efficient resource management.
For example, in a scenario with 6,578 cells, applying
the capacity-cell deactivation strategy described in
Section 5.2 to 43% of the cells could result
in approximately 37% energy savings, a €900,000
reduction in operational costs, and an annual decrease
of 2,556 tons in CO2 emissions [28]. These
enhancements would enable telecom operators to
optimize resource allocation, maximizing efficiency
and minimizing environmental impact.
References
[1] Ahmed A. Harby and Farhana Zulkernine.
From data warehouse to lakehouse: A com-
parative review. In 2022 IEEE International
Conference on Big Data (Big Data), pages
389–395, 2022.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce:
simplified data processing on large
clusters. Commun. ACM, 51(1):107–113,
January 2008.
[3] Casber Wang. What is the open data
ecosystem and why it’s here to stay.
https://sapphireventures.com/blog/what-
is-the-open-data-ecosystem-and-why-its-
here-to-stay/, 2021.
[4] Vladimir Belov and Evgeny Nikulchev. Anal-
ysis of big data storage tools for data lakes
based on apache hadoop platform. Interna-
tional Journal of Advanced Computer Science
and Applications, 12:551–557, 08 2021.
[5] Salisu Borodo, Siti Mariyam Shamsuddin, and
Shafaatunnur Hasan. Big data platforms and
techniques. Indonesian Journal of Electrical
Engineering and Computer Science, 1:191,
01 2016.
[6] Medium. Demystifying data lakes: Design, ar-
chitecture, and best practices, 2023.
[7] Dražen Oreščanin and Tomislav Hlupić. Data
lakehouse - a novel step in analytics architecture.
pages 1242–1246, 2021.
[8] Databricks. What is a medallion architecture?
https://www.databricks.com/glossary/medallion-
architecture, 2023.
[9] Jaber Alwidian, Sana Abdel Rahman, Maram
Gnaim, Fatima Al-Taharwah, et al. Big data
ingestion and preparation tools. Modern Ap-
plied Science, 14(9):12–27, 2020.
[10] Mohamed Cherradi and Anass El Haddadi.
A scalable framework for data lakes
ingestion. Procedia Computer Science,
215:809–814, 2022. 4th International Conference
on Innovative Data Communication
Technology and Application.
[11] Eman Shaikh, Iman Mohiuddin, Yasmeen Al-
ufaisan, and Irum Nahvi. Apache spark: A big
data processing engine, 2019.
[12] Apache NiFi. https://nifi.apache.org/. Ac-
cessed: 2024-10-22.
[13] Mohammad Irfan, Reena, and Jossy George.
Data ingestion - cloud based ingestion analy-
sis using nifi, 2023.
[14] Airbyte: Open Source Data Integration.
https://airbyte.com/. Accessed: 2024-10-22.
[15] Vladimir Belov and Evgeny Nikulchev. Anal-
ysis of big data storage tools for data lakes
based on apache hadoop platform. Interna-
tional Journal of Advanced Computer Science
and Applications, 12(8), 2021.
[16] Vandana Vijay, Vaibhav Sharma, Vandana
Srivastava, and Kumar Vipin. A compara-
tive study on hadoop mapreduce and apache
spark framework for big data analytics. Inter-
national Journal of Research Publication and
Reviews, 5:3228–3232, 02 2024.
[17] A. Makris, I. Kontopoulos, S. Nektarios Xyalis,
E. Psomakelis, T. Theodoropoulos, A. Varvari-
gos, and K. Tserpes. A study on the perfor-
mance of distributed storage systems in edge
computing environments, 2024.
[18] OVHcloud: Cloud Computing and Web Host-
ing Solutions. https://www.ovhcloud.com/pt/.
Accessed: 2024-10-22.
[19] Wasabi: Setup Limits and Policies.
https://docs.wasabi.com/docs/setup-limits-
and-policies. Accessed: 2024-10-22.
[20] Tomer Shiran, Jason Hughes, and Alex
Merced. Apache Iceberg: The Definitive
Guide. O’Reilly Media, 2024. Data Lakehouse
Functionality, Performance, and Scalability on
the Data Lake.
[21] Peter J. Brockwell and Richard A. Davis. Introduction
to Time Series and Forecasting. Springer,
3rd edition, 2016.
[22] Rob J Hyndman and George Athanasopou-
los. Forecasting: Principles and Practice.
OTexts, Melbourne, Australia, 3rd edition,
2021. Monash University, Australia.
[23] H. Akaike. Information theory and an exten-
sion of the maximum likelihood principle. In
2nd International Symposium on Information
Theory, 1971.
[24] S. Shalev-Shwartz and S. Ben-David. Un-
derstanding Machine Learning: From Theory
to Algorithms. Cambridge University Press,
2014.
[25] S. E. Said and D. A. Dickey. Testing for unit
roots in autoregressive-moving average mod-
els of unknown order. Biometrika, 1983.
[26] D. Kwiatkowski, P. C. Phillips, P. Schmidt, and
Y. Shin. Testing the null hypothesis of station-
arity against the alternative of a unit root: How
sure are we that economic time series have a
unit root? Journal of Econometrics, 1992.
[27] Chris Chatfield and Mohammad Yar. Holt-
winters forecasting: some practical issues.
Journal of the Royal Statistical Society Series
D: The Statistician, 37(2):129–140, 1988.
[28] Diogo Clemente, Lúcio Ferreira, Gabriela
Soares, Nuno Valente, and Pedro Sebastião.
Implementation of a cloud-based, traffic
aware and energy efficient management of
base stations’ activity. In 2018 21st International
Symposium on Wireless Personal Multimedia
Communications (WPMC), 2018.
[29] Apache Airflow. https://airflow.apache.org/.
Accessed: 2024-10-30.