A Lakehouse to support mobile networks management

Diogo Miguel Pimpão Vilela

diogo.pimpao.vilela@tecnico.ulisboa.pt

Instituto Superior Técnico, Lisboa, Portugal

November 2024

Abstract

Currently, over 4 billion people are connected to the internet, with a large portion of these users

accessing it through mobile devices. Every day, these users generate a vast amount of data collected

by companies. With advances in Data Science, it has become possible to develop methods and models

to extract valuable insights about customers and enhance strategic decision-making. However, due to

the diversity of applications, devices, and technologies, this data is dispersed across various sources and

exists in multiple formats, both structured and unstructured. Integrating and interpreting this variety of

data represents a signiﬁcant and complex challenge.

In this context, a data lakehouse was developed to support mobile networks management. This sys-

tem provides a scalable and efﬁcient solution for data ingestion, processing, and storage, optimizing

access to critical information. It can strengthen the company’s processes, especially in its main applica-

tion: a platform that automates and enhances the workﬂows of telecommunications engineering teams,

ensuring accessible and reliable data for the analysis of Key Performance Indicators (KPIs).

To demonstrate the data architecture in action, an application was developed to forecast downlink

trafﬁc on a telecommunications network. Trafﬁc forecasting allows network managers to identify areas

requiring more antennas, as well as to develop strategies to optimize network resource usage, improve

energy efﬁciency, and reduce costs. The ARIMA and Holt-Winters methods were used for forecasting,

treating trafﬁc as a time series.

Keywords: Telecommunications Management, Data Lakehouse, Time Series Forecasting

1. Introduction

The rapid expansion of the telecommunications

sector has led to substantial data generation, in-

cluding performance metrics, cell conﬁgurations,

and user activity. This data, essential for network

performance insights, is stored in diverse formats

across multiple sources, creating signiﬁcant chal-

lenges in efﬁcient ingestion, processing, and anal-

ysis.

Traditional data architectures, such as data warehouses, have been instrumental but often lack the scalability, adaptability, and cost-effectiveness that today’s telecommunications demands require [1]. The data lakehouse architecture

emerges as a solution, combining the strengths of

both data lakes and warehouses to manage struc-

tured and unstructured data while enabling ad-

vanced analytics and machine learning.

Managing and handling large volumes of

telecommunications data is essential for maintain-

ing network performance, providing optimal user

experience, and minimizing operational costs. Existing systems frequently lack the speed, flexibility, and scalability needed to cope with the growing volume and variety of data.

In this thesis, a data lakehouse tailored to the

needs of the telecommunications industry is devel-

oped, speciﬁcally designed to enhance data inges-

tion, processing, and accessibility, thereby facilitat-

ing KPI analysis crucial for effective network man-

agement. This work is conducted in collaboration

with Multivision Consulting, a company renowned

for its Software as a Service (SaaS) platform Sim-

plifyd, which empowers telecom operators by op-

timizing network management through automation

and decision support.

To demonstrate the architecture’s capabilities,

the thesis includes a use case focused on predict-

ing downlink trafﬁc, a KPI vital for user experience,

through ARIMA and Holt-Winters models. Accu-

rate trafﬁc predictions are essential for efﬁcient net-

work resource allocation, capacity management,

and cost optimization. By employing these fore-

casting models, the thesis illustrates how the archi-

tecture supports enhanced decision-making and

optimizes network performance.

2. Data Engineering

Modern companies collect vast amounts of struc-

tured and unstructured data from numerous

sources to improve customer understanding and

support strategic decision-making. Integrating and

connecting these diverse data sources is essential

for actionable insights, but this task is complex and

critical.

Traditional databases and data warehouses, the

primary solutions for business intelligence, offer

structured storage and processing but come with

signiﬁcant drawbacks, including high costs for on-

premises hardware, proprietary data formats, and

reliance on centralized IT departments for analysis [1]. They also struggle to scale to meet the increasing demand for real-time data analytics and the massive volume, velocity, and variety of big data [1]. These limitations set the stage for alternative solutions.

2.1. The Rise of Big Data and Hadoop

Hadoop, introduced in 2006, enabled parallel pro-

cessing and large-scale data handling through the

Map-Reduce model across clusters of standard

hardware, making it an attractive option for han-

dling growing data volumes [2]. The Hadoop

ecosystem helped organizations handle big data

more cost-effectively, but its complex architecture,

high IT resource demands, and focus on data ac-

cumulation often resulted in data swamps, where

valuable data becomes difﬁcult to manage. Fur-

thermore, cloud computing advancements, includ-

ing platforms like Amazon S3 (AWS S3), Azure

Data Lake Storage (ADLS), and Google Cloud

Storage (GCS), offered more ﬂexible, cost-effective

alternatives, highlighting Hadoop’s operational and

economic limitations [3].

2.2. Transition to Open Data Architecture

Although Hadoop did not meet all expectations, it

laid the foundation for the open data ecosystem

based on openness, modularity, and diversity [4].

Today’s ecosystem beneﬁts from cloud scalability,

query acceleration technologies, and open-source

data formats, making data accessible for business

analytics and AI/ML workloads at reduced costs.

Four main trends have simpliﬁed the adoption

of open data solutions. First, cloud data lakes,

like Amazon S3, ADLS, and GCS, enable scal-

able storage of structured and unstructured data

without on-premises hardware, supported by

decreasing cloud costs [3]. Second, open-source

data formats like Apache Parquet and Iceberg

improve data compatibility across platforms and

help organizations reduce storage costs [5].

Third, cloud-native vendors now offer managed

solutions, eliminating the need for costly hardware

investments while enabling pay-as-you-go models

for greater ﬂexibility [3]. Finally, modern open data

tools empower users to operate at varying levels

of abstraction, allowing data analysts, scientists,

and business users to focus on insights rather than

database management.

The open data environment offers advantages

like cost-effective storage, scalability, and vendor

choice, mainly due to the separation of compute

and storage resources. Most solutions are open-

source or SaaS, easing integration and manage-

ment [6]. Additionally, democratization enables di-

verse users to access data using preferred tools,

while processing engines like Spark and Dremio

provide ﬂexibility in storing data in various formats,

even for enterprises with legacy systems [6].

2.3. Data Lakehouse Architecture

Traditional data warehouses excel at structured

data analysis but are costly and lack ﬂexibility,

while data lakes provide low-cost, ﬂexible storage

for raw data but often face data quality issues,

risking transformation into data swamps [7]. Data

lakehouses integrate the beneﬁts of both models,

supporting ﬂexible storage of raw data while

ensuring data governance, quality, and ACID

compliance for analytics and machine learning

workloads [7].

Data lakehouses also facilitate efﬁcient data

management by supporting diverse processing

engines and query tools, allowing seamless ac-

cess across various data science, analytics, and

business intelligence applications. This ﬂexibility

empowers organizations to unify data storage

and processing workﬂows under one architecture,

reducing the need for complex data movement

and enabling real-time analytics on vast datasets.

Moreover, lakehouses often integrate with cloud-

based and open-source ecosystems, making them

adaptable to evolving business requirements and

technology landscapes.

In addition to unifying data storage and process-

ing, data lakehouse architectures organize data as

it ﬂows through distinct stages, each designed to

enhance data structure, quality, and accessibility.

This layered approach, the medallion architecture,

ensures that data ingestion, processing, and

consumption are efﬁciently managed from raw

input to enriched, ready-for-analysis datasets [8].

Data ingestion frameworks leverage distributed

processing and open-source tools to scale data

pipelines efﬁciently, drawing data from diverse

sources and formats into the lakehouse. This

process typically begins by storing data in its

native format within a foundational layer, then

progressively transforming it through structured

stages to enhance organization and quality [9, 10].

Data processing focuses on reﬁning data qual-

ity, categorizing information, and maintaining

consistency across layers. In the Curated layer,

schema evolution and historical tracking enable

structured data management, while the Enriched

layer provides trusted, high-quality data ready for

advanced analytics [11].

A well-designed data architecture ensures

quick, versatile user access via SQL, APIs, and

other interfaces. Modern open data tools support

individuals with different roles, from data analysts

to scientists, to explore insights at the appropriate

level of detail.

With the foundational layers in place, the next

step is selecting modern tools to handle each data

lifecycle stage efﬁciently. These tools are crucial

in managing data ingestion, processing, storage,

and consumption, providing the scalability, ﬂexi-

bility, and precision needed for an optimized data

lakehouse architecture.

2.4. Data Ingestion Tools

Apache NiFi provides a user-friendly interface for

creating data ﬂows and is highly scalable, reliable,

and secure, suitable for large-scale data process-

ing [12, 13].

Airbyte is an open-source tool offering a straight-

forward approach to syncing data sources, known

for its connector ecosystem, simplicity, and scala-

bility, though primarily designed for Extract, Load,

and Transform (ELT) processes [14].

NiFi is the preferred choice for this project due to

its extensive community and features.

2.5. Data Processing Tools

Apache Hadoop is an open-source framework de-

signed for scalable, cost-effective data handling

across nodes, using Hadoop Distributed File Sys-

tem (HDFS) for distributed data storage [15].

Apache Spark offers faster processing than

Hadoop due to its in-memory caching, support for

multiple programming languages, and active com-

munity support, making it a more ﬂexible solution

for data pipelines [16].

2.6. Data Object Storage

Object storage plays a crucial role in modern data

infrastructure by providing scalable, reliable, and

cost-effective solutions for handling vast amounts

of data. MinIO is an open-source tool optimized for

large-scale, unstructured data storage with an S3-

compatible interface, making it well-suited for cloud

and hybrid setups [17].

OVH offers cloud-based S3-compatible storage

for enterprise needs, emphasizing high availability

and secure data handling [18].

Wasabi provides a low-cost, cloud-based stor-

age solution without egress fees, which is ideal for

backups and data lakes. However, it has notable

API limitations that can impact high-volume data

operations [19].

While AWS was considered for its superior in-

tegration and performance, MinIO was ultimately

used to avoid additional costs during testing, mak-

ing it a practical choice for this project.

2.7. Data Catalog

Nessie provides version control for the data lake-

house, enhancing data integrity and traceability. It

also manages metadata for efﬁcient data manipu-

lation and analysis, making it valuable in a lake-

house architecture [20].

2.8. Query Engine

Dremio and Trino offer distributed SQL engines

that query data directly from lakes. Dremio was

chosen for its high-performance analytics, caching,

and integration with Apache Iceberg, making it

more suitable than Trino for this architecture [20].

Beyond the technical infrastructure, the real

value emerges through data-driven insights. For

telecommunications, key performance indicators

(KPIs) provide critical measures for network diag-

nostics. These metrics enable a comprehensive

view of network health and performance over time.

2.9. Key Performance Indicators

For effective network diagnostics, Multivision cov-

ers KPIs related to accessibility, availability, in-

tegrity, interference, paging, retainability, trafﬁc,

and utilization.

These KPIs provide a structured view of network

health by tracking critical metrics such as connec-

tion success rates, network availability, dropped

calls, and signal quality. Each KPI category ad-

dresses a speciﬁc performance aspect, ensuring

comprehensive monitoring that supports proactive

management. By analyzing these metrics, Mul-

tivision can identify potential issues, optimize re-

source allocation, and maintain high service quality

for its users.

These KPIs are recorded at speciﬁc intervals

and form time series data that reveal trends, pat-

terns, and ﬂuctuations over time. Analyzing these

series provides deeper insights into the network’s

operational dynamics and performance.

3. Time Series

Forecasting is a vital task within the business sector. It guides decisions related to production, scheduling, and dimensioning, for example, while offering insights for long-term strategic planning.

Time series, defined as sequences of observations $x_t$ recorded over time [21], facilitate this by allowing analysts to estimate future behaviour based on historical patterns. A critical step in time series analysis is identifying a probability model that accurately reflects the data. In this context, each $x_t$ is viewed as a realization of a random variable $X_t$, making the time series itself a sequence of these random variables.

Modelling a time series involves specifying joint distributions for the sequence of random variables or, when this is impractical due to the sheer number of parameters, focusing on the expected values $E(X_t)$ and the expected products $E(X_{t+h} X_t)$ [21]. This process, known as model fitting, captures the statistical characteristics of the data, providing a structured framework for making reliable forecasts.

3.1. ACF and PACF Functions

Models generate future values in a time series

based on prior observations. Critical concepts

include the mean, the Autocorrelation Function

(ACF), the Partial Autocorrelation Function (PACF),

and stationarity.

Many time series models, particularly ARIMA,

rely on the assumption of stationarity to function ef-

fectively. This assumption ensures the series’ con-

sistent behaviour over time. A time series is con-

sidered weakly stationary or wide-sense stationary

if its mean and covariance do not vary with time.

The ACF measures shared information between

observations separated by a time lag, while PACF

isolates the correlation between observations at a

speciﬁc lag, excluding intermediate lags.

For a stationary time series $X_t$, the autocovariance function (ACVF) at lag $h$ is defined by equation 1.

$$\gamma_X(h) = \mathrm{Cov}(X_{t+h}, X_t) \tag{1}$$

The autocorrelation function (ACF) at lag $h$, which normalizes the autocovariance, is given by equation 2.

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \mathrm{Cor}(X_{t+h}, X_t) \tag{2}$$

Identifying signiﬁcant lags in ACF and PACF

helps determine model parameters.
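Equations 1 and 2 can be illustrated with a sample estimate. The sketch below is a minimal pure-Python version, not the code used in this work; the function name `sample_acf`, the toy series, and the convention of dividing the sample autocovariance by n are assumptions made for illustration.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelation rho(h) = gamma(h) / gamma(0) for h = 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]

    def gamma(h):
        # Sample autocovariance at lag h (dividing by n is an assumed convention).
        return sum(dev[t + h] * dev[t] for t in range(n - h)) / n

    g0 = gamma(0)
    return [gamma(h) / g0 for h in range(max_lag + 1)]


# Toy series repeating every 4 samples (illustrative values only).
series = [1.0, 2.0, 3.0, 2.0] * 6
acf = sample_acf(series, max_lag=4)
```

For this toy series the autocorrelation is strongly positive at lag 4 (the period) and strongly negative at lag 2, which is the kind of pattern used to spot seasonal lags.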

With a clear understanding of ACF and PACF

functions and the assumptions of stationarity, the

next step is to evaluate how well a chosen model

captures the underlying patterns within a time se-

ries. Model evaluation is essential to ensure that

the model is a good ﬁt for the observed data and

capable of making reliable forecasts. This step in-

volves assessing model performance, which allows

for adjustments and improvements in predictive ac-

curacy.

3.2. Model Evaluation

In model evaluation, the goal is to assess the effec-

tiveness and reliability of a chosen model by com-

paring predicted values with observed values [22].

Forecast errors are a straightforward evaluation

metric calculated as the difference between actual

and forecasted values, $e_t$. Key error metrics include the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), defined in equations 3 and 4.

$$\mathrm{MAE} = \mathrm{mean}(|e_t|) \tag{3}$$

$$\mathrm{RMSE} = \sqrt{\mathrm{mean}(e_t^2)} \tag{4}$$
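As a small worked example of equations 3 and 4, the following sketch computes both metrics for a hypothetical set of actual and forecasted values. The function names and the numbers are illustrative assumptions.

```python
import math


def mae(actual, forecast):
    """Mean Absolute Error: mean(|e_t|), with e_t = actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)


def rmse(actual, forecast):
    """Root Mean Squared Error: sqrt(mean(e_t^2))."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical values; the errors are -1, 0, 2, 0.
actual = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 12.0, 16.0]
mae_value = mae(actual, forecast)    # 0.75
rmse_value = rmse(actual, forecast)  # sqrt(1.25)
```

Because RMSE squares the errors before averaging, the single error of 2 weighs more heavily in RMSE than in MAE.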

The Akaike Information Criterion (AIC) and its

corrected version, AICc, allow for comparison

between models by penalizing complexity to avoid

overﬁtting, with the lowest AICc value indicating

the preferred model [23].

Lastly, data is often divided into training and test

sets to validate model generalizability, commonly

with 20% held out for testing. This split helps

reveal issues like overﬁtting, where a model ﬁts

well on training data but underperforms on unseen

data, thus providing a clearer indication of model

performance in real-world scenarios [24].
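The split described above can be sketched as follows. The sketch assumes a chronological split, in which the most recent observations form the test set; this is the usual choice for time series but is an assumption here, as is the function name `chronological_split`.

```python
def chronological_split(series, test_fraction=0.2):
    """Hold out the last `test_fraction` of observations, preserving time order."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(series) * test_fraction))
    split = len(series) - n_test
    return series[:split], series[split:]


# 100 placeholder observations: 80 for training, the last 20 for testing.
observations = list(range(100))
train, test = chronological_split(observations)
```

Shuffling before splitting is avoided on purpose: it would let the model see observations that come after the ones it is asked to predict.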

With the model evaluation framework estab-

lished, the next step is to apply speciﬁc forecasting

models to capture and predict the patterns within

the data. Among the most widely used methods

are the ARIMA and Holt-Winters models, which of-

fer unique strengths for handling time series data.

These models provide structured approaches for

addressing seasonality, trends, and dependencies

within observations, making them particularly ef-

fective for generating reliable forecasts in dynamic

time series contexts.

3.3. ARIMA

The ARIMA (Auto Regressive Integrated Moving

Average) model is a widely used framework in time

series forecasting, particularly effective for cap-

turing autocorrelations within data. ARIMA builds

on the simpler ARMA model, which combines

two core components: Auto-Regressive (AR),

predicting future values based on past values,

and Moving Average (MA), forecasting based on

previous errors [22].

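The AR model, denoted as AR(p), predicts values as a linear combination of past observations, as shown in equation 5.

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t \tag{5}$$

Similarly, the MA model, denoted as MA(q), uses past errors in its predictions, as given by equation 6.

$$y_t = c + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \cdots + \phi_q \epsilon_{t-q} \tag{6}$$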

In ARIMA models, the ”I” (Integrated) compo-

nent manages differencing to achieve stationarity,

as most real-world time series are non-stationary

and may exhibit trends or seasonality.
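The effect of differencing can be seen on two toy series. The sketch below is illustrative only; the helper name `difference` and the series are assumptions. First-order differencing removes a linear trend, and differencing at the seasonal lag removes a repeating pattern.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A linear trend becomes a constant after one round of differencing.
trend = [3.0 + 2.0 * t for t in range(8)]
first_diff = difference(trend)

# A trend plus a period-4 pattern becomes a constant after lag-4 differencing.
pattern = [0, 5, 0, -5]
seasonal = [t + pattern[t % 4] for t in range(12)]
seasonal_diff = difference(seasonal, lag=4)
```

In both cases the differenced series has a constant mean, which is what the "I" component is meant to achieve before the AR and MA terms are fitted.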

For seasonal data, ARIMA models can incorporate a seasonal component, indicated as ARIMA$(p, d, q)(P, D, Q)_m$, where $m$ represents the season length (e.g., 24 hours for hourly data).

Selecting appropriate model parameters for $p$ and $q$ can be achieved using the AICc criterion,

which ﬁnds the optimal ﬁt by evaluating different

combinations of these parameters to minimize in-

formation loss [22]. This approach ensures an ef-

fective model ﬁt but may involve high computational

costs due to the extensive ﬁtting process.
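The selection step can be sketched with the general small-sample correction AICc = AIC + 2k(k+1)/(n - k - 1), where AIC = 2k - 2 ln L. In the sketch below the log-likelihood values are hypothetical placeholders rather than results from this work, and counting k = p + q + 2 (coefficients plus a constant and the error variance) is an assumption.

```python
def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k estimated parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for the AICc correction")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical log-likelihoods for three candidate (p, q) orders.
candidates = {(1, 0): -520.0, (1, 1): -512.0, (2, 2): -511.5}
n_obs = 200

scores = {
    order: aicc(ll, k=order[0] + order[1] + 2, n=n_obs)
    for order, ll in candidates.items()
}
best_order = min(scores, key=scores.get)
```

With these placeholder values the (1, 1) candidate is preferred: the (2, 2) candidate fits slightly better, but not by enough to offset the penalty for its extra parameters.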

For stationarity assessment, statistical tests like

the Augmented Dickey-Fuller (ADF) and KPSS can

be used [25] [26]. If necessary, transformations

like logarithms and square roots stabilize the vari-

ance. This thorough process of ensuring station-

arity, identifying seasonal patterns, and selecting

model parameters enables ARIMA to provide reli-

able forecasts across varied time series data.

3.4. Holt-Winters

The Holt-Winters method, also known as triple ex-

ponential smoothing, is a popular technique for

forecasting time series data that displays seasonal-

ity and trends. It decomposes the series into three

core components: level, trend, and seasonal ef-

fects, and applies exponential smoothing to cap-

ture the temporal dynamics in each. Exponen-

tial smoothing relies on an exponentially weighted

moving average (EWMA), providing a smoothed

representation of the time series by giving more

weight to recent observations [27].
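A minimal sketch of the exponentially weighted moving average is given below. Initializing the smoothed series with the first observation is a common choice but an assumption here, as are the function name and the toy values.

```python
def ewma(series, alpha):
    """s_0 = x_0; s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# A step from 10 to 20: the smoothed value approaches 20 gradually.
smoothed = ewma([10.0, 20.0, 20.0, 20.0], alpha=0.5)
# smoothed == [10.0, 15.0, 17.5, 18.75]
```

A larger alpha tracks recent observations more closely, while a smaller alpha gives a smoother but slower response.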

The Holt-Winters approach includes specific smoothing equations for each component, controlled by parameters $(\alpha, \beta, \gamma)$ representing the smoothing factors for level, trend, and seasonality.

There are two versions of this method: the additive

and multiplicative models. The additive method is

best suited for series with constant seasonal vari-

ation, while the multiplicative method applies when

seasonal effects scale with the level of the series.

In the additive approach, seasonal fluctuations are added to or subtracted from the trend and level, while in the multiplicative approach, they are expressed as a proportion of the level, so the series is seasonally adjusted by division [22].

Choosing between these methods depends on

the nature of the data’s seasonality, allowing ﬂexi-

bility in modelling a wide range of time series pat-

terns.
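To make the additive variant concrete, the sketch below implements the standard level, trend, and seasonal recursions in pure Python. It is not the implementation used in this work; the function name, the simple initialization from the first two seasons, and the toy series are assumptions made for illustration.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast for a series y with season length m."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons")
    # Simple initialization (an assumption): first-season mean as level,
    # average per-step change between the first two seasons as trend,
    # first-season deviations from the level as seasonal effects.
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
    season = [y[i] - level for i in range(m)]

    for t in range(m, len(y)):
        prev_level, prev_trend = level, trend
        s_old = season[t % m]
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_old

    n = len(y)
    # h-step-ahead forecast: level + h * trend + matching seasonal effect.
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]


# Toy series with a period of 4 and no trend (illustrative values only).
history = [10.0, 20.0, 30.0, 20.0] * 5
forecast = holt_winters_additive(history, m=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=4)
```

For this purely periodic series the forecast reproduces the next cycle. The multiplicative variant would divide by the seasonal effect where this sketch subtracts it.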

4. Implementation

4.1. Company Data

This project processes two data types within the

pipeline: Performance Management (PM) and

Conﬁguration Management (CM) data. PM data,

composed of KPIs and metrics, monitors network

performance, including signal quality and data

transfer rates, enabling proactive issue detection

and quality maintenance. CM data, encompassing

settings like antenna orientation and power levels,

ensures optimal network conﬁguration. These

data types support effective network monitoring,

trafﬁc handling, and coverage enhancement, fos-

tering a robust and efﬁcient network management

approach.

4.2. Data Organization

Multivision’s data is structured across multiple

hierarchical layers based on vendor, technology,

and aggregation intervals, enhancing data ac-

cessibility and processing efﬁciency. Vendors

are categorized into Nokia, Ericsson, ZTE, and

Huawei, with 2G, 3G, 4G, and 5G technology

layers. Each vendor-technology combination (e.g.,

Ericsson 4G) is segmented by time intervals of 5,

15, or 60 minutes, with CM data typically stored

at 60-minute intervals due to infrequent changes.

This systematic structure supports performance

monitoring and efﬁcient reporting across various

network conﬁgurations.

Customer data is transferred via SFTP (Secure

File Transfer Protocol), ensuring the secure han-

dling of large ﬁles. Collected from antennas, this

data is saved in TSV (Tab-Separated Values) Files,

a structured text format in which records are sepa-

rated by newlines and tab characters separate val-

ues within each record.
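As an illustration of the TSV layout, the sketch below parses a small in-memory sample with Python's standard `csv` module. The column names and values are hypothetical, and the presence of a header row is an assumption.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample: one header line, then one record per line.
sample = (
    "cell_id\ttimestamp\tcounter_a\n"
    "C001\t2024-03-01T00:00\t12.5\n"
    "C002\t2024-03-01T00:00\t7.0\n"
)
rows = read_tsv(sample)
```

All values are read as strings, so any type conversion has to happen in a later processing stage.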

4.3. Data Lakehouse Zones

The data lakehouse architecture consists of four

zones, each providing distinct functions designed

to manage data efﬁciently through various pro-

cessing stages.

Data ﬁrst enters the Landing Zone, preserved

in its original TSV format, ensuring data integrity.

From there, it moves to the Raw Zone, where it is converted into Iceberg tables. Iceberg stores the data in compressed, columnar Parquet files, which reduces data size and boosts query performance, and its table format additionally supports schema evolution and version control.

In the Curated Zone, data undergoes thorough

cleaning, validation, and transformation to conform

to consistent schema standards, setting it up for re-

liable analysis. When data reaches the Enriched

Zone, it is fully processed and ready for advanced

analytical tasks, with aggregations and transforma-

tions complete for streamlined integration into ap-

plications and reports.

Figure 1 illustrates the data lakehouse zones

used for this work.

Figure 1: Data Lakehouse Zones.

Query engines like Dremio optimize access

across all these zones, while Nessie serves as

a catalog, tracking data versions and schema

changes, making data management seamless and

efﬁcient.

4.4. Proposed Architecture

The proposed data lakehouse architecture is de-

signed to support efﬁcient data management

through clearly deﬁned data zones that transform

raw TSV ﬁles into enriched, query-ready datasets

for advanced analytics.

Data ingestion starts by uploading structured

TSV ﬁles into the landing zone, preserving raw

data in its original form. From there, data is

stored in Apache Iceberg format within the raw

zone. Iceberg’s advanced table format supports

schema evolution and optimized data querying,

signiﬁcantly reducing storage size and increasing

processing efﬁciency. After initial storage, data

undergoes cleaning and validation in the curated

zone, which prepares it for reliable analysis by en-

forcing a schema. Data is further processed and

aggregated in the enriched zone, making it ready

for complex analytics and applications.

The architecture operates within a locally deployed MinIO environment for this project, given that the dataset is a fraction of the total data volume. Alternatives such as OVH and Wasabi were tested, but their performance did not meet the project’s specifications; AWS remains the preferred option for future large-scale deployments. The architecture also integrates Dremio as a query engine, optimizing data access across zones, and Nessie as a data catalog for tracking data versions and maintaining schema consistency.

The architecture’s ﬂexibility is enhanced by

Docker containerization, which isolates services

from ingestion to query, promoting seamless scal-

ing and ease of deployment. This modular ap-

proach allows each service to be modiﬁed without

impacting the entire system.

Figure 2 presents the overall design of the archi-

tecture, which is explained in detail in subsequent

sections on data ingestion, processing, and enrich-

ment steps.

Figure 2: Proposed Data Architecture.

4.5. Data Ingestion

This project employs Apache NiFi to manage data

ingestion from Multivision’s repository into MinIO,

the object storage solution. NiFi retrieves and or-

ganizes data from Ericsson’s 2G and 4G networks,

stored in 15-minute intervals.

Separate NiFi processor groups were conﬁg-

ured to handle performance management (PM)

and conﬁguration management (CM) data require-

ments. These processors efﬁciently manage tasks

such as identifying relevant ﬁles, ﬁltering data by

attributes (e.g., date or type), retrieving data con-

tents, and finally, storing them in MinIO’s landing

zone. This workﬂow ensures that the data is main-

tained in its original TSV format, ready for further

processing and analysis in subsequent stages.

4.6. Data Processing

Landing to Raw

The data processing phase transitions data from

the landing to the raw zone by reading TSV ﬁles

and converting them into Iceberg format using

Spark and JupyterHub. This transformation is

essential for storage efﬁciency and data querying

speed. For instance, the landing zone, containing data in TSV format, occupies 45.3 GB, whereas the converted raw data in Iceberg format requires only 4.4 GB. This significant reduction highlights Iceberg’s advantage: by storing data in compressed, columnar Parquet files, Iceberg reduces storage costs and enhances query performance.

Once converted, the data is stored in the raw

zone in MinIO. Additionally, Nessie is employed as

a data catalog to maintain metadata for these ta-

bles, streamlining schema management and sup-

porting future queries and analysis with enhanced

data governance and reliability.

Raw to Curated

The process from the raw to the curated zone

involves reading Iceberg ﬁles, applying essential

cleaning and transformations, and adding schema

information before storing the data in the curated

bucket.

The code used ﬁrst loads each dataset’s table

structure, verifying schema alignment by matching

columns with schema deﬁnitions. If columns are

missing, they are ﬁlled with NULL values to main-

tain consistency with the expected schema. Once

all transformations are complete, the data is written

in the curated bucket, ready for analysis with reli-

able schema compliance and data quality assured.
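The schema alignment step can be sketched on plain Python dictionaries; the actual pipeline operates on Iceberg tables, so this is only an illustration of the logic. The function name, the column names, and the choice to drop columns that are not in the schema are assumptions.

```python
def align_to_schema(rows, schema):
    """Keep the schema's columns in order; columns missing from a row become None (NULL)."""
    return [{col: row.get(col) for col in schema} for row in rows]


# Hypothetical schema and a raw record that lacks `counter_b`.
schema = ["cell_id", "timestamp", "counter_a", "counter_b"]
raw_rows = [
    {"cell_id": "C001", "timestamp": "2024-03-01T00:00", "counter_a": 12.5},
]
curated_rows = align_to_schema(raw_rows, schema)
```

After alignment every record exposes the same columns in the same order, so downstream queries do not need to handle vendor-specific gaps.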

Curated to Enriched

Transitioning from the curated to enriched phase

is case-speciﬁc, and, in this project, downlink traf-

ﬁc forecasting was selected to showcase the ar-

chitecture’s capabilities. The downlink trafﬁc KPI,

essential for analyzing data transferred from net-

work sources to users, was calculated using a set

of counters: general (GDT), enhanced (EDT), and

background trafﬁc (BT). General trafﬁc refers to

standard data, enhanced metrics prioritize high-

demand services, and background trafﬁc relates to

lower-priority tasks like updates. The total downlink

trafﬁc (TDT) is given by equation 7.

$$TDT = GDT + EDT + BT \tag{7}$$

Data transformations and aggregation were per-

formed in Dremio using SQL queries, consolidating

15-minute interval data into hourly intervals. The

enriched dataset, now accessible for analytical and

operational applications, enables efﬁcient integra-

tion across various system components.
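The aggregation was performed with SQL in Dremio; the sketch below only illustrates the same logic in Python. It assumes that the hourly value is the sum of the 15-minute totals, and the record layout and counter values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_downlink_traffic(records):
    """Sum TDT = GDT + EDT + BT (equation 7) over each hour.

    `records` holds (ISO timestamp, gdt, edt, bt) tuples at 15-minute intervals.
    """
    totals = defaultdict(float)
    for ts, gdt, edt, bt in records:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        totals[hour] += gdt + edt + bt
    return dict(sorted(totals.items()))


# Hypothetical counters: four intervals in hour 00 and one in hour 01.
records = [
    ("2024-03-01T00:00", 1.0, 0.5, 0.5),
    ("2024-03-01T00:15", 2.0, 0.0, 0.0),
    ("2024-03-01T00:30", 1.0, 1.0, 0.0),
    ("2024-03-01T00:45", 0.0, 0.0, 2.0),
    ("2024-03-01T01:00", 3.0, 0.0, 1.0),
]
hourly = hourly_downlink_traffic(records)
```

Truncating each timestamp to the start of its hour plays the role of the grouping key in the SQL version.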

Due to substantial missing data in the 4G dataset

(around 80%), the study pivoted to 2G data for

building models, drawing on data from March and

April to strengthen model reliability.

5. Results and Application

5.1. Results

The new data architecture has significantly improved processing efficiency and reduced storage costs.

Previously, achieving the same results took

approximately 15 minutes. This extended pro-

cessing time was largely due to the limitations of

AWS Lambda, which, while allowing for server-

less, event-triggered code execution, imposes

constraints on concurrent executions. These

restrictions reduced the system’s ability to handle

high volumes of data simultaneously. Furthermore,

the previous process loaded all counters in a single

operation, creating a bottleneck that strained the

database and slowed down processing efﬁciency.

The new system, powered by Iceberg and

Dremio, computes these values in 21.17 seconds,

as shown in Table 1.

Table 1: Dremio Query Report (summary).

Duration: 21.17 s

Input: 9.81 GB / 138M rows

Output: 113.34 KB / 1.4K rows

This setup not only allows for faster data processing but also compresses data by a factor of roughly 10, leading to significantly lower storage costs. Iceberg’s compression has reduced file sizes from 45.3 GB in TSV format to 4.4 GB (Raw Stage), demonstrating high storage efficiency.
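The reported figures can be checked with simple arithmetic. The sketch below uses only the numbers stated in this section; since the previous processing time is given as approximately 15 minutes, the resulting speed-up is an estimate.

```python
# Storage: landing zone (TSV) versus raw zone (Iceberg), in gigabytes.
tsv_size = 45.3
iceberg_size = 4.4
compression_ratio = tsv_size / iceberg_size  # roughly 10.3

# Processing time: about 15 minutes before versus 21.17 seconds now.
previous_seconds = 15 * 60
new_seconds = 21.17
speedup = previous_seconds / new_seconds  # roughly 42.5
```

Under these figures the raw zone needs about a tenth of the landing zone's storage, and the query runs roughly 40 times faster than the previous process.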

Adopting the Iceberg format ensures data in-

tegrity and efﬁcient resource usage, enabling sub-

stantial cost savings while maintaining technical

performance. These improvements justify the transition to the new architecture, which reduces both processing time and financial overhead.

5.2. Application

At the ﬁnal stage of the architecture, the data

becomes accessible for export to the Simplifyd

platform or for direct analysis by data scientists

and analysts. An example application is network

trafﬁc analysis, which identiﬁes periods where

capacity cells can be deactivated, shifting trafﬁc

to coverage cells. In a large-scale scenario (e.g.,

managing 6578 cells across a country), this

approach has substantial energy-saving and environmental benefits, potentially reducing energy consumption by 37%, operational costs by 900 k€, and CO2 emissions by 2556 tons annually [28].

Data preparation focused on transforming the data imported from Dremio into a standardized format. Traffic measurements were converted to gigabytes (GB) to ensure unit consistency.

After the transformations, data visualization highlighted key patterns, particularly daily usage peaks and seasonal trends. Figure 3 shows the evolution of downlink traffic from 1 March 2024 to 30 April 2024.

The traffic fluctuated between 5 and 25 GB, with moderate variance. These summary statistics give a first picture of the data and inform the modelling that follows.

Figure 3: Downlink Trafﬁc Evolution.

Missing Values

Three methods were implemented to handle missing data: removing NaNs, linear interpolation, and imputation using the ARIMA model. Removing NaNs is a basic approach generally applied when missing data is minimal, though it risks discarding potentially useful information. Linear interpolation estimates missing values from the surrounding data points, providing a smooth approximation. The most sophisticated method, ARIMA-based imputation, uses time series patterns to predict missing values, providing a more accurate fill that respects traffic seasonality.
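The first two strategies can be illustrated on a toy series. This is a minimal sketch in plain Python, not the thesis code; the function names and the sample values are hypothetical. ARIMA-based imputation is not shown because it depends on a fitted model.

```python
import math


def drop_missing(series):
    """'Drop NaNs' strategy: keep only the observed values."""
    return [x for x in series if not math.isnan(x)]


def interpolate_linear(series):
    """Fill interior gaps on a straight line between the nearest observed
    neighbours. Leading or trailing NaNs are left untouched because they
    have an observed value on one side only."""
    filled = list(series)
    known = [i for i, x in enumerate(filled) if not math.isnan(x)]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical hourly traffic values (GB) with a two-hour gap.
traffic = [10.0, float("nan"), float("nan"), 16.0, 12.0]

dropped = drop_missing(traffic)
filled = interpolate_linear(traffic)
print(dropped)
print(filled)
```

Dropping shortens the series and breaks the regular hourly spacing, which is one reason a seasonal model can suffer from it; interpolation keeps the spacing but cannot reproduce a peak that falls inside a gap.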

Seasonal ARIMA Model

Since ARIMA requires stationary data, the Augmented Dickey-Fuller (ADF) test was used to check the dataset for stationarity. Stationarity means that statistical properties, such as mean and variance, remain constant over time. With a p-value below 0.05 and an ADF statistic below the critical value, the test rejected the null hypothesis of a unit root, indicating that the data met the stationarity requirement for ARIMA modelling.
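For reference, the ADF test [25] is usually stated through the regression below. This is the general textbook form rather than a formula taken from the thesis; which deterministic terms (constant, trend) and which lag order p were used here is not stated, so they are left generic.

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1}
           + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad
H_0:\ \gamma = 0 \ \text{(unit root)}
\quad \text{vs.} \quad
H_1:\ \gamma < 0 \ \text{(stationary)}
```

A test statistic more negative than the critical value, or equivalently a p-value below the chosen significance level, rejects H_0.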

The Seasonal ARIMA model was configured with a seasonal period of 24 to capture the daily cycle in the hourly data. An automated fitting process selected the ARIMA parameters that minimize the AICc value, which balances goodness of fit against model complexity. The model effectively accounted for short-term and seasonal variations in downlink traffic, enabling reliable forecasting of daily patterns.
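To make the selection criterion concrete, the sketch below computes the AICc from a model's log-likelihood and picks the candidate with the lowest value. The formula is the standard small-sample correction of the AIC; the candidate orders, log-likelihoods, parameter counts, and sample size are hypothetical values chosen for illustration, not results from the thesis.

```python
def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion for a model with k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """AIC with the small-sample correction, for n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical candidates: name -> (log-likelihood, number of parameters).
candidates = {
    "SARIMA(1,0,1)(1,1,1)[24]": (-2510.0, 5),
    "SARIMA(2,0,2)(1,1,1)[24]": (-2508.5, 7),
}
n_observations = 1000  # hypothetical length of the training series

scores = {
    name: aicc(loglik, k, n_observations)
    for name, (loglik, k) in candidates.items()
}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AICc = {score:.2f}")
print("selected:", best)
```

In this made-up example the larger model fits slightly better but not by enough to pay for its two extra parameters, so the smaller one is selected.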

The model's performance was evaluated under the three approaches to missing values (dropping NaNs, interpolation, and ARIMA imputation), using RMSE and MAE as error metrics (Table 2).

Table 2: RMSE and MAE for the different methods.

Method           RMSE    MAE
Drop NaNs        5.32    3.81
Interpolation    1.87    1.08
ARIMA model      1.65    1.01

Dropping NaNs had the highest errors, as it discards valuable data points. Interpolation improved accuracy, while ARIMA-based imputation produced the lowest RMSE and MAE values. With ARIMA-based imputation, the model made accurate predictions on the test set, closely matching the actual traffic data and demonstrating the value of time series-based imputation. Figure 4 displays the test-set predictions obtained when the ARIMA model is used to fill the missing values.

Figure 4: Forecast Plot.
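The two error metrics used in the comparison are simple to compute. The following sketch shows their standard definitions; the actual and predicted values are hypothetical and unrelated to the thesis dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: average size of the errors, in the data's unit."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical traffic values (GB) and forecasts.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]

print(f"RMSE = {rmse(actual, predicted):.3f}")
print(f"MAE  = {mae(actual, predicted):.3f}")
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE, and a wide gap between the two points to a few large misses rather than many small ones.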

Holt-Winters

The Holt-Winters method was applied as an alternative forecasting approach, using both additive and multiplicative seasonal models. The additive model proved more effective because the amplitude of the daily traffic fluctuations remained roughly constant, which is consistent with typical telecommunications usage patterns and is the behaviour that additive seasonality assumes. The model captured the traffic dynamics well, although the ARIMA model showed a slightly better fit. Holt-Winters therefore served as an additional tool for validating the seasonal trends identified by ARIMA.
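The additive variant can be written as three smoothing recursions for level, trend, and seasonal offsets. The sketch below is a minimal plain-Python illustration, not the implementation used in the thesis: the initialization from the first two seasons is one common simple choice, and the smoothing parameters and the sample series are hypothetical.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing.

    Returns `horizon` forecasts for the steps following the end of `series`.
    alpha, beta, gamma smooth the level, trend, and seasonal components.
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons to initialize")

    first = series[:period]
    second = series[period:2 * period]
    # Initial level: mean of the first season.
    level = sum(first) / period
    # Initial trend: average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / (period * period)
    # Initial seasonal offsets: deviation of each slot from the initial level.
    seasonal = [x - level for x in first]

    for t in range(period, len(series)):
        y = series[t]
        s = seasonal[t % period]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + h * trend + seasonal[(n + h - 1) % period]
        for h in range(1, horizon + 1)
    ]


# Hypothetical series: a constant-amplitude cycle of period 4, repeated 3 times.
series = [5.0, 10.0, 20.0, 10.0] * 3
forecast = holt_winters_additive(
    series, period=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=4
)
print([round(f, 3) for f in forecast])
```

On a series whose cycle has constant amplitude and no trend, the additive model simply reproduces the cycle. The multiplicative variant instead scales the seasonal effect with the level, which is a poor match when the amplitude does not grow with traffic volume.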

Aggregated Results

A comprehensive comparison across methods showed that Seasonal ARIMA delivered the best forecasting accuracy, particularly with ARIMA-based imputation. Holt-Winters' additive model and ARIMA with interpolation also performed well. Table 3 summarizes each method's RMSE and MAE results, confirming Seasonal ARIMA with ARIMA imputation as the most accurate approach, with promising applications for forecasting future network traffic patterns.

Table 3: RMSE and MAE with ARIMA and Holt-Winters.

Method                     RMSE     MAE
ARIMA (Drop NaN)           5.32     3.81
ARIMA (Interpolation)      1.87     1.08
ARIMA (Predicting NaN)     1.65     1.01
HW (Additive)              1.93     1.46
HW (Multiplicative)        10.18    8.93

6. Conclusions

This thesis addressed Multivision's need for a scalable and efficient data lakehouse to support mobile network management in the telecommunications sector. By leveraging contemporary data engineering tools, a robust system was developed to overcome challenges in data ingestion, processing, and accessibility. Furthermore, the integration of time series forecasting models for downlink traffic demonstrated the lakehouse's practical value in optimizing network performance, enhancing decision-making, and improving operational efficiency.

6.1. Data Lakehouse Achievements

This thesis developed a data lakehouse that efficiently manages both structured and unstructured data in the telecommunications sector. The multi-zone pipeline, spanning the landing, raw, curated, and enriched data stages, streamlined the data flow and lifecycle management. Performance improved notably: automated data aggregation queries now average 21.17 seconds, speeding up access to critical network data and enabling faster, data-driven decision-making. Furthermore, converting TSV files to the Iceberg format achieved a compression ratio of roughly 10:1, significantly reducing storage costs and showing that the lakehouse can handle growing data demands in telecommunications.

6.2. Time Series Achievements

The forecasting models in this thesis focused on downlink traffic, a key KPI for telecommunications. The ARIMA model accurately captured daily and seasonal trends, closely aligning with observed data, while the Holt-Winters model offered an alternative but slightly less accurate option. Together, these models demonstrated the architecture's capacity for reliable traffic prediction, supporting proactive resource allocation and network planning.

6.3. Future Work

Future work should consider automating the architecture using tools like Airflow to streamline data pipelines for production environments [29]. This automation would enable a seamless, uninterrupted data flow, eliminating manual tasks and improving efficiency.

Expanding the architecture's applications is another potential area. One example is developing machine learning models to predict the causes of dropped calls (another KPI), which could be categorized into factors such as weak signal and network congestion. This would further demonstrate the architecture's versatility in predictive analytics, offering telecom operators valuable insights for network optimization.

Lastly, incorporating additional metrics like cell coverage area and neighboring cell relationships could provide a more comprehensive view of network load distribution, complementing traffic prediction for more efficient resource management. For example, in a scenario with 6,578 cells, applying the cell deactivation strategy described in Section 5.2 to 43% of the cells could result in approximately 37% energy savings, a €900,000 reduction in operational costs, and an annual decrease of 2,556 tons in CO2 emissions [28]. These enhancements would enable telecom operators to optimize resource allocation, maximizing efficiency and minimizing environmental impact.

References

[1] Ahmed A. Harby and Farhana Zulkernine. From data warehouse to lakehouse: A comparative review. In 2022 IEEE International Conference on Big Data (Big Data), pages 389–395, 2022.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[3] Casber Wang. What is the open data ecosystem and why it’s here to stay. https://sapphireventures.com/blog/what-is-the-open-data-ecosystem-and-why-its-here-to-stay/, 2021.

[4] Vladimir Belov and Evgeny Nikulchev. Analysis of big data storage tools for data lakes based on Apache Hadoop platform. International Journal of Advanced Computer Science and Applications, 12:551–557, 08 2021.

[5] Salisu Borodo, Siti Mariyam Shamsuddin, and Shafaatunnur Hasan. Big data platforms and techniques. Indonesian Journal of Electrical Engineering and Computer Science, 1:191, 01 2016.

[6] Medium. Demystifying data lakes: Design, architecture, and best practices, 2023.

[7] Dražen Oreščanin and Tomislav Hlupić. Data lakehouse - a novel step in analytics architecture. pages 1242–1246, 2021.

[8] Databricks. What is a medallion architecture? https://www.databricks.com/glossary/medallion-architecture, 2023.

[9] Jaber Alwidian, Sana Abdel Rahman, Maram Gnaim, Fatima Al-Taharwah, et al. Big data ingestion and preparation tools. Modern Applied Science, 14(9):12–27, 2020.

[10] Mohamed Cherradi and Anass El Haddadi. A scalable framework for data lakes ingestion. Procedia Computer Science, 215:809–814, 2022. 4th International Conference on Innovative Data Communication Technology and Application.

[11] Eman Shaikh, Iman Mohiuddin, Yasmeen Alufaisan, and Irum Nahvi. Apache Spark: A big data processing engine, 2019.

[12] Apache NiFi. https://nifi.apache.org/. Accessed: 2024-10-22.

[13] Mohammad Irfan, Reena, and Jossy George. Data ingestion - cloud based ingestion analysis using NiFi, 2023.

[14] Airbyte: Open Source Data Integration. https://airbyte.com/. Accessed: 2024-10-22.

[15] Vladimir Belov and Evgeny Nikulchev. Analysis of big data storage tools for data lakes based on Apache Hadoop platform. International Journal of Advanced Computer Science and Applications, 12(8), 2021.

[16] Vandana Vijay, Vaibhav Sharma, Vandana Srivastava, and Kumar Vipin. A comparative study on Hadoop MapReduce and Apache Spark framework for big data analytics. International Journal of Research Publication and Reviews, 5:3228–3232, 02 2024.

[17] A. Makris, I. Kontopoulos, S. Nektarios Xyalis, E. Psomakelis, T. Theodoropoulos, A. Varvarigos, and K. Tserpes. A study on the performance of distributed storage systems in edge computing environments, 2024.

[18] OVHcloud: Cloud Computing and Web Hosting Solutions. https://www.ovhcloud.com/pt/. Accessed: 2024-10-22.

[19] Wasabi: Setup Limits and Policies. https://docs.wasabi.com/docs/setup-limits-and-policies. Accessed: 2024-10-22.

[20] Tomer Shiran, Jason Hughes, and Alex Merced. Apache Iceberg: The Definitive Guide. O’Reilly Media, 2024. Data Lakehouse Functionality, Performance, and Scalability on the Data Lake.

[21] Peter J. Brockwell and Richard A. Davis. Introduction to Time Series and Forecasting. Springer, 3rd edition, 2016.

[22] Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, Melbourne, Australia, 3rd edition, 2021. Monash University, Australia.

[23] H. Akaike. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1971.

[24] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[25] S. E. Said and D. A. Dickey. Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika, 1983.

[26] D. Kwiatkowski, P. C. Phillips, P. Schmidt, and Y. Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 1992.

[27] Chris Chatfield and Mohammad Yar. Holt-Winters forecasting: some practical issues. Journal of the Royal Statistical Society Series D: The Statistician, 37(2):129–140, 1988.

[28] Diogo Clemente, Lúcio Ferreira, Gabriela Soares, Nuno Valente, and Pedro Sebastião. Implementation of a cloud-based, traffic aware and energy efficient management of base stations’ activity. In 2018 21st International Symposium on Wireless Personal Multimedia Communications (WPMC), 2018.

[29] Apache Airflow. https://airflow.apache.org/. Accessed: 2024-10-30.
