National Technical University of Athens
School of Electrical and Computer Engineering
Division of Communications, Electronics & Information Systems
Optimizing Data Analysis Through Offline
Large Language Models and Scalable Data
Management Techniques
Research, Study and Implementation
PhD Thesis
of
ANASTASIOS NIKOLAKOPOULOS
Supervisor: Theodora Varvarigou
Professor, NTUA
Athens, May 2025
Approved by the examination committee on 8th May 2025.
Theodora Varvarigou, Professor, NTUA
Emmanouil Varvarigos, Professor, NTUA
Symeon Papavasileiou, Professor, NTUA
Konstantinos Tserpes, Assistant Professor, NTUA
Dimitrios Askounis, Professor, NTUA
Antonios Litke, Lecturer, HAA
Anastasios Doulamis, Professor, NTUA
Copyright © All rights reserved.
Anastasios Nikolakopoulos, 2025.
The copying, storage and distribution of this PhD thesis, in whole or in part, for
commercial purposes is prohibited. Reprinting, storage and distribution for non-profit,
educational or research purposes is allowed, provided that the source is indicated and
that this message is retained.
The content of this thesis does not necessarily reflect the views of the Department, the
Supervisor, or the committee that approved it.
DISCLAIMER ON ACADEMIC ETHICS AND INTELLECTUAL PROPERTY RIGHTS
Being fully aware of the implications of copyright laws, I expressly state that this PhD
diploma thesis, as well as the electronic files and source codes developed or modified in
the course of this thesis, are solely the product of my personal work and do not infringe
any rights of intellectual property, personality and personal data of third parties, do not
contain work / contributions of third parties for which the permission of the authors /
beneficiaries is required and are not a product of partial or complete plagiarism, while
the sources used are limited to the bibliographic references only and meet the rules of
scientific citing. The points where I have used ideas, text, files and / or sources of other
authors are clearly mentioned in the text with the appropriate citation and the relevant
complete reference is included in the bibliographic references section. I fully, individually
and personally undertake all legal and administrative consequences that may arise in the
event that it is proven, in the course of time, that this thesis or part of it does not belong
to me because it is a product of plagiarism.
(Signature)
Anastasios Nikolakopoulos
PhD, National Technical University of Athens
Athens, May 2025
...To my loving parents, Konstantinos & Georgia, and my beloved wife, Chrysa
Abstract
The importance of data across various sectors demands innovative approaches to data
management and analytics. This PhD thesis investigates the integration of offline large
language models (LLMs) for automated code generation, aiming to streamline data analysis
processes and thereby enhance the scalability and efficiency of data management systems.
By leveraging offline LLMs, the proposed approach enables users to perform data analyses
without extensive programming skills, thereby democratizing data analytics. The research
examines the architecture and implementation of scalable data management systems that can
efficiently handle datasets of varying volumes. Building on an efficient data management
platform, the thesis examines the capability of offline LLMs to generate analytical code,
showing how these models can transform user queries into executable scripts that
facilitate data manipulation and interpretation. Experiments and case studies demonstrate
the practical applications and benefits of the proposed approach. The results highlight
the potential of offline LLMs in data science and analysis. This thesis contributes to the
field by presenting a study that integrates AI-driven code generation with robust data
management practices, paving the way for more efficient and user-friendly data analytics
solutions.
Keywords
Data Science, Data Analysis, Machine Learning, Artificial Intelligence, Large Language
Models, Indoor Localization, Code Generation, Big Data
Abstract in Greek
The importance that data has acquired across diverse sectors makes it necessary to
develop innovative approaches to its management and analysis. This doctoral thesis
investigates the integration of local Large Language Models (offline LLMs) for automated
code generation, with the aim of simplifying data analysis processes and, by extension,
improving the scalability and efficiency of data management systems. By using locally
executed LLMs, the proposed approach enables users without extensive programming
knowledge to carry out data analyses, thereby contributing to the democratization of data
science. The research focuses on the architecture and implementation of scalable data
management systems capable of efficiently handling large-volume datasets. Building on an
efficient data management platform, the work examines the capabilities of offline LLMs
for generating analytical code, showing how these models can transform users’
natural-language questions into executable scripts that facilitate data manipulation and
interpretation. Experimental procedures and case studies present the practical
applications and benefits of the proposed approach. The results highlight the
effectiveness of locally executed Large Language Models in data science and analysis. The
thesis contributes to the field by presenting a study that combines AI-based automated
code generation techniques with robust data management practices, paving the way for more
efficient and user-friendly data analysis solutions.
Keywords
Data Science, Data Analysis, Machine Learning, Artificial Intelligence, Large Language
Models, Indoor Localization, Code Generation, Big Data
Acknowledgements
I would like to express my gratitude to my supervisor, Dr. Theodora Varvarigou, for her
unwavering trust, insightful guidance, and continuous support throughout the course of
my doctoral journey. I am also thankful to the members of my examination committee for
their valuable feedback, constructive suggestions, and contribution to the evaluation and
improvement of this dissertation. Moreover, I would like to extend my appreciation to my
colleagues at the Distributed Knowledge and Media Systems (DKMS) laboratory for their
collaboration, insightful discussions, and the positive and inspiring research environment
they helped foster. Last but not least, my heartfelt thanks go to my parents and my wife
for their endless patience, understanding, and unconditional support, which gave me the
strength and motivation to persevere during challenging times.
Athens, May 2025
Anastasios Nikolakopoulos
Table of Contents
Abstract
Abstract in Greek
Acknowledgements
Prelude
Extended PhD Thesis Summary in Greek
I Introduction
1 Introduction: Towards Efficient Data Management and Analysis
1.1 Scalable Data Profiling for Quality Analytics Extraction
1.2 A Scalable Data Management and Interoperability Solution
1.2.1 Data Management Challenges
1.2.2 Efficient Data Management Layer Proposal
1.3 Large Language Models and their Impact in Modern Applications
1.3.1 Applications and Impact of LLMs
1.3.2 Exploring Offline LLMs in Data Analysis
II Main Analysis
2 Literature Review: Assessing Similar and Existing Approaches to Data Management and AI-Powered Analysis
2.1 Generic Approaches to Scalable Data Management and Analytics
2.1.1 Data Profiling and Quality Analysis
2.1.2 User-Defined Quality Rules
2.2 Related Scalable Data Management Proposals and Solutions
2.2.1 EU Research Projects and Initiatives
2.2.2 Research Work on Scientific Literature
2.3 Applications of LLMs for Code Generation and Data Analysis Operations
2.3.1 Code Generation Techniques
2.3.2 Literature Surveys
2.3.3 Other Applications and Market Tools
2.3.4 Prompt Engineering
3 Study Design, Methodology, and Implementation: A Complete Analysis of the Framework
3.1 Initial Generic Overview
3.1.1 Outline
3.1.2 Framework Operational Flows
3.1.3 Technologies To Use
3.2 The Scalable Data Management Software Infrastructure
3.2.1 Data Processing and Virtualization
3.2.2 Pre-Processing and Filtering Tool
3.2.3 Virtual Data Repository
3.2.4 Virtual Data Container
3.3 The Final System with the Utilization of Offline LLMs
3.3.1 System Architecture
3.3.2 Dataset Selection
3.4 Technical Components
3.4.1 LLMs and Data Processing Platform
3.4.2 Prompt Formulation
3.4.3 Dataset Queries
4 Testing and Results: Assessing the Study’s Performance
4.1 Validating the Scalable Capabilities of the Underlying Infrastructure
4.2 Evaluating the Final System
4.2.1 Physical Machine Specifications
4.2.2 Testing Results Collection
4.2.3 Evaluation
5 Thesis Findings: Critical Assessment of the Proposed System
5.1 General Note
5.2 Validating Scalable Data Management Frameworks
5.3 Entrusting Offline LLMs for Future Data Analytics
5.3.1 Strengths and Limitations
5.3.2 Future Work
6 Future Use Case: Real-Time Indoor Localization Frameworks
6.1 Indoor Localization in Today’s Applications
6.1.1 Location Information
6.1.2 Indoor Localization as an Area Network
6.2 Literature Review
6.2.1 RSS, PDR and Filtering Techniques
6.2.2 Machine Learning and Ultra Wideband
6.3 TAN Conceptualization
6.3.1 Scope and Characteristics
6.3.2 Technical Details
6.3.3 Use Cases
6.4 Proof-of-Concept Implementation
6.4.1 Analysis
6.4.2 Testing and Results
6.5 Findings
III Epilogue
7 Conclusions: Towards Scalable Data Management Empowered By Offline Large Language Models
7.1 Managing Scalable Volumes of Data
7.2 Leveraging Offline LLMs for Data Analysis
Appendices
A Appendix
A.1 The Python code readability calculation function
A.2 Summary of the ‘Shared Cars Locations’ Dataset
Bibliography
List of Abbreviations
List of Figures
1.1 The 16 critical infrastructure sectors of global industry and economy [1].
3.1 The initial architecture of the proposed scalable data management and analysis framework, along with its two operational flows, ‘Data Profiler’ and ‘Data Evaluator’ [2].
3.2 Architectural view of the scalable data management layer, with its three subcomponents [1].
3.3 The information flow of the Virtual Data Container [3].
3.4 Accepted operators by VDC’s Rules System [1].
3.5 The proposed system’s final architecture overview [4].
4.1 A JSON object from the “urn-ngsi-ld-ITI-Customs” dataset, containing information for a given port’s customs item [1].
4.2 A JSON object that contains information for a random-anonymized-OTE cellular network user in the area of Thessaloniki, Greece [1].
4.3 Screenshot from scalable data management layer’s Apache Spark Cluster WebUI [1].
4.4 Screenshot from Spark cluster driver’s logs [1].
4.5 A screenshot from the Apache Spark Cluster’s WebUI, for the November 2022 subset [1].
4.6 Screenshot from the logs of the Spark driver “driver-20230901183325-0000”, for the November 2022 subset [1].
4.7 A screenshot from the Spark Cluster’s WebUI, for the Customs subset containing 6 million objects [1].
4.8 A screenshot from “driver-20230901184317-0001” Spark cluster driver’s logs, for the 6 million subset [1].
4.9 A screenshot from Spark Cluster’s WebUI, for the Customs subset comprising 11 million objects [1].
4.10 A screenshot from the logs of the Spark driver “driver-20230901185350-0002”, for the 11 million subset [1].
4.11 A screenshot from the Spark Cluster’s WebUI, for the complete Customs dataset [1].
4.12 A screenshot from the logs of the Spark driver “driver-20230901191216-0003”, for the complete Customs subset [1].
4.13 A screenshot from the Spark Cluster’s WebUI, for the Mobility subset of 2019 with 6.9 million objects [1].
4.14 A screenshot from the logs of the Spark driver “driver-20231004114021-0000”, regarding the Mobility 2019 subset [1].
4.15 The Spark Cluster’s WebUI, depicting the driver “driver-20231004114951-0001” in running state, for the 2020 Mobility subset [1].
4.16 A screenshot from the “driver-20231004114951-0001” Spark driver’s logs, for the Mobility 2020 subset [1].
4.17 A screenshot from the Spark Cluster’s WebUI, where the driver “driver-20231004115659-0002” is depicted in active state, for the Mobility 2021 subset [1].
4.18 A screenshot from the “driver-20231004115659-0002” Spark driver’s logs, which handled the Mobility subset of 2021 [1].
4.19 A screenshot from the Spark Cluster’s WebUI, where the driver “driver-20231004120403-0003” is depicted in active state, for the complete Mobility dataset [1].
4.20 A screenshot from the Spark driver’s “driver-20231004120403-0003” logs, as it ceased its operation on the complete Mobility dataset [1].
4.21 Scatter plot with the tests of ITI Customs and OTE Mobility subsets, in different plot lines [1].
4.22 Scatter plot including all tests, in the same plot line [1].
4.23 Scatter plot including the final ITI Customs dataset [1].
4.24 Functional correctness of the LLM’s generated code plot [4].
4.25 Plot depicting the functional correctness, by dataset [4].
4.26 Plot for the functional correctness scores by the queries’ complexity levels [4].
4.27 Plot for the distribution of readability scores across the tests conducted [4].
4.28 Plot depicting the average readability scores by dataset [4].
4.29 Plot displaying the average readability by query complexity level [4].
4.30 Distribution of the LLM server’s response time during code generation [4].
4.31 Distribution of the LLM server’s CPU usage during code generation [4].
4.32 Distribution of the LLM server’s memory usage during code generation [4].
4.33 Distribution of the LLM server’s GPU usage during code generation [4].
4.34 Distribution of the LLM server’s GPU memory usage during code generation [4].
4.35 The LLM server’s average response time, by code readability [4].
4.36 The LLM server’s average response time, by query complexity level [4].
4.37 Average CPU usage of the LLM server, by readability and query complexity [4].
4.38 The LLM server’s average GPU usage, by readability and query complexity [4].
4.39 Average response times of the LLM server, by readability and query complexity [4].
4.40 Average GPU memory usage of the LLM server, by readability and query complexity [4].
4.41 The number of automated and semi-automated tests, based on human intervention in their code [4].
4.42 Comparing Functional Correctness results with Automation occurrences [4].
4.43 Presenting the automation occurrences by the query complexity levels [4].
4.44 Comparing Functional Correctness results with error counts [4].
4.45 The number of errors found in the tests, per dataset [4].
4.46 The total count of errors, grouped by query complexity levels [4].
6.1 High-level overview of the Transactional Area Network’s implementation within an indoor environment [5].
6.2 A focused overview of the user communication in a TAN, showcasing the data-exchange pipelines between two users, as well as the existence of external hardware for optional assistance [5].
6.3 A Transactional Area Network operates across multiple devices in indoor environments. Each device transmits and receives BLE beacon signals, initiating peer-to-peer sessions, by also broadcasting and receiving Wi-Fi signals [5].
6.4 Swift code snippet that verifies and ensures the uniqueness of Major/Minor values [5].
6.5 Swift code snippet that checks the device’s ability to transmit as a beacon [5].
6.6 ServiceAdvertiser receives an invitation to join another peer’s session [5].
6.7 ServiceBrowser locates a peer and checks if they already exist in the device’s list with known peers [5].
6.8 The Major/Minor pairing process, and examining for matches in the known Peers list [5].
6.9 TAN prototype software’s internal components’ architecture [5].
6.10 A data sample from the TAN’s proof-of-concept software [5].
6.11 Establishment of TAN between two peers, as seen from Daniel Adams’ device (iPhone X) [5].
6.12 David Jones’s device (iPhone 8) showing the distance between him and Daniel Adams [5].
6.13 Alex Lopez with his device (iPad Pro 10.5 2017) entering TAN [5].
6.14 David Jones’s new distance from the users, as shown in his device’s screen [5].
6.15 Daniel Adams enters TAN through his iPhone X, viewing two peers nearby, John Doe and Catherine Smith [5].
6.16 Daniel Adams’s new distance from users John Doe and Catherine Smith [5].
6.17 John Doe takes one final look at his iPad’s screen, observing the distance between himself and users Daniel Adams and Catherine Smith [5].
6.18 Proposed improvement in TAN software’s internal components Architecture, highlighting the addition of Kalman filtering and outlier detection methods, along with a signal direction calculation process [5].
Prelude
In the era of digital transformation, data has emerged as one of the most valuable
assets across industries, driving innovation, strategic decision-making, and operational
efficiency. The rapid expansion of digital services, IoT deployments, and interconnected
systems has led to an exponential increase in data generation. Organizations today are
faced with an unprecedented volume, velocity, and variety of data, making efficient data
management and analytics critical requirements for maintaining competitiveness, and
achieving business and research objectives.
Traditional approaches to data analytics often require specialized knowledge of
programming and query languages, data processing frameworks, and the data itself. This
creates a barrier for non-technical users who wish to extract meaningful insights from
datasets. Moreover, as datasets grow in complexity, ensuring data quality, managing
storage efficiently, and executing large-scale analytical tasks in a timely manner become
formidable challenges. Organizations also tend to adhere to strict data privacy
regulations, which highlights the need for solutions that ensure data integrity while
preventing exposure to external cloud-based services.
Furthermore, current big data environments rely heavily on distributed computing
frameworks (such as Apache Spark) to process large datasets efficiently. Such frameworks
have proven to be robust approaches to big data management. However, defining queries,
writing scripts, and optimizing execution pipelines remains a bottleneck, especially for
end-users, analysts, and decision-makers who are unfamiliar with programming or data
technologies. Introducing offline LLMs into this landscape could transform how users
interact with data by automating the translation of human language into structured
queries and execution plans.
These challenges call for innovative approaches to data management and analytics.
Recently, large language models (LLMs), such as GPT or Gemini, have demonstrated
remarkable proficiency in understanding and generating human-like text. A future data
management and analytics framework could benefit from an offline version of such LLMs,
since the model’s advanced natural language processing capabilities would provide several
advantages to data analysis while also ensuring data privacy and security. The LLM’s role
should be to interpret users’ natural language queries and translate them into
system-executable code, which could then be automatically executed within the framework
itself. This translation process could bridge the gap between non-technical users and
complex data analytics tasks, democratizing access to powerful data insights.
This thesis proposes a scalable data management framework augmented by the capabilities
of an offline large language model (LLM). The framework aims to simplify data interaction
by allowing users to issue natural language queries, which the LLM translates into
executable code for the framework itself. Its architecture supports various data volumes,
sources, and formats, enabling seamless integration with existing data infrastructures.
The core innovation of the framework lies in its ability to leverage offline LLMs to
translate natural language queries into executable code, enabling users, even those with
minimal technical expertise, to interact with complex datasets in an intuitive manner.
The final framework also ensures data security and compliance, allowing organizations to
retain full control over their datasets while maintaining computational efficiency.
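The query-to-code flow described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`build_prompt`, `stub_offline_llm`, `answer`), the prompt wording, the `speed` column, and the allow-list of built-ins are all assumptions made for illustration, and the stub simply returns a fixed snippet where a locally hosted model would be called.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical allow-list: the only built-ins visible to generated code.
# This limits accidental misuse but is NOT a real security sandbox.
SAFE_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max, "sorted": sorted}


def build_prompt(question: str, columns: List[str]) -> str:
    """Combine the user's question and the dataset columns into one prompt."""
    return (
        "Write Python only. A list of dicts named `records` is available.\n"
        f"Columns: {', '.join(columns)}\n"
        f"Task: {question}\n"
        "Store the answer in a variable named `result`."
    )


def stub_offline_llm(prompt: str) -> str:
    """Stand-in for a locally hosted model (assumption): returns canned code."""
    return "result = sum(1 for r in records if r['speed'] > 50)"


def answer(question: str, records: List[Record], llm: Callable[[str], str]) -> object:
    """Ask the model for code, run it against the records, return `result`."""
    columns = sorted(records[0].keys()) if records else []
    code = llm(build_prompt(question, columns))
    namespace = {"__builtins__": SAFE_BUILTINS, "records": records}
    exec(code, namespace)  # generated code reads `records` and sets `result`
    return namespace["result"]


if __name__ == "__main__":
    sample = [{"speed": 40}, {"speed": 60}, {"speed": 75}]
    print(answer("How many records have speed above 50?", sample, stub_offline_llm))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM runtime; a real deployment would also need to validate or isolate the generated code before running it.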
In summary, the proposed framework not only bridges the gap between end-users and data
management and analytics technologies, but also enhances analytical workflows by
(i) enabling natural language querying for complex data operations, (ii) automating the
generation of optimized executable code for data management systems, (iii) providing a
privacy-preserving alternative by using an offline LLM instead of cloud-based AI
services, and (iv) ensuring scalability and adaptability across different industries and
data domains. However, there are challenges to consider. The accuracy of the LLM in
translating natural language queries into effective code is critical, since
misinterpretations could lead to incorrect or suboptimal data analyses. Continuous
refinement of the LLM’s prompt engineering is therefore necessary to improve its
performance. Furthermore, maintaining the system’s efficiency as data volumes grow
requires careful evaluation of the underlying infrastructure.
The proposed scalable data management framework represents a potential advancement in
the accessibility and efficiency of data analytics. By enabling non-technical users to
perform complex data analyses through simple queries, the framework democratizes access
to powerful data insights. The indoor localization data management use case serves as an
example of the practical benefits and potential of this approach, highlighting its
applicability across various domains. This integration of offline LLMs with data
management solutions not only enhances accessibility but also optimizes data workflows,
making analytics more efficient and secure. This thesis presents a systematic approach to
designing, implementing, and evaluating the framework, showcasing its potential to
benefit the field of scalable data management and analytics.
Extended PhD Thesis Summary in Greek
Thesis Context
In today’s rapidly evolving technological landscape, data stands out as one of the most
valuable assets [6]. As technological advances occur with increasing frequency [7], the
creation and dissemination of data have become integral elements of emerging products,
services, and systems. Consequently, the volume of data generated daily worldwide is
striking. According to recent estimates, approximately 328.77 million terabytes of data
are generated every day [8], including newly created, captured, copied, or consumed data.
This enormous flow of data has already proven its significant value in various sectors.
Maintaining high data quality is essential for all data consumers. There is a strong need
to explore a scalable, or volume-independent, data analysis solution that allows users to
easily define their own queries, tailored to their specific analytical needs. This
doctoral thesis proposes a software system designed to handle data in a scalable manner,
performing extensive quality assessments on entire datasets with the help of artificial
intelligence and Large Language Models (LLMs). The proposed system aligns with
established data analysis practices, extracting quality information from datasets while
offering users the ability to define their own quality criteria. It can apply data
profiling mechanisms to entire datasets, regardless of their volume, and operate
efficiently even in computing environments with limited resources, ensuring both the
scalability and the performance of the system. Making data analysis easy for end users to
carry out is also a main goal of this thesis.
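As a rough illustration of data profiling combined with user-defined quality criteria, the following Python sketch computes per-column statistics and the share of records that satisfy each rule. It is an assumption-laden toy, not the system described in the thesis: the function names (`profile`, `evaluate`), the rule format (plain predicates), and the sample columns are hypothetical, and it runs in memory on a list of dictionaries rather than on a distributed engine.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]
Rule = Callable[[Record], bool]


def profile(records: List[Record]) -> Dict[str, Dict[str, int]]:
    """Per-column counts of present, missing (None or absent) and distinct values.

    Assumes column values are hashable (numbers, strings, and similar).
    """
    columns = sorted({key for rec in records for key in rec})
    report: Dict[str, Dict[str, int]] = {}
    for col in columns:
        values = [rec.get(col) for rec in records]
        present = [v for v in values if v is not None]
        report[col] = {
            "present": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def evaluate(records: List[Record], rules: Dict[str, Rule]) -> Dict[str, float]:
    """Share of records (0.0 to 1.0) that satisfy each user-defined rule."""
    total = len(records)
    return {
        name: (sum(1 for rec in records if rule(rec)) / total if total else 0.0)
        for name, rule in rules.items()
    }


if __name__ == "__main__":
    rows = [{"id": 1, "speed": 40}, {"id": 2, "speed": None}, {"id": 3, "speed": 40}]
    print(profile(rows))
    print(evaluate(rows, {"speed_present": lambda r: r.get("speed") is not None}))
```

Expressing each quality criterion as a named predicate keeps the rules independent of the profiling step, so new criteria can be added without touching the statistics code.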
In particular, a need for scalable data management solutions is observed in various
sectors. One of them is the sector of Critical Infrastructures (CIs). The term refers to
the systems, networks, and facilities that are vital to the functioning of modern
society and of its core sectors. These include, among others, Energy, Transportation,
Water Supply, Telecommunications, Healthcare, Finance, Agriculture, and State /
Government Facilities [9]. Given their fundamental role, any disruption of these
infrastructures can lead to social and economic consequences [10]. It is therefore
appropriate to regard CIs as the pillars on which the global economy rests.
From a data perspective, CIs are producers of large volumes of information, generating enormous amounts of data every day in areas such as (infrastructure) operations, (process) monitoring, (system) maintenance, (network) security, and others. As digital transformation and interconnectivity continue to evolve, the rate at which CI-related data is produced is increasing rapidly. Managing and exploiting this data effectively is critical for optimizing performance, strengthening security, and supporting informed decisions. Solutions that successfully meet the complex requirements of CIs are likely to perform well in less demanding business environments too. This makes CIs an ideal testbed for advancing data management technologies, such as the one that forms part of this thesis.
Addressing the challenges of scalable data analysis requires a multidimensional effort that brings together expertise in computer science, data engineering, machine learning, and the relevant application domain, while also fostering collaboration among different stakeholders. As technology advances, new challenges are expected, making it essential for researchers and practitioners to stay informed and adaptable in the evolving field of big data. The gap between optimal scalable data management and the exploitation of that data by the organizations that produce it, as well as by third parties and beneficiaries, can be bridged by the work proposed in this thesis through efficient and scalable data management.
Conceptually, the system of this thesis is a middleware framework: a software infrastructure that enables data management regardless of volume. Initially tailored to relieve CIs of the time-consuming task of correctly and fully managing the data produced in their sectors, it evolved into a software tool with applications in various domains. Its early conceptual form was presented in 2022 [3]. That initial approach has since been extended, restructured, improved, and implemented, leading to the current scalable data management framework.
As for optimal data analysis, with end users as the main analysts, artificial intelligence is the primary development framework. In particular, the rapid development of Large Language Models (LLMs) has created new opportunities for research and practical applications across a variety of fields. The advanced capabilities of these models have attracted significant interest from the global research community, leading to increased investment aimed at advancing LLM-based applications. Well-known LLMs, such as OpenAI's GPT-4 [11] and Google's Gemini [12], have reached a level of sophistication at which they can understand, generate, and manipulate human language to a remarkable degree. As a result, these models are now applied across a wide range of fields, such as healthcare and engineering. These developments have set new standards in both research and industrial applications of language models, with the main emphasis on improving the accuracy of model outputs [13] [14].
In addition, large language models have shown significant potential for improving various aspects of Data Science. For example, GPT-4 can perform tasks such as data cleaning, feature extraction, and even the training of other models with minimal human intervention [15] [16]. In data analysis, LLMs offer the advantage of automatically generating insights from datasets. In practice, these models can perform traditional data analysis tasks, detect anomalies, and even provide predictive analysis for future use by data engineers and other professionals. Consequently, LLMs can be applied to processing data from various sources, such as customer feedback, financial reports, and social media data, with the ultimate goal of extracting actionable insights. Their ability to understand and interpret different types of datasets makes them valuable tools for data scientists [17] [18].
The importance of local LLMs for preserving data integrity and privacy should be noted. Organizations that handle sensitive data should be able to take advantage of the capabilities of large language models. For example, users in such organizations could benefit significantly from interacting with language models by simply posing data analysis queries in natural language and receiving visualized results (or direct data samples) in return. Instead of manually selecting specific columns and applying predefined rules through a specialized environment (such as Microsoft Excel), end users and data scientists could simply write queries in natural language and see the results on their screen, thanks to the capabilities of language models.
In summary, offline LLMs can significantly help data scientists with data analysis and insight extraction, simplifying the submission of analysis queries and reducing user effort. This leads to the conclusion that businesses, organizations, and companies should consider adopting local models. The main part of this PhD thesis examines how offline LLMs can enhance data analysis by generating code for data analysis tasks from natural language queries. Keeping the data local removes the data integrity and security risks associated with transmitting it to online LLMs. In addition, this approach addresses the limitation of LLMs in processing large datasets, which stems from their architectural and computational constraints. In the proposed method, the LLM does not load entire datasets. Instead, it works with a concise summary. For each data profiling / analysis query, the model generates specialized code, which is then executed on the full dataset by a software job (running on the scalable data management software that forms part of this thesis) to produce the final results. In this way, both data integrity and the processing limitations are handled effectively.
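The idea that the model sees only a concise summary, and never the full dataset, can be illustrated with a small sketch. The summary layout used here (row count, per-column types, null counts, and a few sample values) and the helper name `summarize_rows` are assumptions made for illustration; the summary does not specify the exact format the thesis uses. Plain Python records stand in for a PySpark dataframe.

```python
from __future__ import annotations

from typing import Any


def summarize_rows(rows: list[dict[str, Any]], sample_size: int = 3) -> dict[str, Any]:
    """Build a compact, prompt-sized description of a tabular dataset.

    Hypothetical helper: the summary layout is an assumption of this sketch.
    """
    columns: dict[str, dict[str, Any]] = {}
    for row in rows:
        for name, value in row.items():
            info = columns.setdefault(name, {"types": set(), "nulls": 0, "samples": []})
            if value is None:
                info["nulls"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Keep only a handful of distinct example values per column.
            if len(info["samples"]) < sample_size and value not in info["samples"]:
                info["samples"].append(value)
    return {
        "row_count": len(rows),
        "columns": {
            name: {
                "types": sorted(info["types"]),
                "nulls": info["nulls"],
                "samples": info["samples"],
            }
            for name, info in columns.items()
        },
    }


rows = [
    {"city": "Athens", "temp_c": 31.5},
    {"city": "Patras", "temp_c": None},
    {"city": "Athens", "temp_c": 29.0},
]
summary = summarize_rows(rows)
```

The size of such a summary depends on the number of columns rather than the number of rows, which is what lets a model with a limited context window reason about a dataset of any volume.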
The final proposed solution consists of a local language model integrated into the data management software infrastructure described above. The data remains under the organization's control, which ensures its security. For each dataset, the LLM is given concise metadata that lets it understand the context of the data under analysis. Each time the organization's data scientist submits a query in natural language, the model generates specialized code for analyzing the data. This code is applied to the data by dedicated software, and the final results are returned to the user. As mentioned earlier, this approach not only ensures data security but also improves the efficiency of the analysis by addressing the core limitations of LLMs. It also simplifies the process for end users, since they can state their data analysis preferences in natural language.
Related Scientific Literature
Regarding efficient data management, many proposals and implementations have appeared to date. The main ones that fit the scope of this thesis are presented below. First, Baek (2014) [19] proposed a cloud-based framework for big data management in smart electricity grids, emphasizing the role of hierarchical cloud centers in processing and analyzing smart meter data. Although the benefits of cloud computing were highlighted, such as scalability, energy efficiency, and cost reduction, the work did not include a final efficiency evaluation. Similarly, Luckow (2015) [20] discussed the emergence of large data volumes in the automotive industry and the potential benefits of adopting Apache Hadoop [21], examining its scalability, its various processing engines, and its integration with existing software systems used in the automotive industry. Dinov (2016) [22] studied the complexities of managing large datasets in healthcare, underlining the challenges of integrating heterogeneous data types and of applying advanced analysis techniques to them. Ultimately, the team advocated hybrid solutions that mainly involve cloud computing.
Kaur (2019) [23] proposed an energy-efficient system with big data management capabilities for Software-Defined Data Centers (SDDCs) in IoT environments. The research focused on optimizing the deployment of Virtual Machines (VMs) that could run the proposed data management system, aiming to reduce energy costs while maintaining its performance. Although it does not focus on scalable data integration, the work is relevant to the design of efficient software infrastructures. More recently, Donta (2023) [24] investigated Distributed Computing Continuum Systems (DCCS) for processing IoT data. The proposed data governance framework was presented in detail, without, however, fully addressing the challenges of interoperability and data integration.
As for optimal data analysis using artificial intelligence, and specifically language models that generate executable data analysis code, recent studies have explored the code generation capabilities of Large Language Models (LLMs) in various contexts. Feng (2023) [25] proposed a scalable system that uses social media data to evaluate ChatGPT's performance on tasks such as solving academic exercises and debugging code, without, however, focusing on data analysis or on the use of local LLMs. Gu (2023) [26] presented a compiler testing approach using encoder-decoder models, together with a code cleaning strategy for improving dataset quality, which aligns with this thesis's orientation toward LLM-based code generation. Ross (2023) [27] presented a prototype system that examines how professionals interact conversationally with code-aware LLMs, suggesting their usefulness as productivity tools for software engineers. Similarly, Soliman (2024) [28] studied hybrid LLMs for higher accuracy in code generation, evaluating the models on standard datasets [29] [30].
Pinna (2024) [31] studies LLM-based code generation from problem descriptions, addressing the challenge of incorrect model outputs and reporting improved code quality through hybrid approaches. This thesis is consistent with that idea, combining LLMs with additional software methodologies. Complementary efforts include the publication by Yu (2024) [32] on evaluating code generation models, and the study by Omari (2024) [33] on using ChatGPT to detect and fix bugs in simple Python programs. Both studies highlight the usefulness of LLMs in software development. Although these works do not focus on local language models or specifically on data analysis (both of which are key elements of this thesis), together they enrich it; the thesis extends their findings by applying code generation with local LLMs to scalable data analysis scenarios.
Fan (2023) [34] investigated the use of large language models in software engineering, identifying challenges and pointing out the need for hybrid approaches that combine LLMs with traditional methodologies for tasks such as code writing, design, and bug fixing. Wong (2023) [35] analyzed NLP techniques, which include LLMs trained on large code datasets, and their roles in AI-assisted programming, covering applications such as code generation and bug detection. That work also refers to artificial intelligence tools such as GitHub Copilot [36] and DeepMind AlphaCode [37]. Wang (2023) [38] and Liu (2024) [39] further examined LLM-based code generation, stressing the need for better techniques for evaluating the outputs that the models produce.
Prompt engineering is a critical element of the system proposed in this thesis. It involves designing and refining the inputs given to AI models to ensure accurate, relevant, and high-quality outputs [40] [41]. By shaping prompts effectively, developers can guide language models to better understand the language, context, and intent behind a request, improving the quality of responses in various applications, such as software development [42] [43]. This method can reduce the need for extensive post-processing of model responses, leading to more efficient workflows. Various prompting techniques have been developed, such as zero-shot, few-shot, and chain-of-thought prompting, with the aim of achieving the best possible interactions with a model, always taking into account the needs of each case. Ongoing research on prompting is expected to propose new strategies for improving its effectiveness [44] [45] [46]. As large language models evolve, prompt engineering will play an increasingly important role in optimizing their capabilities.
Thesis System Description and Implementation Details
As noted above, the system proposed in this thesis consists of two parts. The first is a software base for optimal data management. The second (and main) part is the intelligent data analysis system, which uses language models. The second part relies on the first in order to operate.
Starting with the data management software, it is designed to support the data interoperability goals of the final platform. Its main responsibility is to satisfy the data quality requirements that are specific to each source, whether that source is a CI or a business. The software helps prepare data from various sources correctly, maintains the relevant metadata for all incoming data streams, and makes the cleaned and processed datasets available. The complete architecture and data flow of the optimal data management software are shown in Figure 3.2.
Apache Spark [47] was chosen as the technology on which the data management software is based, as it matches the needs of the platform. Spark supports a variety of programming languages, such as Java, Scala, Python, and R, and is backed by extensive documentation and a large user community. These characteristics contribute to its popularity and ease of adoption. In addition, Spark can provide advanced analytics, which makes it suitable for complex data processing tasks. While other frameworks have their own advantages and limitations, Spark stands out for its ease of use, scalability, and efficient execution times [48]. The data management software is structured around three main subsystems that work together to fulfill its goals: the Pre-Processing and Filtering Tool, the Virtual Data Repository, and the Virtual Data Container.
The Pre-Processing and Filtering Tool is the first subsystem of the optimal data management software and is responsible for preparing the datasets received from external sources. Designed with a generic and adaptable structure, the tool can handle a wide range of data types effectively. After receiving a complete dataset, it builds a dataframe, which serves as a tabular representation of all the collected data. To maintain data consistency, the tool examines the metadata of each dataset to determine the appropriate data type for each column. When inconsistencies are found, it performs corrections specific to each data type, an important step for ensuring the reliability of subsequent processing tasks, which depend on the structural integrity of the dataset. Once the data types are aligned, the tool begins the cleaning and filtering process, applying a set of standard pre-processing techniques to prepare the dataset for further use:
• Removing whitespace from all cells that contain string values, ensuring consistent formatting.
• Converting empty cells and occurrences of the string 'NULL' to NaN (Not a Number) values in all columns.
• Deleting records (rows) that either contain no datetime values or contain invalid datetime entries.
• Converting valid datetime values to UTC, to ensure consistency and support standardization.
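The four cleaning steps listed above can be sketched as follows. This is a minimal plain-Python illustration over a list of dict records, not the Spark-based implementation of the thesis. The function name `clean_rows`, the default column name `timestamp`, the ISO 8601 input format, and the choice to treat timezone-naive values as UTC are all assumptions of this sketch.

```python
import math
from datetime import datetime, timezone


def clean_rows(rows, datetime_column="timestamp"):
    """Apply the four cleaning steps to a list of dict records (illustrative only)."""
    cleaned = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()              # step 1: trim whitespace in strings
            if value is None or value in ("", "NULL"):
                value = math.nan                   # step 2: empty / 'NULL' becomes NaN
            out[key] = value
        raw = out.get(datetime_column)
        if not isinstance(raw, str):
            continue                               # step 3: drop rows with no datetime
        try:
            parsed = datetime.fromisoformat(raw)
        except ValueError:
            continue                               # step 3: drop rows with invalid datetime
        if parsed.tzinfo is None:
            # Assumption of this sketch: naive timestamps are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        out[datetime_column] = parsed.astimezone(timezone.utc)  # step 4: normalize to UTC
        cleaned.append(out)
    return cleaned


rows = [
    {"timestamp": " 2024-05-01T12:00:00+02:00 ", "sensor": " a1 ", "value": "NULL"},
    {"timestamp": "not a date", "sensor": "a2", "value": "3"},
    {"timestamp": "", "sensor": "a3", "value": "4"},
]
cleaned = clean_rows(rows)
```

Only the first record survives in this example: the second has an unparsable datetime and the third has none. In a Spark setting the same steps would normally be expressed as column-level transformations rather than a Python loop.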
The second subsystem of the scalable data management layer is the Virtual Data Repository (VDR), which serves as a temporary storage unit for all datasets that have been pre-processed, cleaned, and filtered by the Pre-Processing and Filtering Tool. Once a dataset completes these steps, it is stored in the VDR together with its corresponding correlation matrix, which records the relationships between the columns.
To meet the performance and scalability requirements of the data management software, the VDR has been implemented using MongoDB [49], a widely used database. MongoDB was chosen for its suitability for auto-scaling scenarios, its sharding functionality, and its flexible configuration. In addition, MongoDB's compatibility with Kubernetes [50], a container orchestration platform, strengthens its ability to operate in dynamic, distributed environments. As a result, the VDR was deployed on a Kubernetes cluster, as well as on a Docker cluster, taking advantage of the platform's advanced load balancing, replication, and scaling capabilities, thereby ensuring consistent and efficient data management.
The third and final subsystem is the Virtual Data Container (VDC), which plays a decisive role in giving users access to the data stored in the Virtual Data Repository. The VDC is a flexible, general-purpose subsystem responsible for further processing and filtering the data according to the specific queries defined by data consumers. The filtering rules serve two main functions. First, they allow a user to extract specific information, creating a customized "data pool" that matches that user's specific requirements. Second, the user's analysis queries help with the additional filtering of erroneous data that users know about from experience, such as extreme outliers (for example, outdoor temperatures of -100 degrees Celsius), which could indicate sensor malfunction.
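A user-defined plausibility rule of the kind described above could look like the following sketch. The function `apply_range_rule`, the column name `outdoor_temp_c`, and the bounds of -60 to 60 degrees Celsius are hypothetical choices for illustration; the thesis leaves such rules to the data consumer.

```python
def apply_range_rule(rows, column, minimum, maximum):
    """Split rows into (kept, rejected) using a plausibility range on one column."""
    kept, rejected = [], []
    for row in rows:
        value = row.get(column)
        # Non-numeric or missing values fail the rule and are set aside as well.
        if isinstance(value, (int, float)) and minimum <= value <= maximum:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected


readings = [
    {"sensor": "s1", "outdoor_temp_c": 21.4},
    {"sensor": "s2", "outdoor_temp_c": -100.0},  # implausible reading, likely a faulty sensor
    {"sensor": "s3", "outdoor_temp_c": 35.0},
]
kept, rejected = apply_range_rule(readings, "outdoor_temp_c", -60.0, 60.0)
```

Returning the rejected rows instead of discarding them keeps the evidence of a possible sensor malfunction available for inspection.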
Moving on to the second and main part of the proposed system, as already mentioned, the ultimate goal is to design a scalable software tool for data analysis that can effectively handle datasets of various sizes. The final implementation of the system in this PhD work aims to let users submit data analysis queries in natural language. These queries are interpreted and converted into executable code by a local large language model (offline LLM). The generated code is sent to the data management software described above for execution. This process lets users submit their own analysis rules simply by describing them in natural language, improving the accessibility and usability of the system.
In summary, the final part of the study begins by establishing its main objective: evaluating the ability of a local LLM to generate valid code for optimal data analysis, based on analysis queries expressed in natural language. The next step of the study is the selection of suitable datasets for testing and evaluation. This is followed by the definition of five unique queries per dataset. Next, an implementation plan is developed, which sets the number of trials and defines the communication process between the LLM and the data management software. The evaluation metrics are also defined, with a justification for their selection. The results are then collected and analyzed to evaluate the performance of the LLM. Finally, conclusions are drawn from the findings, and suggestions for future improvements are presented.
In detail, the methodology followed in this study is as follows:
1. Objective Definition: The final and main part of this study aims to evaluate the effectiveness and efficiency of a local large language model (offline LLM) in generating code for data analysis tasks. The LLMs used in this evaluation are Codestral by Mistral AI and Qwen 2.5 Coder by Alibaba. These models are tested on their ability to produce Python code using the PySpark framework, with a focus on data analysis tasks. The reasons for choosing Codestral, Qwen, and PySpark are discussed in subsection 3.4.1.
2. Data Selection: Five publicly available datasets were selected to test and validate the proposed approach. These datasets cover a range of subject areas, such as social media analysis, weather data, and supermarket sales records. All the selected datasets are presented in subsection 3.3.2. Using multiple datasets instead of a single one helps assess the ability of the LLMs to generalize their performance across different types of data, ensuring that their performance is not limited to one domain.
3. Query Design: For each of the five datasets, a set of five queries was created, giving a total of 25 queries used for testing. These queries are grouped into three categories according to their complexity: "Basic", "Intermediate", and "Advanced". This categorization reflects the difficulty of the code generation tasks required of the LLM. The categories are defined as follows:
   • Basic: These queries focus on simple tasks, such as filtering data, counting records, or extracting distinct values. They require fundamental operations that help reveal the basic structure and contents of the dataset.
   • Intermediate: These queries involve moderately complex tasks, such as grouping data, computing sums, or applying basic arithmetic functions. Although more complex than the basic queries, they still rely on standard data processing methods.
   • Advanced: These queries involve more sophisticated operations, such as multi-level grouping, column transformations (e.g., merging or exploding), and statistical analyses. They were designed to assess the ability of the LLM to produce code for advanced data manipulation and, ultimately, to draw meaningful conclusions from the data.
   The complete set of the 25 authored queries is presented in subsection 3.4.3.
4. Execution Plan: Each of the 25 queries will be submitted to the local LLM ten times. During each repetition, the entire process will be executed and the results recorded. This repetition is intended to assess the consistency of the model by observing variations in the generated code, as well as the stability and reproducibility of the outputs. This approach is in line with established experimental practice, in which repeated trials strengthen the reliability of the findings [51]. In total, 250 trials will be executed (ten for each of the 25 queries), based on the datasets selected for this study.
5. Evaluation Metrics: The outcome of each trial will be evaluated using a set of defined evaluation metrics. Once all 250 trials are complete, the results will be gathered into a single evaluation dataset. Since the study is carried out independently for two local LLMs, the total number of trials will reach 500. The performance of each LLM will be analyzed according to the following evaluation criteria:
   • Functional Correctness: This metric assesses whether the generated code produces the expected result. In other words, it examines whether the code successfully delivers the result the user had in mind, based on the original natural language query.
   • Readability: This metric refers to how easily a human can read and understand the generated code. The score is computed by a custom function implemented within the data management platform, which is presented in Appendix A.1. The function examines three main points: the length of each line of code, the depth of function call chains, and the level of nested code blocks. Penalties are applied to code with lines longer than 80 characters, method chains with more than three calls, and nesting beyond the second level. The function returns a score from '1' to '3', where '3' indicates high readability.
   • Efficiency: This refers to how efficiently the code generation process uses computing resources. The main performance indicators include response time, GPU and CPU usage, and memory usage for both processing units. These are tracked with a resource monitoring tool developed in Python. This metric helps assess the computational cost associated with each trial.
   • Performance by Query Difficulty: This metric assesses how well the LLM handles queries at the different complexity levels (Basic, Intermediate, and Advanced) defined earlier. The results give insight into the model's ability to adapt to different analysis query requirements and allow performance comparisons across these categories.
   • Automation: This metric assesses whether the generated code can be executed automatically or with minimal manual adjustments. In some cases, the model may include natural language explanations outside Python comments, despite explicit instructions to avoid such output. The LLM may also use variable names that differ from those specified in the prompt. When such deviations occur, manual intervention is needed to modify the generated code before the process can continue.
   • Error Handling: This metric checks whether the generated code executes without errors. It identifies any problems that arise during execution, from minor warnings to critical errors that may interrupt or terminate the process.
6. Data Collection and Analysis: The final step of the methodology involves collecting the results of all 250 trials (per model), merging them into a single dataset, and carrying out an exploratory analysis. Each LLM will therefore have a dataset of 250 rows. This analysis will offer valuable insight into the performance of the LLMs across the various evaluation metrics. Ultimately, the core objective of this part of the thesis will be assessed: whether local language models can effectively support data science by generating code for data analysis tasks.
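The readability metric described in step 5 can be approximated with the sketch below. It is not the function from Appendix A.1, which is not reproduced in this summary. The three thresholds (80 characters, three chained calls, two nesting levels) come from the text, while the scoring rule is an assumption of this sketch: the score starts at 3, loses one point per violated criterion, and never drops below 1. The names `chain_depth`, `nesting_depth`, and `readability_score` are hypothetical.

```python
import ast

# Statement types that open a nested block in this sketch.
BLOCKS = (ast.If, ast.For, ast.While, ast.With, ast.Try, ast.FunctionDef)


def chain_depth(node):
    """Count chained method calls ending at this node, e.g. a.b().c() has depth 2."""
    depth = 0
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        depth += 1
        node = node.func.value
    return depth


def nesting_depth(node, current=0):
    """Return the deepest block nesting level below `node`.

    Note: `elif` is represented as a nested `if` in the AST, so it counts
    as one extra level here.
    """
    deepest = current
    for child in ast.iter_child_nodes(node):
        level = current + 1 if isinstance(child, BLOCKS) else current
        deepest = max(deepest, nesting_depth(child, level))
    return deepest


def readability_score(code, max_line=80, max_chain=3, max_nesting=2):
    """Score code from 1 (hard to read) to 3 (easy to read)."""
    tree = ast.parse(code)
    penalties = 0
    if any(len(line) > max_line for line in code.splitlines()):
        penalties += 1
    if max(chain_depth(n) for n in ast.walk(tree)) > max_chain:
        penalties += 1
    if nesting_depth(tree) > max_nesting:
        penalties += 1
    return max(1, 3 - penalties)


clean = "result = df.filter(df.age > 30).count()\n"
chained = "r = df.filter(df.a > 1).groupBy('b').count().orderBy('b').limit(5)\n"
nested = "for a in x:\n    for b in a:\n        if b:\n            print(b)\n"
scores = [readability_score(clean), readability_score(chained), readability_score(nested)]
```

Working on the syntax tree rather than on raw text means that a method chain split across several lines is still counted as one chain, which matters for PySpark code that is often formatted that way.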
Figure 3.5 presents a high-level overview of the final architecture developed in this thesis. The process begins when the end user submits a query in natural language. A relevant dataset has been preselected, which makes it possible to identify both the dataset summary and the main data file of the dataset that is required for executing the code within the data management platform. The user's query is combined with the dataset summary into a single message, which is sent to the LLM during the prompt engineering phase, as described in subsection 3.4.2. This prompt does not merely join the query with the summary; it also gives critical instructions to the Codestral and Qwen models.
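How a query, a dataset summary, and fixed instructions might be joined into one prompt is sketched below. The actual prompt is given in subsection 3.4.2 and is not reproduced here, so the instruction wording, the variable names `df` and `result`, and the helper `build_prompt` are all assumptions. They only mirror two constraints the text mentions: explanations should stay inside Python comments, and the model should use the variable names that the prompt specifies.

```python
import json

# Hypothetical instruction block; the real wording is defined in the thesis.
INSTRUCTIONS = (
    "You write PySpark code.\n"
    "The dataset is already loaded in a DataFrame named `df`.\n"
    "Store the final output in a variable named `result`.\n"
    "Return only Python code; put any explanation in Python comments."
)


def build_prompt(user_query, dataset_summary):
    """Combine fixed instructions, the dataset summary, and the user's query."""
    summary_text = json.dumps(dataset_summary, indent=2, sort_keys=True)
    return (
        f"{INSTRUCTIONS}\n\n"
        f"Dataset summary:\n{summary_text}\n\n"
        f"User query:\n{user_query.strip()}\n"
    )


prompt = build_prompt(
    "  How many distinct cities appear in the data?  ",
    {"row_count": 120, "columns": {"city": ["str"], "temp_c": ["float"]}},
)
```

Serializing the summary with sorted keys keeps the prompt identical across repeated trials of the same query, which helps when the goal is to measure variation in the model's output and not in its input.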
Once the LLM receives the prompt message, it begins generating its response. During this
phase, a simple Python script monitors the computational resources of the LLM server.
After the model has produced its response, the response is sent back through a software
communication channel, together with the resource monitoring data. The user then examines
the response to determine whether the code can proceed directly to the next phase
(indicating full automation), or whether it requires minor adjustments (such as removing
redundant text that the LLM has not marked as such), in which case the process is
classified as semi-automatic and full automation is ruled out. After this step, the
generated code and the resource monitoring results are merged into a single JSON object
(see Example 3.1). The communication interface with the LLM concludes by transmitting the
newly formed JSON object to the data management platform, which triggers the execution of
a new PySpark job. The predefined dataset is then retrieved and processed accordingly,
while the received JSON object is parsed to extract the generated code and the resource
monitoring results.
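The hand-off between the LLM interface and the data management platform can be
illustrated with a minimal sketch. The field names (generated_code, resource_metrics) and
the metric keys are assumptions made for illustration only; the actual structure is the
one given in Example 3.1 of the thesis.

```python
import json


def build_llm_payload(generated_code: str, resource_metrics: dict) -> str:
    """Bundle the generated code and the monitoring results into one JSON string."""
    payload = {
        "generated_code": generated_code,      # hypothetical field name
        "resource_metrics": resource_metrics,  # hypothetical field name
    }
    return json.dumps(payload)


def parse_llm_payload(raw: str) -> tuple[str, dict]:
    """Recover the code and the metrics on the data management side."""
    payload = json.loads(raw)
    return payload["generated_code"], payload["resource_metrics"]


# Illustrative values only; they are not measurements from the thesis.
raw = build_llm_payload(
    "result = df.groupBy('category').count()",
    {"cpu_percent": 41.5, "ram_mb": 2048, "duration_s": 12.3},
)
code, metrics = parse_llm_payload(raw)
```

Keeping the code and the metrics in one object means the PySpark job needs a single
input to both run the analysis and record how costly the generation step was.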
The next step involves applying the generated data analysis code, which executes the
specified commands on the loaded dataset in order to extract the desired analysis
results. Once this stage is complete, the analysis results are exported. A final JSON
object is created, which serves as the template for storing the results of each trial and
includes the resource monitoring metrics of the LLM server, the generated code, and other
relevant attributes of the process. This object, together with the analysis results from
the code execution stage, is stored locally. Both files are examined by the user, who
evaluates the analysis results to determine whether they are accurate and consistent with
the intended outcome, based on the content of the original query. In every case, an
annotation regarding functional correctness is added to the trial's JSON result object,
with the value 'True' for correct results and 'False' for incorrect ones. This step
completes the trial flow, as shown in Figure 3.5. As mentioned earlier, five queries will
be tested for each of the five datasets, with ten repetitions per query. This procedure
will be repeated twice, once for each LLM, resulting in a total of 500 trials: 250 for
each model.
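The trial count stated above (5 datasets x 5 queries x 10 repetitions x 2 models) can be
checked with a short enumeration. The dataset names below are placeholders, not the
names of the datasets used in the thesis.

```python
from itertools import product

MODELS = ["Codestral", "Qwen 2.5 Coder"]
DATASETS = [f"dataset_{i}" for i in range(1, 6)]  # placeholder names
QUERIES_PER_DATASET = 5
REPETITIONS = 10

# One entry per individual trial in the evaluation plan.
trials = [
    {"model": model, "dataset": dataset, "query": query, "repetition": repetition}
    for model, dataset, query, repetition in product(
        MODELS,
        DATASETS,
        range(1, QUERIES_PER_DATASET + 1),
        range(1, REPETITIONS + 1),
    )
]

# Number of trials assigned to each model.
per_model = {m: sum(1 for t in trials if t["model"] == m) for m in MODELS}
```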
As already noted at several points above, five different datasets were selected for the
evaluation process in this study. Each dataset is briefly described in the prompt sent to
the LLM, as discussed in subsection 3.4.2. The code produced by the LLM is executed
through a PySpark job, which loads the corresponding dataset and runs the generated code.
Although Apache Spark, the underlying technology of the proposed platform, is designed
for managing large-scale data, dataset size was not a selection criterion for this part
of the thesis's evaluation. This is because the focus of this part is the LLM's ability
to generate code for data analysis tasks, not the full processing of entire datasets. The
data is not loaded into the LLM itself, which makes its size (essentially) irrelevant to
the evaluation. Data volume mattered for the evaluation of the first part of this thesis,
the efficient data management software. The present datasets were obtained from Kaggle
and the Data Playground of Maven Analytics, two platforms that provide free and publicly
accessible data.
In addition, as already noted, the offline LLMs selected for this study are Codestral by
Mistral AI [52] and Qwen 2.5 Coder by Alibaba [53]. In this context, each local language
model is set up to return responses in the form of code, which are then executed within
the data management platform. Over the course of this study, the preference for Apache
Spark [47] as the most suitable tool for managing and executing the data tasks was
reinforced. As already documented, Apache Spark is a powerful platform optimized for
high-performance computing. It supports in-memory computation, which significantly
shortens execution time, especially for tasks with iterative computations. Spark is
highly scalable and can efficiently handle large datasets through distributed computing
systems. Moreover, its versatility is a significant advantage, as it supports a wide
range of operations, such as batch processing, real-time streaming, and graph analysis,
making it a comprehensive solution for scalable data workflows.
To improve the code generation capabilities of the Codestral and Qwen language models,
the aforementioned purpose-built prompt message was developed; it is used to initiate the
interaction with each model, as shown in Listing 3.2. Each time the user submits a query
in natural language, it is combined with this message in order to guide the model toward
producing accurate and relevant code. The approach is based on a 'few-shot' prompt
engineering strategy, in which the model is given key characteristics of the dataset to
be analyzed. The message includes a brief presentation of the dataset's structure,
format, and columns, as well as a description of each column. This ensures that the
generated code is appropriately tailored to the specific characteristics of each dataset.
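As a rough sketch of how such a prompt could be assembled, the function below joins fixed
instructions, a dataset summary, and the user's query into one message. The layout, the
instruction text, and the sample dataset are all assumptions for illustration; the
prompt actually used is the one in Listing 3.2.

```python
def build_prompt(instructions: str, dataset_summary: dict, user_query: str) -> str:
    """Combine fixed instructions, a dataset summary and the user's query."""
    column_lines = [
        f"- {name}: {description}"
        for name, description in dataset_summary["columns"].items()
    ]
    return "\n".join(
        [
            instructions,
            f"Dataset: {dataset_summary['name']} ({dataset_summary['format']})",
            "Columns:",
            *column_lines,
            f"Question: {user_query}",
        ]
    )


# Hypothetical summary; not one of the datasets evaluated in the thesis.
summary = {
    "name": "sales",
    "format": "CSV",
    "columns": {
        "region": "sales region name",
        "amount": "order value in EUR",
    },
}
prompt = build_prompt(
    "Return only PySpark code and store the final result in `result`.",
    summary,
    "What is the total amount per region?",
)
```

Only the summary reaches the model, never the data itself, which is consistent with the
point above that dataset size is largely irrelevant to the LLM evaluation.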
Results and Conclusions
The proposal of this PhD thesis, which concerns scalable data management and AI-enhanced
analytical processing, may constitute a promising alternative approach in the field of
data analysis, standardization, and data quality assessment. Its ability to analyze
entire datasets, to support user-defined quality and analysis rules, and to operate
independently of data volume makes it a useful framework for both data analysts and
engineers. Furthermore, the conclusions drawn from its use are expected to highlight the
benefits of AI-assisted dataset analysis, through the formulation of queries in natural
language and their automatic conversion into executable analytical code by a local LLM.
The development of this framework consists of an initial implementation presented in a
2023 publication [1], as well as an extensive study published in 2025 [4].
Regarding the optimal data management software, a clear linear relationship was found
between completion time and the size of the incoming data. Specifically, for every minute
that passes, approximately 1 GB of data is processed, filtered, cleaned, and stored. As
the dataset size (in GB) increases, the completion time (in minutes) increases
proportionally. This linear growth was also set as the 'acceptance threshold' for a
successful implementation. The framework was expected to operate at a rate of '1 GB per
minute' (given the constraints on computational resources), a target that was met,
fulfilling expectations. It is also worth noting that the data management software showed
no degradation in performance, consistently completing all tasks and tests to which it
was subjected. Regardless of input size, the system remained fully operational. Moreover,
each Spark driver used only 5 GB of RAM and 6 processor cores from the system, as had
been set from the outset as a 'constraint', which highlights the efficiency of the
proposed architecture.
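The linear relationship described above can be written as a one-line model. The sketch
below assumes the approximate rate of 1 GB per minute reported in the text; the
tolerance parameter is an illustrative assumption and is not part of the thesis's
acceptance criterion.

```python
THROUGHPUT_GB_PER_MIN = 1.0  # approximate rate reported in the text


def expected_minutes(size_gb: float) -> float:
    """Expected completion time under the linear 1 GB per minute model."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb / THROUGHPUT_GB_PER_MIN


def meets_threshold(size_gb: float, measured_minutes: float,
                    tolerance: float = 0.10) -> bool:
    """Check a measured run against the linear target.

    `tolerance` is a hypothetical allowance added for this example.
    """
    return measured_minutes <= expected_minutes(size_gb) * (1 + tolerance)
```

For example, a 20 GB input would be expected to complete in about 20 minutes under this
model.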
As for the data analysis system based on local large language models, the main and final
component of this thesis's framework, it showed clear signs of functional efficiency. The
results highlighted the high performance of the two tested models, mainly in producing
functional code scripts that align with the specified objectives. In the case of
Codestral, 87% of the 250 test cases were fully successful, indicating both functional
correctness and error-free execution. Correspondingly, Qwen 2.5 Coder achieved a full
success rate of 80% on the same set of tests. In addition, the code generated by both
models generally received readability scores between '2' and '3', which suggests that it
was easy for humans to understand. Only 2 of Qwen's 250 trials received the lowest
readability score of '1'. With regard to automation, Codestral achieved automation in 80%
of its outputs, while Qwen surpassed this rate with 96.5%, reducing the need for human
intervention. Overall, considering functional correctness regardless of minor errors or
manual modifications, Codestral reached a success rate of 91%, while Qwen achieved 90%.
These findings demonstrate the reliable code generation capability of both models, even
in the context of producing PySpark code, which is usually more complex than plain
Python.
Future study and extension of this thesis will focus on improving the code generation
process, which forms the core of the study, as the primary means for the end user to
submit queries to the model seamlessly. The main priority will be the optimization of the
Codestral and Qwen 2.5 Coder models, with the goal of achieving even higher performance.
Should new local LLMs with comparable or superior effectiveness emerge, they may also be
considered for evaluation. The primary goal of the adaptation process will be to further
raise the level of automation, by minimizing the generation of redundant text and by
ensuring that the final results are consistently stored in a predefined variable that the
data management software recognizes, as implemented in the present study.
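The two automation concerns named above, redundant text around the code and a predefined
result variable, can be illustrated with a small sketch. The variable name `result`, the
fence-based extraction rule, and the sample response are assumptions for this example;
the thesis does not specify them here. Running generated code with exec is only
acceptable for trusted or sandboxed input.

```python
import re

FENCE = "`" * 3  # markdown code fence marker, built programmatically
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Drop surrounding prose and keep the first fenced code block, if any."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else response.strip()


def run_generated_code(code: str, context: dict, result_name: str = "result"):
    """Execute the code and return the value of the agreed result variable."""
    namespace = dict(context)
    exec(code, namespace)  # assumption: the code is trusted or sandboxed
    if result_name not in namespace:
        raise KeyError(f"generated code did not define '{result_name}'")
    return namespace[result_name]


# Hypothetical model response with redundant text around the code.
response = (
    "Here is the code:\n"
    + FENCE + "python\nresult = sum(values) / len(values)\n" + FENCE
    + "\nHope this helps!"
)
answer = run_generated_code(extract_code(response), {"values": [2, 4, 6]})
```

When the model reliably writes to the agreed variable and emits no extra prose, the
extraction step becomes a no-op and the flow needs no manual adjustment.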
Although both models showed high accuracy in producing error-free outputs and reaching
correct final results, fine-tuning is expected to improve this performance further. A
fully fine-tuned model running locally on a more powerful graphics card may perform
better on almost all evaluation metrics, in the context of a future study that continues
this thesis. However, appropriate safeguards must be taken when adapting the model, in
order to avoid introducing biases and thus preserve the model's ability to produce
reliable and generalizable results across different data contexts.
In parallel, extending the proposed system to include a dedicated user interface (UI)
could significantly increase its practical value. At present, the research framework (as
depicted in Fig. 3.5) is designed primarily for evaluating the capabilities of the models
in a controlled environment. The transition to a market-ready solution will require the
development of a user-friendly interface that serves both technical and non-technical
users. Such an interface would strengthen real-world applicability and usability. Beyond
UI development, future work should consider integrating additional features and
evaluation components, which are necessary for turning the present study into a fully
integrated solution (a digital product) for businesses and organizations. The system's
independence from specific models and its scalable design provide a clear path for
evolving the existing prototype into a fully functional platform suitable for enterprise
applications.
In conclusion, this PhD thesis could serve as the basis for future research efforts that
seek to establish local large language models as a viable alternative for data analysis
processes based on code generation from natural language queries. Instead of moving the
data to the LLM, the present approach proposes moving the LLM's code to the data. The
investigation of the applicability of LLMs to the field of Data Science is expected to
continue. Future applications are likely to make full use of the power of language models
for optimal data processing and analysis. Over time, this thesis could become one of the
established methodologies for data analysis and management. One thing is certain: large
language models will continue to extend their influence across all scientific fields,
with Data Science being, possibly, a key area of integration. How deeply LLMs will become
embedded in modern data analysis trends remains to be seen.
Part I
Introduction
Chapter 1
Introduction: Towards Efficient Data Management and Analysis
1.1 Scalable Data Profiling for Quality Analytics Extraction
Portions of the following context, originally published in [2], have been adapted for
inclusion in the current PhD thesis, to ensure coherence with its overall structure and
narrative.
In today’s rapidly evolving technological landscape, data stands out as one of the most
valuable assets [6]. As advancements in technology occur with increasing frequency [7],
the creation and dissemination of data have become integral components of emerging
products, services, systems, and frameworks. Consequently, the volume of data gen-
erated on a daily basis worldwide is staggering. Recent estimates suggest that around
328.77 million terabytes of data are produced each day [8], including newly created,
captured, duplicated, or consumed information. This immense flow of data has already
demonstrated substantial value across various domains.
Within the industrial sector, organizations, regardless of size or specialization,
are progressively acknowledging the critical role data plays in supporting evidence-based
decisions and maintaining a competitive advantage. In fields such as finance, healthcare,
and manufacturing, data drives innovation by powering advanced analytics, predictive
modeling, and automation technologies that streamline operations. Similarly, the market-
ing and retail industries utilize data to analyze consumer behavior, deliver personalized
experiences, and enhance targeting strategies. Beyond the private sector, government
agencies also capitalize on data to improve public services, design data-driven policies,
and promote economic growth. As the digital realm continues to expand, data emerges
as a vital resource that fuels development, increases efficiency, and supports progress
across a wide array of sectors.
Maintaining high data quality is essential for all data consumers. In the context of
data-driven decision-making, characteristics such as accuracy, reliability, and complete-
ness play a pivotal role in shaping the outcomes and insights produced by analytical
processes. Regardless of their specific role or domain (whether in business, research,
policymaking, analysis, science, or engineering), data consumers must depend on high-
quality information to make sound decisions and establish effective strategies. When
data contains errors or inconsistencies, it can lead to misleading analyses and incorrect
conclusions, which may have serious implications. As such, gaining insight into the
quality of a dataset is not merely a best practice, but a fundamental necessity in today’s
data-centric environment. Considering these factors, there is a clear need for solutions
capable of efficiently handling datasets, while simultaneously providing comprehensive
assessments of data quality.
To address this need, various methodologies have been developed for extracting quality-
related insights from data. Among the most widely recognized is data profiling. While
definitions may vary slightly among sources, a common understanding exists. For example,
IBM defines data profiling as the process of reviewing and/or cleaning data in order to
improve the understanding of its structure and to uphold data quality standards within an
organization [54]. The core aim of data profiling is to develop a detailed understanding
of the dataset's quality by applying techniques that summarize, assess, and evaluate the
data. This process typically involves measuring key quality dimensions such as accuracy,
consistency, and timeliness, in order to detect issues such as inconsistencies, errors,
or missing values. However, users might wish to obtain further insights into a given
dataset, deepening their understanding of its quality.
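As a small illustration of one profiling measurement, the sketch below computes per-column
completeness (the share of non-missing values). It is a generic example under the
assumption that None and empty strings count as missing; it is not the profiling
mechanism of the proposed system, and the sample rows are invented.

```python
def profile_completeness(rows: list[dict]) -> dict[str, float]:
    """Share of non-missing values per column.

    Assumption for this example: None and "" are treated as missing.
    Columns are taken from the first row.
    """
    if not rows:
        return {}
    total = len(rows)
    return {
        column: sum(1 for row in rows if row.get(column) not in (None, "")) / total
        for column in rows[0].keys()
    }


# Invented sample data for illustration.
rows = [
    {"id": 1, "city": "Athens", "temp": 21.5},
    {"id": 2, "city": "", "temp": None},
    {"id": 3, "city": "Patras", "temp": 19.0},
    {"id": 4, "city": "Volos", "temp": None},
]
report = profile_completeness(rows)
```

A user-defined quality rule could then be a simple condition on such a report, for
instance requiring every column to be at least 90% complete.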
Considering all the aforementioned factors, there is a strong need to explore a scal-
able, or volume-independent, data analytics solution that empowers users to define cus-
tom quality queries with ease, tailored to their specific analytical needs. This PhD thesis
introduces a software system designed to handle datasets in a scalable manner, and to
perform comprehensive quality assessments across entire datasets with the assistance
of AI language models. The proposed system aligns with established definitions of data
profiling by extracting quality-related insights from datasets, while offering the flexibil-
ity for users to specify their own quality criteria. It is capable of applying its profiling
mechanisms across full datasets, regardless of volume, and operates efficiently even in
low-resource computing environments, ensuring both scalability and performance. The
intuitive manner in which the data analysis is carried out is also the main highlight
of the current thesis.
The following two sections present an introduction to the system’s scalable data man-
agement infrastructure, as well as its AI-powered data analysis approach.
1.2 A Scalable Data Management and Interoperability Solution
The content presented here is based on work published in [1], with a series of adjust-
ments made to maintain consistency throughout the dissertation.
A need for scalable big data management solutions is evident in the field of
critical infrastructures (CIs). This term refers to the systems, networks, and assets
that are vital to the functioning of modern society and its key sectors. These include,
but are not limited to, energy, transportation, water supply, telecommunications, emer-
gency response, finance, healthcare, agriculture and food supply, government facilities,
and information technology [9]. Given their foundational role, any disruption to these
infrastructures could lead to societal and economic consequences [10]. It is therefore
appropriate to regard CIs as the pillars upon which the global economy is built. From a data
perspective, CIs are prolific producers of information, generating large volumes of data
daily across domains such as operations, monitoring, maintenance, security, and more.
As digital transformation and interconnectivity continue to advance, the rate at which
CI-related data is generated is increasing rapidly. Effectively managing and utilizing
this data is essential for optimizing performance, strengthening security, and supporting
informed decision-making.
To properly handle, and also comprehend, the sheer volume and complexity of data
produced within the CI sector, the integration of data-driven intelligence seems
imperative. This approach requires aggregating diverse data streams, originating both from
the infrastructure systems themselves and from the stakeholders involved in their oper-
ation and oversight. Many CI domains are characterized by a multitude of data sources,
as illustrated in Figure 1.1. A prominent example can be found in port infrastructures,
which belong to the transportation sector. Furthermore, the adoption of Internet of Things
(IoT) technologies has accelerated the implementation of smart logistics and sensor-based
monitoring systems within CI environments. These advances contribute to the continu-
ous production of high-volume heterogeneous data that includes communications, sensor
readings, and control signals.
Figure 1.1. The 16 critical infrastructure sectors of global industry and economy [1].
Such data streams support real-time decision-making, and also enable efficient infor-
mation exchange among various entities in the CI supply chain. Notably, data in CIs is
collected through a variety of mechanisms and is stored in multiple formats, including
structured, semi-structured, and unstructured forms. Considering that there are sixteen
designated CI sectors, it becomes evident that the scale of data generated is immense.
Consequently, ensuring the secure and effective management and analysis of this data is
both a critical challenge and a necessity. Examining the need for efficient and scalable
data management solutions through the prism of critical infrastructure environments is
a sound approach, since such environments serve as a demanding proving ground. Solutions
that successfully meet the complex requirements of CI sectors inherently demonstrate
their capability to excel in less demanding operational contexts, making CI an ideal
test case for advancing data management technologies.
1.2.1 Data Management Challenges
Before delving into research focused on scalable data solutions, applied to CIs, it is
essential to first acknowledge and address the broader challenges associated with scalable
data management and analytics. Extracting valuable insights from large-scale datasets
is a complex task, presenting a range of obstacles that both researchers and industry
professionals should navigate, in order to support data-informed decision-making. These
challenges span the entire data lifecycle, including data acquisition, storage, processing,
analysis, and interpretation. What follows is an overview and classification of the key
challenges encountered in the tasks of data management and analysis:
Volume, Velocity, Variety, Veracity, and Value: Known as the '5 Vs' of big data,
these dimensions capture its key characteristics and challenges. Volume reflects
the size of data sets, which may span terabytes to exabytes, often requiring signifi-
cant storage capacity and specialized processing systems. Velocity refers to the high
speed at which data are generated, particularly in domains like IoT or social media,
making real-time or near-real-time analysis a demanding task. Variety points to
the diverse formats of data, including structured, semi-structured, and unstruc-
tured types. Examples of such types are text, images, video, and sensor data.
This diversity sometimes complicates integration and analysis. Veracity relates to
the reliability and quality of data, as inconsistencies, noise, or missing information
can lead to inaccurate results, requiring thorough validation and cleaning. Finally,
Value represents the potential to extract meaningful insights from data, though
doing so is not guaranteed and often depends on advanced analytical techniques,
such as machine learning, data mining, and other novel approaches (such as the
current thesis’s study).
Data Management and Infrastructure Challenges: These challenges involve ad-
dressing the rapid increase in data volume, while ensuring that systems can scale
effectively. Managing resources like CPU, GPU, memory, and storage requires care-
ful planning and optimization to maintain performance. Storing, indexing, and
retrieving large amounts of data across distributed storage systems adds further
complexity, especially at scale. A key ongoing challenge is balancing the need for
high-performance data processing with the costs and limitations of the supporting
infrastructure.
Data Integration and Processing Challenges: These arise from the need to apply
advanced algorithms and models for analyzing volume-agnostic data, which often
requires specialized knowledge in areas like machine learning, statistics, and data
science. Integrating data from various sources, each with different formats and
structures, demands reliable tools and methods to ensure consistency and compat-
ibility. Extracting useful insights from large and complex datasets through explo-
ration and visualization adds another layer of difficulty. Moreover, ensuring smooth
interoperability among the different tools and systems used in big data workflows
remains an ongoing challenge.
Data Security, Ethics, and Governance Challenges: These involve managing the
risks associated with storing and processing sensitive information, which requires
strong safeguards to protect data confidentiality, integrity, and compliance with
regulations such as the GDPR. Ethical and legal issues, including data privacy,
algorithmic bias, and the responsible use of data, must also be carefully addressed.
Implementing effective data governance practices is essential to ensure data quality,
security, and adherence to legal standards. In addition, the ongoing shortage of
skilled professionals capable of working with scalable data frameworks adds another
layer of difficulty to managing these challenges effectively.
Environmental and Long-term Challenges: These challenges relate to the energy
demands of large-scale data centers and computational resources used in big data
processing, raising concerns about environmental sustainability. Managing the
long-term preservation of big data is also difficult, as evolving technologies and
changing data formats require reliable strategies to ensure that data remains ac-
cessible and usable for future analysis.
Addressing the challenges of big data analysis requires a multidisciplinary effort that
brings together expertise in computer science, data engineering, machine learning, and
domain-specific knowledge, while also encouraging collaboration among different stake-
holders. As technology advances, new issues are likely to arise, making it crucial for both
researchers and professionals to remain informed and adaptable in the evolving field of
big data.
To better understand the scale of data produced by domains like critical infrastruc-
tures, it is helpful to look at real-world examples. In the energy sector, for instance, the
power grid in the United States alone is estimated to generate between 100 and 1,000
petabytes of data annually [55, 56]. Globally, the energy sector produces an even greater
volume, with estimates ranging from 100 to 200 exabytes per year [57]. This data in-
cludes information on power generation, transmission, distribution, weather conditions,
equipment performance, and consumer usage patterns. Similarly, transportation systems
produce several petabytes of data each year, capturing traffic flow, vehicle emissions, and
travel behavior [58, 59]. The telecommunications sector also generates vast amounts
of information, including call records, messaging activity, and internet usage [60]. As
critical infrastructure sectors increasingly adopt sensors, smart devices, and other IoT
technologies, the volume of data they produce will continue to grow. These developments
offer new opportunities to improve the efficiency, reliability, and security of vital systems
through data-driven insights.
Effectively managing data at scale is a field of research that helps organizations,
companies, and critical infrastructures maintain their security and resilience. By
collecting, storing, and analyzing large amounts of data, these domains
can enhance their operational efficiency, identify and mitigate potential risks, and improve
their responses to incidents. Applying scalable data techniques to process such data offers
the potential to automate decision-making, and manage tasks more efficiently within the
domains’ workflows. For instance, in any given critical infrastructure sector, integrating
not only operational but also global data from various stakeholders (along its value chain)
could significantly support its growth and development. The existing literature provides
several solutions that address the harmonization, interoperability, processing, filtering,
cleaning, and storage of data. Although several scalable data management systems have
been proposed, there is still a gap in research for a system that combines management
with harmonization and AI-enabled analytics operations. This gap in existing scientific
work has motivated the development of a new proposal, aimed at unlocking the potential
of data, enabling end-users to make profiling and analytics queries with ease, while also
optimizing resource use and improving infrastructure management.
1.2.2 Efficient Data Management Layer Proposal
The gap between the scalable handling of data and its utilization, both by the
organizations that generate it and by external beneficiaries, can be bridged by the
efficient and scalable data management layer proposed in this study. Conceptually, it is
a middleware framework, consisting of a software infrastructure that enables
volume-agnostic data management. Initially tailored to relieve CIs of the time-consuming
task of properly and fully managing the data generated within their sectors, it has evolved into a
software tool with applications across various domains. Its early conceptual adoption
was presented in 2022 [3]. The theoretical model proposed was based on the use case of
smart ports (and therefore the CI transportation sector, to which ports belong), studied
within the context of a European Research Project [61]. This original approach has been
expanded, restructured, improved, and implemented, leading to the current scalable data
management framework.
The proposed software tool has the potential to enhance the benefits extracted from
the data generated by organizations, critical infrastructures and other domains. As men-
tioned previously, the framework can hold promise for the business sector as well. The
advantages of improving data interoperability and management extend beyond the eco-
nomic gains of CI authorities, benefiting a wide range of stakeholders and leading to
improved services. Telecommunications operators, for example, play a crucial role in
this growing data-driven market, as they can leverage their extensive data resources. By
providing data or services through APIs, they offer valuable insights that aid decision-
making for external stakeholders, including public authorities, municipalities, shipping
companies, transportation agencies, cultural and trade associations, and more. Thus,
this study’s scalable data management layer can assist beneficiaries coming from several
kinds of fields.
1.3 Large Language Models and their Impact in Modern Applications
This research, previously published in [4], has been selectively modified to align with
the thematic and scientific requirements of the present PhD thesis.
1.3.1 Applications and Impact of LLMs
The rapid development of Large Language Models (LLMs) has created new opportu-
nities for research and practical applications across a variety of fields. The advanced
capabilities of these models have sparked significant interest in the global research com-
munity, leading to increased investments aimed at advancing LLM-based applications.
Notable LLMs, such as OpenAI’s GPT-4 [11] and Google’s Gemini [12], have reached a
level of sophistication where they can understand, generate, and manipulate human lan-
guage to an extraordinary degree. As a result, these models are now being applied in a
wide range of domains, including healthcare and engineering. These advancements have
set new standards in both research and industrial applications, with a primary focus on
enhancing the accuracy of the models’ outputs [13] [14].
The scope of LLM applications extends well beyond traditional natural language pro-
cessing. In scientific research, for instance, LLMs are accelerating discoveries in fields
such as medicine, biology, materials science, and forensics. In drug discovery, LLMs can
help expedite the development of new therapeutic agents by predicting molecular proper-
ties and simulating molecular interactions. In computational chemistry, these models are
being used to improve the description of chemical processes, while in materials science,
LLMs assist in designing new materials with customized properties [62]. Additionally,
integrating LLMs into forensic science could enhance modern investigations and aid law
enforcement agencies in identifying suspects [63]. These are just a few examples of how
LLMs are revolutionizing scientific research and industrial innovation, offering the po-
tential for a significant leap in more intelligent and efficient problem-solving across all
sectors [64].
Moreover, Large Language Models have shown significant potential in enhancing var-
ious aspects of Data Science. For instance, OpenAI’s GPT-4 can perform tasks such as
data cleaning, feature extraction, and even model training with minimal human involve-
ment [15] [16]. In data analysis, LLMs offer the advantage of automating the generation
of insights from datasets. These models can carry out traditional data profiling tasks,
detect anomalies, and even provide predictive analytics for future use by data engineers
and other professionals. Consequently, LLMs could be applied to processing customer
feedback, financial reports, and social media data to extract actionable insights. Their
ability to understand and interpret various types of datasets makes them valuable tools
for data scientists [17] [18].
Despite their impressive capabilities and vast potential in Data Science (as well as
other domains), LLMs face considerable challenges in providing quality analytics on large
datasets. A major limitation is their inability to process entire datasets in one go, due
to memory and computational constraints. Most LLMs can only handle a limited context
window, which can lead to the loss of context, incomplete insights, and inaccurate results
[65]. Additionally, LLMs may struggle to comprehend complex data relationships within
a dataset, as their training has primarily focused on understanding natural language
rather than complex data structures [66] [67]. These limitations can affect the quality
and precision of the analytics produced by an LLM. Therefore, while models like GPT-
4 exhibit advanced natural language processing capabilities, their application in data
analytics requires further development, particularly for complex tasks such as detailed
data profiling, in-depth analysis, and quality insight extraction [68].
In addition, the sensitive nature of many datasets further complicates the use of on-
line LLMs such as OpenAI’s GPT for data analysis and profiling. For instance, some
datasets contain personal information, proprietary business data, or sensitive research
findings, which are subject to strict regulations, including the General Data Protection
Regulation (GDPR) and various data privacy laws [69]. The security risks involved in pro-
cessing such data online raise significant concerns, including unauthorized access, data
breaches, and the potential misuse of information [70]. Moreover, ethical issues related to
the use of sensitive data require strict adherence to consent and usage guidelines, which
can be difficult to enforce when using online LLMs. For these reasons, using online LLMs
for profiling and analyzing sensitive data should be avoided.
1.3.2 Exploring Offline LLMs in Data Analysis
The concerns highlighted above stress the importance of secure, on-premise solutions
to maintain data integrity and privacy. Organizations that handle sensitive data should
still be able to take advantage of the capabilities of large language models. End-users
could greatly benefit by simply querying data in natural language and receiving visualized
outputs or direct data samples in return. For example, instead of manually selecting
specific columns and applying predefined rules through a specialized interface, end-users
and data scientists could simply write queries in natural language and view the results
on their screens as they normally would. In short, offline LLMs can significantly assist
data scientists in data analysis and insight extraction, streamlining the querying process
and reducing effort.
This leads to the conclusion that businesses, organizations, and enterprises should
consider adopting offline LLM solutions. A main part of this PhD thesis explores how
offline LLMs can enhance data analysis, by generating code for data analytics tasks based
on natural language queries. By keeping the data on-premise, the risks associated with
data integrity and security are eliminated, as data is not transmitted to online LLMs.
Moreover, this approach addresses the limitation of LLMs in processing large datasets,
due to their architectural and computational constraints. In the proposed method, the
LLM does not load entire datasets. Instead, it works with a comprehensive summary.
For each data profiling / analysis query, the LLM generates specialized code, which is
executed by a software job on the full dataset (running at the scalable data management
layer) to produce the final results. Thus, both data integrity and processing limitations
are effectively managed.
The proposed on-premise solution consists of an offline LLM that is integrated with the
outlined data management software infrastructure. The data remains within the organi-
zation’s control, ensuring security. For each dataset, the LLM is provided with concise
metadata, allowing it to understand the context of the data. Whenever the organization’s
data scientist submits a query in natural language, the LLM generates specialized code
for data profiling and analysis. This code is then applied to the data through a pipeline,
with the final results returned to the user. As previously mentioned, this approach not
only ensures data security but also enhances the efficiency of data analysis, addressing
two key limitations of LLMs. Additionally, it simplifies the process for end-users to submit
their data analysis preferences, as they can do so using natural language.
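The flow described above can be illustrated with a minimal sketch. It is not the thesis implementation: the helper names (summarize, build_prompt, fake_offline_llm, run_generated_code), the column names and the sample rows are all hypothetical, and the offline model is replaced by a stub that returns canned code, so that the example runs without any model installed. The point is only that the model receives a compact summary and a question, while the generated code runs on the full dataset.

```python
import json


def summarize(rows):
    """Build concise per-column metadata; this is all the model gets to see."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        info = {"non_null": len(values), "nulls": len(rows) - len(values)}
        if values and all(isinstance(v, (int, float)) for v in values):
            info.update(type="numeric", min=min(values), max=max(values))
        else:
            info.update(type="text", distinct=len(set(values)))
        summary[column] = info
    return summary


def build_prompt(summary, question):
    """Combine the dataset summary and the user's question into one prompt."""
    summary_json = json.dumps(summary, indent=2)
    return (
        f"Dataset summary (JSON):\n{summary_json}\n"
        f"Write a Python function analyze(rows) that answers:\n{question}"
    )


def fake_offline_llm(prompt):
    """Hypothetical stand-in for a locally hosted model: returns canned code."""
    return (
        "def analyze(rows):\n"
        "    values = [r['delay_min'] for r in rows if r['delay_min'] is not None]\n"
        "    return sum(values) / len(values)\n"
    )


def run_generated_code(code, rows):
    """Execute the generated function on the full dataset.

    exec() is used only to keep the sketch short; generated code should be
    sandboxed and reviewed in any real deployment.
    """
    namespace = {}
    exec(code, namespace)
    return namespace["analyze"](rows)


if __name__ == "__main__":
    # Made-up sample data; a real job would read the full dataset instead.
    rows = [
        {"vessel": "A", "delay_min": 10},
        {"vessel": "B", "delay_min": None},
        {"vessel": "C", "delay_min": 20},
        {"vessel": "A", "delay_min": 30},
    ]
    prompt = build_prompt(summarize(rows), "What is the average delay in minutes?")
    print(run_generated_code(fake_offline_llm(prompt), rows))
```

In this sketch the prompt contains only column-level metadata, never the rows themselves, which mirrors the separation between what the model sees and what the execution job processes.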
Part II
Main Analysis
Chapter 2
Literature Review: Assessing Similar and Ex-
isting Approaches to Data Management and AI-
Powered Analysis
2.1 Generic Approaches to Scalable Data Management and An-
alytics
2.1.1 Data Profiling and Quality Analysis
The topic of data profiling and analysis has been widely explored, with numerous
implementations, studies, and software solutions available. Data sampling has proven
to be an effective and accurate method, supporting exploratory analysis, profiling, and
model prototyping on manageable data subsets. This enables the extraction of useful
insights and the validation of predefined methods, before scaling up to full datasets.
However, when it comes to profiling large-scale datasets and applying profiling techniques
to entire tabular data collections, research efforts are more limited, with fewer comparable
approaches available.
One of the most relevant works in the area of full-scale data profiling for big data
is the approach presented by Dai in 2016 [71]. In their study, the authors conducted
an extensive review of the existing literature, exploring how different works define data
profiling, and offering updated classifications for profiling tasks. They also evaluated
various profiling tools available at the time, including both open-source and commercial
solutions. Moreover, the paper introduced a set of new data quality metrics, as well as a
method for computing data quality scores. Finally, the authors proposed a framework for
data profiling tailored to big data environments.
In their 2017 publication, Abedjan [72] underscored the vital role of data profiling
and analysis in a wide range of data-centric applications. The authors provided an in-
depth overview of existing profiling techniques and systems, while also addressing key
challenges in the field, such as developing algorithms for discovering data dependencies
and adapting profiling methods to dynamic data streams. They also stressed the grow-
ing relevance of data profiling in the big data era, a point that has become even more
significant in today’s data-driven landscape. However, their study mainly concentrated
on the conceptual and technical aspects of data profiling, without specifically examining
its application to large-scale big data environments. In fact, they identified this gap as a
direction for future investigation.
An important contribution in this area is the work by Taleb in 2019 [73], who presented
a comprehensive framework for data quality profiling, designed specifically for big data
contexts. The proposed model included multiple components, such as sampling, profiling,
exploratory quality analysis, and a repository for quality profiles. A series of experiments
were conducted to evaluate various parts of the framework, aiming to improve the overall
quality of big data.
In their 2020 survey, Liu and Zhang [74] [75] explored the application of sampling
techniques for data profiling in big data environments, analyzing their use across various
profiling categories. Their experimental results demonstrated that insights derived from
sampled data can closely approximate, or even outperform, those obtained from profiling
entire datasets. As a result, the study emphasized the growing importance of sampling
technologies in the big data domain, suggesting that sampling will likely become a core
component of scalable data processing in the future. This perspective highlights the
efficiency and practicality of sampling-based profiling.
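As a rough illustration of why sampling-based profiling can approximate full profiling, the following sketch compares two simple metrics (completeness and mean) computed on a whole synthetic column and on a 5% random sample. It does not reproduce the surveyed experiments: the data are synthetic, and the function names, the sampling fraction and the chosen metrics are assumptions made for this example only.

```python
import random


def profile(values):
    """Return two simple profiling metrics for one column."""
    present = [v for v in values if v is not None]
    return {
        "completeness": len(present) / len(values),
        "mean": sum(present) / len(present) if present else None,
    }


def sampled_profile(values, fraction, seed=0):
    """Profile a uniform random sample holding `fraction` of the values."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(values) * fraction))
    return profile(rng.sample(values, sample_size))


def make_column(size, missing_rate, seed=42):
    """Synthetic numeric column with a given share of missing values."""
    rng = random.Random(seed)
    return [
        None if rng.random() < missing_rate else rng.uniform(0, 100)
        for _ in range(size)
    ]


if __name__ == "__main__":
    column = make_column(100_000, missing_rate=0.05)
    print("full   :", profile(column))
    print("sampled:", sampled_profile(column, fraction=0.05))
```

On such well-behaved data the two profiles are expected to be close; for skewed columns or rare values a uniform sample can miss exactly the records that matter, which is one reason full-dataset profiling remains relevant.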
Complementing this line of research, Elbaghazaoui conducted an extensive theoretical
analysis in 2021 [76], where the authors highlighted the importance of data profiling, and
examined its role specifically in big data contexts. Their work reviewed a wide range of
use cases, systems, and methodologies, while identifying key challenges to be addressed
in future research. Their findings offer valuable theoretical foundations for developing
scalable profiling solutions. It should be noted that the researchers did not present a
comprehensive method for effectively profiling entire large-scale datasets.
Finally, an analysis comparable to that of Elbaghazaoui was conducted by Couto in
2022 [77]. The authors emphasized the critical need for data scientists to develop a
deep understanding of the data they work with, an understanding that must be
continuously updated in response to the dynamic and evolving nature of big data ecosystems.
Motivated by this observation, they carried out a thorough review of the existing litera-
ture, focusing on the application of data profiling within big data contexts. Their survey
examined emerging trends in scalable data profiling, classifying and evaluating recent
advancements in the field. This included an analysis of widely adopted tools, real-world
scenarios, representative datasets, and the types of metadata and information extracted
through profiling. The study concluded with a discussion on future research challenges
and directions. As with the work of Elbaghazaoui, Couto’s survey provides a valuable
theoretical foundation that can significantly support the initial phases of developing a
robust and scalable data profiling framework, aimed at quality analytics extraction.
2.1.2 User-Defined Quality Rules
When the data sources are stable and the use case is well-known, quality measure-
ment can be effectively handled using pre-configured quality requirements, based on
common dimensions and metrics. However, in cases where different users work with
diverse data sources and have varying needs, it is more beneficial to allow the creation of
custom analytics rules and metrics in a flexible, dynamic manner.
Thus, it is important to consider the ability of the end-user (whether a company, data
engineer, or consumer) to define or submit their own quality standards and rules with
ease, thereby gaining deeper insights into specific quality aspects they deem crucial. In
a 2020 publication, Anastasija Nikiforova [78] introduces a data object-driven approach
to evaluating data quality. She proposes replacing the concept of "quality dimensions"
with "quality requirements," which effectively adds an additional layer to the traditional
approach. These requirements are user-defined, tailored to the specific purposes of the
data users.
Similarly, in a 2024 study by Altendeitering [79], the researchers present a software
reference architecture for data quality tools, based on a comprehensive review of ten differ-
ent (and pre-existing) data quality tools. This architecture includes a module for creating
"user-defined rules" within an interaction layer. These rules serve as an extension of
standard data quality rules, but incorporate domain-specific constraints. The "Rule Gen-
eration" module is responsible for executing these user-defined constraints, enhancing
the flexibility and customization of data quality evaluation.
Allowing users to contribute their perspective to the data quality evaluation process is
a clear advantage, as it enables the achievement of the quality levels needed for specific
scenarios. When developing a comprehensive and dynamic software tool, it is crucial
to consider the user as a key stakeholder in the data quality evaluation process. The
definition of user-defined quality rules can be formalized and standardized, so that they
become an integral part of the data quality analysis extraction pipeline, and so that submitted
analytics queries can be transparently understood and processed by the software platform.
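One possible way to formalize user-defined quality rules is to express each rule as a small declarative record that a generic evaluator can process. The sketch below is only an assumption about how such a format could look: the rule fields (name, column, check, arg), the three supported checks and the sample rows are hypothetical and are not taken from the cited works or from the framework of this thesis.

```python
# Supported checks: each takes a cell value and an optional argument.
CHECKS = {
    "not_null": lambda value, _arg: value is not None,
    "between": lambda value, bounds: (
        value is not None and bounds[0] <= value <= bounds[1]
    ),
    "in_set": lambda value, allowed: value in allowed,
}


def evaluate_rules(rows, rules):
    """Return, per rule name, the fraction of rows that satisfy the rule."""
    if not rows:
        raise ValueError("cannot evaluate rules on an empty dataset")
    report = {}
    for rule in rules:
        check = CHECKS[rule["check"]]
        passed = sum(
            1 for row in rows if check(row.get(rule["column"]), rule.get("arg"))
        )
        report[rule["name"]] = passed / len(rows)
    return report


if __name__ == "__main__":
    # Made-up sample data and rules, as a user might submit them.
    rows = [
        {"port": "PORT_A", "teu": 120},
        {"port": "PORT_B", "teu": -5},
        {"port": None, "teu": 300},
        {"port": "PORT_A", "teu": None},
    ]
    rules = [
        {"name": "port_present", "column": "port", "check": "not_null"},
        {"name": "teu_in_range", "column": "teu", "check": "between", "arg": (0, 1000)},
        {"name": "known_port", "column": "port", "check": "in_set",
         "arg": {"PORT_A", "PORT_B"}},
    ]
    print(evaluate_rules(rows, rules))
```

Because the rules are plain data rather than code, they can be validated, stored and shared between users, and new checks can be added to the evaluator without changing rules that already exist.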
2.2 Related Scalable Data Management Proposals and Solu-
tions
2.2.1 EU Research Projects and Initiatives
The primary testing of the proposed scalable data management framework has been
conducted in ports. Thus, related projects should be linked to the maritime industry.
Numerous initiatives have been undertaken at the European level, with some remaining
actively operational, all aimed at creating a comprehensive ecosystem centered around
ports. European entities and associations, such as ESPO [80], IAPH [81], and AIVP [82],
have led efforts to establish connections and advocate for port authorities, while promot-
ing relationships with the European Union and other nations. Their critical role in global
trade places them at the forefront of smart port development. Additionally, in collabora-
tion with several EU ports, ENISA [83] has published a report providing valuable insights
into the cybersecurity strategies employed by port authorities and terminal operators,
further enhancing their contributions.
Moreover, the European Union (EU), especially through Horizon research programs,
has allocated significant funding to a range of projects focused on the future of EU ports.
These projects are designed to create comprehensive management platforms [84], tailored
to maritime and port environments, with the overarching goal of promoting interoper-
ability and advancing ports into cognitive, intelligent entities. Notable examples, such as
the SmartCities project, have culminated in the Marketplace of the European Innovation
Partnership on Smart Cities and Communities [85]. Projects like e-Mar, FLAGSHIP, and
INMARE are dedicated to improving maritime transportation, while the MASS initiative
focuses on enhancing human conduct aboard ships, particularly in emergency scenarios.
MARINE-ABC showcases the potential of mobile ship-to-shore communication.
Meanwhile, the BigDataStack project [86] aims to streamline cluster management
for data-related operations. However, it does not provide a comprehensive solution for
big data interoperability, harmonization, and management. The SmartShip initiative
[87] focuses on developing data analytics-based decision support systems, along with
an optimization platform built on the principles of a circular economy. These collective
efforts, alongside other similar initiatives, highlight the shared goal among the research
community, port authorities, shipping entities, and supply companies: the creation of
an ecosystem enriched with advanced data-centric services, ultimately benefiting both
critical domains (acting as data providers) and local communities. In line with this mo-
mentum, the European maritime sector is advancing to deliver seamlessly integrated,
high-quality services as part of the broader European transportation network.
2.2.2 Research Work on Scientific Literature
Apart from initiatives of the European Union (through research projects), several pro-
posals for scalable data analysis and management in critical infrastructure sectors have
been published in recent years. In 2014, Baek [19] proposed a cloud-based framework
for managing big data in smart grids, aiming to optimize the generation, distribution,
and consumption of electricity. Smart grids, which integrate digital technology to en-
hance efficiency and sustainability, generate large data streams from intelligent devices
like power assets and smart meters. The framework introduced a hierarchical network of
cloud centers, in order to provide computing services for data management and analysis.
While cloud computing offers advantages such as scalability, cost-efficiency, and energy
conservation, the proposal did not include a performance evaluation, focusing mainly on
data handling and cleaning in smart grids.
Lockow (2015) [20] explored the generation of big data in the automotive industry,
which spans CI sectors like transportation and critical manufacturing, and the need for
effective management of large data volumes. The paper surveys use cases and applica-
tions of Apache Hadoop [21] in the automotive sector. Hadoop’s scalability in computing
and storage has made it a standard for scalable data processing, with a robust ecosystem
supporting parallel, in-memory, and stream processing, SQL/NoSQL engines, and ma-
chine learning. Key topics addressed include suitable applications for Hadoop, managing
diverse frameworks in a multi-tenant cluster, integration with relational systems, security
considerations, and performance benchmarks.
A study conducted by Dinov in 2016 [22] examined the challenges and opportunities in
modeling and interpreting big healthcare data, particularly within the healthcare and
public health CI sectors. Managing and processing extensive healthcare data is costly
and complex. The study outlines the challenges of integrating complex healthcare data
with advanced analytical tools and distributed computing. Using examples like imaging,
genetic data, and healthcare information, the paper demonstrates how heterogeneous
datasets can be processed using cloud services, automated classification methods, and
open-science protocols. While advancements are made, the author emphasizes the need
for continuous development of innovative technologies to optimize data management,
with the study suggesting a multifaceted approach involving proprietary, open-source,
and community-driven technologies.
Kaur (2019) [23] proposed a big data-capable framework for energy-efficient, software-
defined data centers (SDDCs) in IoT environments. SDDCs use virtualization and intel-
ligent management to reduce energy consumption, while maintaining performance and
scalability. With the rapid growth of IoT and the resulting influx of big data, real-time data
analysis and processing are crucial for cloud platforms. However, data centers, essential
to cloud infrastructure, face high energy costs and environmental impact. The paper
introduces an SDDC-based model to optimize virtual machine deployment and network
bandwidth, aiming to reduce energy consumption while ensuring quality of service. The
proposal does not specifically focus on scalable data management and harmonization,
but the proposed approach is relevant to the current work.
Bhat’s publication in 2021 [88] highlights the challenges of managing agricultural and
food supply chain data using blockchain technologies. It proposes a future architecture
for this purpose. The study addresses issues like storage and scalability optimization,
interoperability, security, and privacy concerns, particularly in single-chain systems. It
also explores security threats associated with IoT infrastructure, as well as blockchain-
based defense mechanisms. The paper concludes by discussing the key features of the
proposed architecture and future considerations. While blockchain is recognized for
enhancing transparency in agri-food supply chains, the paper emphasizes the importance
of assessing the limitations of these technologies, such as IoT, RFID, and NFC, to ensure
their successful adoption and scalability.
Finally, Donta (2023) [24] explores the use of Distributed Computing Continuum Sys-
tems (DCCS) for big data processing, focusing on data from edge devices like IoT and
sensor nodes. The study addresses challenges related to diverse data formats and at-
tributes, leveraging big data analytics tools. It also outlines existing monitoring tools
and their features. The authors propose a governance and sustainable architecture for
DCCS, which includes three stages: system data analysis, knowledge-based monitoring
and prediction, and proactive issue resolution. Although the proposal is valuable, it does
not address data harmonization and interoperability issues, which have been discussed
in other studies.
Given the extensive initiatives undertaken by the European Union, particularly through
Horizon research programs, and of course other research teams globally, there is a grow-
ing need for the development of a framework focused on effectively managing data across
various domains. The maritime industry serves as one key use case. This need under-
scores the importance of creating advanced management platforms, designed to address
the complex dynamics of sectors like maritime and port environments, as well as institu-
tions and business entities. The introduction of a scalable data management framework
has the potential to enhance the data lifecycle in critical infrastructure sectors and cor-
porate organizations. By leveraging advanced data processing and analytics, this thesis’s
framework could play an assistive role in improving efficiency, resilience, and informed
decision-making within CI and business domains.
2.3 Applications of LLMs for Code Generation and Data Analy-
sis Operations
Recently, several proposals have been made to assess the code generation capabilities
of LLMs, as discussed in this section. However, the specific task of generating code
for data analysis and profiling using offline models has not been thoroughly explored
within the research community. As a result, direct comparisons with existing studies are
difficult, as no prior work has specifically focused on this approach. Alongside a review of
proposed systems and studies on LLMs and code generation, available market solutions
are also presented. Furthermore, works where LLMs are applied to data analysis tasks are
highlighted. Given the ongoing and extensive research into the capabilities of language
models, it is expected that additional studies will emerge in the future.
2.3.1 Code Generation Techniques
This subsection presents existing works relevant to the scope of the current study,
focusing primarily on code generation using LLMs. These studies highlight various pro-
posals and approaches for generating functional code, emphasizing the integration of
LLMs with additional software techniques, and demonstrating the increasing potential of
these models in diverse coding applications.
Feng (2023) [25] published a study on the code generation capabilities of ChatGPT.
Focusing on OpenAI’s LLM, the research presents a scalable framework for crowdsourcing
social data, in order to evaluate the code generation abilities of large language models.
The framework utilizes data from several social media platforms, and supports tasks such
as academic assignment solving, interview preparation, and code debugging. While this
work relates to the current thesis in terms of code generation by language models, it does
not focus on data analytics tasks or the use of offline LLMs. In another study, Gu (2023)
[26] introduces a code generation approach for compiler testing using a language model,
aiming to improve both the quality and quantity of the generated code. The technique
follows a filter strategy, cleaning the source code to create a high-quality dataset for model
training. This approach explores the capabilities of encoder-decoder models to generate
testing-oriented code, aligning with the current thesis’s focus on leveraging AI tools to
investigate code generation abilities.
An intriguing article by Ross (2023) [27] presents a prototype system designed to ex-
plore the effectiveness of conversational interactions between professionals and LLMs,
and to evaluate how software engineers engage in dialogue with a code-fluent LLM. The
findings suggest that future frameworks with LLM-powered features could become highly
valuable tools for software engineers, akin to the goals of the current thesis. In a pub-
lication by Soliman (2024) [28], the research team introduces hybrid models for code
generation through the integrated use of other pre-trained language models. The per-
formance of these hybrid models is evaluated using two widely used datasets [29] [30],
and benchmarked against existing state-of-the-art models. The study aims to assess the
feasibility of improving the precision and efficiency of LLM code generation, particularly
in complex coding scenarios. While this work does not specifically focus on offline LLMs,
it is relevant to the current study, as it addresses the broader context of code generation.
This current thesis extends the discussion by examining a use case scenario centered on
data analytics operations.
A paper by Pinna (2024) [31] examines the use of large language models for automatic
code generation, using problem descriptions as input queries. The study focuses on ad-
dressing the issue of LLMs generating incorrect code. The results show an improvement
in code quality, emphasizing the potential of combining LLMs with other techniques to en-
hance the accuracy of the generated code. Building on this idea, the current study adopts
the approach of integrating LLMs with additional software methodologies. Other signifi-
cant contributions include proposals for benchmarks to evaluate the code generation pro-
cess of LLMs. Yu (2024) [32] introduces a benchmark designed to assess code generation
models, particularly focusing on non-standalone functions, which are often overlooked
in existing benchmarks, but constitute the majority of functions in open-source projects.
Additionally, Omari (2024) [33] explores the capabilities of a large language model (mainly
ChatGPT) in detecting and fixing bugs in simple Python programs, further contributing
to the body of research on LLMs in code generation tasks.
2.3.2 Literature Surveys
This subsection presents key literature surveys that, while not directly aligned with
the specific scope of the current study, provide valuable context. These studies explore the
role of language models in transforming tasks such as broad-spectrum code generation,
debugging, and design, while also highlighting their strengths, limitations, and future
potential.
Fan (2023) [34] published a survey on the potential of large language models in soft-
ware engineering, identifying several research challenges related to using LLMs to assist
software engineers with various technical tasks, including coding, design, bug fixing, and
refactoring. The report emphasizes the creative capabilities LLMs bring to these tasks.
It also stresses the importance of future hybrid approaches that combine LLMs with tra-
ditional software engineering methodologies. Wong (2023) [35] conducted a review of
Natural Language Processing (NLP) techniques (primarily LLMs trained on large codebases)
in AI-assisted programming operations. The review highlights the significant role
of LLMs in a variety of applications, such as code generation, completion, translation, re-
finement, summarization, defect detection, and clone detection. Additionally, it mentions
examples of AI-enhanced tools, including GitHub Copilot [36] and DeepMind AlphaCode
[37].
Another review on the code generation capabilities of large language models was con-
ducted by Wang (2023) [38]. This study examines recent research focused on code gener-
ation using LLMs, with particular attention to how the generated code is evaluated, and
how LLMs are applied in software engineering tasks. The review highlights the need for
further investigation to address current gaps in the evaluation of LLM-generated code.
Similarly, Liu (2024) [39] carried out a comprehensive literature review on recent devel-
opments in deep learning-based code generation. The study systematically evaluated a
range of recent scientific publications, and applied a structured methodology to analyze
them. Its goal was to provide insights into code generation using LLMs, identify existing
knowledge gaps, and offer guidance for future research in this evolving field.
2.3.3 Other Applications and Market Tools
This subsection reviews recent efforts to fine-tune LLMs, along with practical appli-
cations and commercial tools that build on these models to deliver innovative solutions.
These developments highlight the adaptability of LLMs across various areas, including
hardware design, database optimization, personalized content creation, and AI-driven
data analysis, areas closely related to the scope of this study.
Initial research on adapting LLMs specifically for code generation has started to
emerge. For instance, Thakur (2024) [89] investigates how LLMs can support hardware
design by completing incomplete Verilog code, a language commonly used in digital circuit
design. Similarly, Mu (2024) [90] introduces a method that enhances code generation,
by allowing an LLM to detect unclear requirements and ask follow-up questions before
producing code. Their results show that this approach improves the code generation
capabilities of online LLMs, like GPT-4, across several benchmark datasets.
Rau (2024) [91] revisits the Unsupervised Passage Retrieval (UPR) method initially
proposed by Sachan (2022) [92], aiming to validate and extend its effectiveness. UPR
leverages the generative capabilities of large language models to create zero-shot ques-
tions, which are then used to re-rank text passages and improve retrieval accuracy. Rau’s
study confirms the method’s strong performance on the BEIR benchmark [93], while also
broadening the evaluation to include the 2019 and 2020 TREC Deep Learning tracks [94].
In a related line of work, Qu (2024) [95] explores the use of large language models for
generating personalized product reviews in e-commerce. The proposed method introduces
a graph-enhanced prompt learning strategy that improves the diversity and relevance of
generated reviews, by utilizing a pre-trained language model (PLM). Meanwhile, Zhou
(2024) [95] presents a framework that applies LLMs to optimize database systems. This
work addresses key challenges, such as constructing effective prompts, modeling both
logical and physical aspects of databases, and preserving data privacy during optimiza-
tion.
In addition to existing research and published approaches, AI-powered tools for data
profiling and analysis are increasingly being integrated into modern data workflows.
These tools are also becoming available as deployment options within major platforms.
One example is OpenAI’s Codex [96], a fine-tuned version of GPT-3 that powers GitHub
Copilot [36]. Codex demonstrates how large language models can assist in generating code
for data profiling and analysis, particularly in cloud-based environments. It operates by
processing user inputs on remote servers equipped with high-performance computing
resources, offering code suggestions that can enhance productivity.
However, this cloud-based approach may raise privacy concerns, especially when
sensitive data is transmitted to external servers for processing, potentially compromising
data confidentiality. Beyond Codex, other tools built on GPT models, most of which
also rely on cloud infrastructure, offer natural language processing capabilities that
support various data analysis tasks. Although these tools provide valuable functionality,
their reliance on cloud processing introduces potential risks related to data privacy and
security.
In contrast to cloud-only solutions, several data analytics platforms that incorporate
AI-powered features offer greater deployment flexibility, which can help address concerns
related to data privacy. Platforms such as DataRobot [97], ThoughtSpot [98], Tableau
[99], and Microsoft Power BI [100] provide options for both cloud-based and on-premises
deployment. For instance, DataRobot supports on-premises installations, allowing orga-
nizations handling sensitive information to benefit from AI capabilities, while keeping their
data within their own infrastructure. ThoughtSpot and Tableau also offer on-premises so-
lutions, giving users more control over their data by enabling local processing. Likewise,
Microsoft Power BI supports a hybrid model, allowing users to choose between cloud and
on-premises environments based on their specific privacy and security requirements.
2.3.4 Prompt Engineering
Prompt engineering is a key component of the system proposed in this thesis and
deserves particular attention. It involves designing and refining the inputs given to a gen-
erative AI model, in order to ensure that its outputs are accurate, relevant, and aligned
with the user’s intent [40] [41]. By carefully shaping prompts, developers can guide the
model to better understand not only the language used, but also the context and pur-
pose of each query. This approach has proven to be a practical method for improving
the quality of model responses in various applications, including software development
and customer support systems. Effective prompt engineering can significantly reduce the
need for extensive post-processing later in the workflow, contributing to more efficient
and streamlined AI operations [42] [43]. Well-constructed prompts can help AI systems
produce more accurate insights, generate useful code fragments, or even simulate cyber-
security scenarios, depending on the task at hand.
Various prompting techniques have been developed to enable more advanced inter-
actions with generative AI models. Zero-shot prompting involves assigning a task to the
model without providing examples, relying entirely on its pre-trained knowledge. Few-
shot prompting enhances accuracy by supplying a small number of examples, aiming
to guide the model’s responses when handling similar tasks. Chain-of-thought prompt-
ing encourages the model to break down complex problems into a sequence of logical
steps, supporting reasoning and multi-step problem-solving. These techniques allow AI
systems to perform tasks beyond their original training and to follow more intricate rea-
soning processes. Ongoing research continues to expand the field of prompt engineering,
with recent studies proposing new strategies and frameworks to improve its effectiveness
[44] [45] [46]. As generative AI models become more powerful and versatile, prompt engi-
neering is expected to play an increasingly important role in maximizing their capabilities.
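The three prompting styles described above can be illustrated with a minimal sketch. The task wording, the example pair, and the helper names (`zero_shot`, `few_shot`, `chain_of_thought`) are illustrative assumptions and are not taken from the system described in this thesis; the functions only build prompt strings and do not call any model.

```python
def zero_shot(task: str) -> str:
    """Zero-shot: state the task only, with no worked examples."""
    return f"Task: {task}\nAnswer:"


def few_shot(task: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend a small number of solved examples before the new task."""
    shots = "\n".join(f"Task: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\nTask: {task}\nAnswer:"


def chain_of_thought(task: str) -> str:
    """Chain-of-thought: ask for intermediate reasoning steps before the answer."""
    return (
        f"Task: {task}\n"
        "Think through the problem step by step, then give the final answer.\n"
        "Answer:"
    )


# Hypothetical data-quality request, used only for illustration.
task = "Count the rows where the 'temperature' column is missing."
examples = [
    ("Count the rows where 'age' is negative.", "df[df['age'] < 0].shape[0]"),
]

zs = zero_shot(task)
fs = few_shot(task, examples)
cot = chain_of_thought(task)
print(fs)
```

The only difference between the three variants is the text that surrounds the task, which is why prompt construction can be kept separate from the model that eventually consumes the prompt.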
In summary, the field of LLMs is rapidly advancing, with ongoing research exploring
a wide range of capabilities. Current studies address topics such as general-purpose
code generation, compiler testing, and the integration of hybrid model architectures,
highlighting the dynamic nature of AI-assisted software development. Surveys in the
literature also underline the potential of LLMs in software engineering tasks. Beyond
programming, applications extend to areas like hardware design automation and data
analytics powered by AI, supported by commercial tools that offer both cloud-based and
on-premises deployment options. Additionally, prompt engineering techniques continue
to enhance the performance of LLMs by improving their accuracy and efficiency. As this
field progresses, future research is expected to investigate
new directions and expand the practical use of LLMs across various domains.
Chapter 3
Study Design, Methodology, and Implementa-
tion: A Complete Analysis of the Framework
3.1 Initial Generic Overview
3.1.1 Outline
The proposed software framework is conceptualized to offer data professionals meaningful
insights into the quality of a chosen dataset. Users would be able to select the dataset
they wish to analyze, after which the system would perform a series of operations to gen-
erate quality-related metrics. While data quality assessment is a common task in the
field of data science, the current study distinguishes itself from most existing tools, by
enabling end-users to submit their own analysis and profiling queries using natural lan-
guage. Additionally, it is capable of handling datasets of varying sizes efficiently, offering
a modern approach to traditional data profiling.
Figure 3.1 illustrates the conceptual architecture of this thesis’s framework. The
process would begin when the data recipient or end-user selects a dataset from a designated
data lake for quality analysis. Based on the user’s preference, the system would retrieve
either the full dataset or a sample from the backend. Once the dataset is successfully
loaded, the first processing stage, known as the “Data Profiler,” would be activated. This
component aims to perform three key tasks: i) it applies a set of predefined quality rules
to the dataset, ii) it generates quality-related metrics based on those rules, and iii) it
delivers the resulting analytics to the end-user.
At this point, the user gains an overview of the dataset’s quality through the applica-
tion of general rules. If a more tailored analysis is desired, the user can define custom
quality queries using a simple prompt interface, expressing their quality metric in nat-
ural language. These user-defined rules, once configured, would be sent back to the
middleware, triggering the system’s second processing stage, the “Data Evaluator.” This
component aims to extend the initial profile generated by the Data Profiler, by incorpo-
rating the newly provided, user-specific quality queries. If necessary, the Data Evaluator
would reprocess the dataset and then: i) apply the custom rules, ii) perform additional
filtering based on those rules, and iii) return the enhanced quality insights to the frontend,
in the form of a structured output.
Figure 3.1. The initial architecture of the proposed scalable data management and analysis
framework, along with its two operational flows, “Data Profiler” and “Data Evaluator” [2].
The initial objective during the conceptual phase of this study was to design a software
asset as a lightweight and easily deployable tool for end-users. To achieve this, the system
should operate efficiently, while requiring minimal computational resources, such as
those typically available on consumer-grade machines. The development target includes
support for systems with an average CPU of up to six cores and 5 GiB of RAM, hardware
commonly found in budget-friendly personal computers.
A performance benchmark has been established, setting the goal for the software to
process and generate a quality profile for big data at a rate of 1 GB per minute (1 GB/min).
This benchmark, when met under the defined hardware constraints, will demonstrate the
tool’s capacity to efficiently analyze complete large-scale datasets, reinforcing its
suitability for widespread deployment. To meet this performance target, the study was
extended to draw on insights from a scalable processing framework tested on critical
infrastructure datasets [1], as presented in the next section. As mentioned before, both
the present work and the implemented data management system are conceptually grounded
in previous research [3].
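To make the benchmark concrete, the short sketch below converts the 1 GB/min target into the sustained rate it implies and into a time budget for a dataset of a given size. It assumes decimal units (1 GB = 1000 MB), which the text does not specify, and the 12.5 GB dataset size is purely hypothetical.

```python
def required_throughput_mb_per_s(gb_per_minute: float = 1.0, mb_per_gb: int = 1000) -> float:
    """Convert a GB/min target into the sustained MB/s rate it implies."""
    return gb_per_minute * mb_per_gb / 60.0


def time_budget_minutes(dataset_gb: float, gb_per_minute: float = 1.0) -> float:
    """Minutes allowed for profiling a dataset of the given size at the target rate."""
    return dataset_gb / gb_per_minute


rate = required_throughput_mb_per_s()   # sustained rate for the 1 GB/min goal
budget = time_budget_minutes(12.5)      # hypothetical 12.5 GB dataset
print(f"{rate:.2f} MB/s sustained, {budget:.1f} min budget")
```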
3.1.2 Framework Operational Flows
Data Profiler
The Data Profiler aims to serve as the initial component in the proposed framework’s
operational workflow. Its execution would begin as soon as the end-user selects a dataset
for profiling. Depending on the user’s preference (whether to analyze the complete
dataset or a sample), the system would retrieve the corresponding data from a designated
repository. The Data Profiler would then initiate its process, following a series of
structured steps, as illustrated in Figure 3.1:
1. Perform basic pre-processing on the incoming dataset to prepare it for analysis.
2. Load a predefined set of general quality rules or key performance indicators (KPIs),
by parsing a configuration file containing these definitions.
3. Apply each rule to the appropriate columns in the dataset. For example,
timestamp-related rules are applied to timestamp fields. Rules intended for all columns
are applied accordingly.
4. Generate profiling metrics based on the applied rules, compiling the necessary
information to construct the dataset’s quality profile.
5. Assemble the profile template using the extracted quality analytics.
After generating and forwarding the profile template to the end-user, the Data Profiler
would conclude its operation. It is conceptualized to run once per dataset selection during
a given session, and would be reactivated only when the user initiated profiling on a new
dataset.
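The five steps of the Data Profiler can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: plain Python lists stand in for a PySpark dataframe, the configuration is an inline JSON string rather than a file, and the rule names (`completeness`, `valid_timestamp`) and the `applies_to` key are hypothetical.

```python
import json
from datetime import datetime

# Hypothetical rule configuration; in the described design this would be a file.
CONFIG = json.loads("""
{"rules": [
  {"name": "completeness", "applies_to": "all"},
  {"name": "valid_timestamp", "applies_to": "timestamp"}
]}
""")


def completeness(values):
    """Share of non-empty cells in a column."""
    return sum(v not in (None, "") for v in values) / len(values)


def valid_timestamp(values):
    """Share of cells that parse as ISO 8601 timestamps."""
    def ok(v):
        try:
            datetime.fromisoformat(v)
            return True
        except (TypeError, ValueError):
            return False
    return sum(ok(v) for v in values) / len(values)


RULES = {"completeness": completeness, "valid_timestamp": valid_timestamp}


def profile(rows, column_types):
    """Apply each configured rule to the columns it targets and collect metrics."""
    # Step 1: basic pre-processing (here: trim surrounding whitespace in strings).
    rows = [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()} for r in rows]
    report = {}
    for rule in CONFIG["rules"]:                  # Step 2: rules loaded from the config
        for col, ctype in column_types.items():   # Step 3: match rules to columns
            if rule["applies_to"] in ("all", ctype):
                values = [r.get(col) for r in rows]
                metric = RULES[rule["name"]](values)          # Step 4: compute the metric
                report.setdefault(col, {})[rule["name"]] = metric
    return report                                 # Step 5: the assembled profile


rows = [
    {"ts": "2024-01-01T00:00:00", "temp": "21.5 "},
    {"ts": "not a date", "temp": ""},
]
result = profile(rows, {"ts": "timestamp", "temp": "number"})
print(result)
```

Keeping the rule definitions in configuration and the rule logic in a lookup table means that new general rules can be added without touching the profiling loop.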
Data Evaluator
The Data Evaluator aims to serve as the second component in the proposed framework’s
processing workflow. It would be activated only after the Data Profiler had completed
its operation and the end-user had opted to apply custom, user-defined queries, in
order to enhance the dataset’s quality analysis. When the user decided to gain
deeper insights through personalized criteria, the Data Evaluator would assume control.
Upon receiving analysis queries expressed in natural language by the end-user, the
Data Evaluator would carry out the following tasks, as illustrated in Figure 3.1:
1. Determine whether the dataset requires re-processing. If so, perform a basic
pre-processing step, similar to that of the Data Profiler.
2. Retrieve the new quality queries defined by the user, which are sent to the component
via the platform’s frontend.
3. Apply each rule to the relevant fields in the dataset. As with the Data Profiler, if a
rule is applicable to all columns, it will be applied universally. However, the queries
would have to be transformed from their natural language form, to an executable
format first.
4. Perform additional filtering based on the user-defined queries, and extract new
quality metrics to extend the existing dataset profile.
5. Update the previously generated profile template by incorporating the new analytics.
Unlike the Data Profiler, which would execute only once per session when the user
selected a dataset, the Data Evaluator could be invoked multiple times within the same
session. This flexibility would allow users to continuously refine the quality analysis, by
submitting new rules based on feedback received after each evaluation.
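The difference in lifecycle between the two components can be sketched as a small session object. The class and method names are hypothetical, and the `predicate` argument stands in for the executable form of a natural language query; how that translation is performed is outside this sketch.

```python
class ProfilingSession:
    """Tracks one dataset selection: the profiler runs once, the evaluator many times."""

    def __init__(self, rows):
        self.rows = rows
        self.profile = None      # filled by the single profiler run
        self.evaluations = 0

    def run_profiler(self):
        if self.profile is not None:
            raise RuntimeError("profiler already ran for this dataset selection")
        self.profile = {"row_count": len(self.rows)}
        return self.profile

    def run_evaluator(self, name, predicate):
        """Extend the existing profile with one user-defined metric."""
        if self.profile is None:
            raise RuntimeError("run the profiler before the evaluator")
        self.profile[name] = sum(1 for r in self.rows if predicate(r))
        self.evaluations += 1
        return self.profile


session = ProfilingSession([{"temp": 21.5}, {"temp": -100.0}, {"temp": 19.0}])
session.run_profiler()
session.run_evaluator("temp_below_zero", lambda r: r["temp"] < 0)
session.run_evaluator("temp_above_20", lambda r: r["temp"] > 20)
print(session.profile)
```

Each evaluator call adds to the same profile rather than replacing it, which mirrors the idea of iteratively refining the quality analysis within a session.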
3.1.3 Technologies To Use
To ensure efficient and reliable management of the entire process, the proposed data
management and analysis asset should be developed on top of the Apache Spark framework.
This decision supports one of the study’s objectives: to perform profiling operations
across a wide range of data volumes, from small datasets and samples to large-scale data.
The Python programming language should be used to implement both the Data Profiler
and the Data Evaluator, leveraging its extensive ecosystem of data science libraries and
processing tools.
The implementation should involve setting up a standalone, but scalable, Apache
Spark instance on a single physical machine. Within this environment, the Data Profiler
and Data Evaluator will be executed as PySpark jobs, enabling seamless communication
with end-users, and maintaining high availability. This setup is designed to provide
fast processing times, even for larger datasets. Importantly, the software infrastructure
should be developed as a general-purpose module, capable of supporting tabular datasets
of various formats and structures.
During the selection of the underlying processing framework, the study considered
Apache Spark [47], Apache Flink [101], and Apache Storm [102]. All three are well-
established tools in the data processing domain. After evaluating their capabilities in
relation to the proposed system’s requirements, Spark was selected as the most appro-
priate choice. While Storm excels in real-time data processing [103], the current system
is intended for profiling existing or historical datasets, where batch processing is more
suitable.
Both Spark and Flink offer strong performance for batch processing tasks, with their
effectiveness varying depending on factors such as dataset size, data type, and workload
patterns [104] [105]. However, Spark stands out due to its better scalability, broader
analytical support, faster execution in many scenarios, and strong community and doc-
umentation support [48]. For these reasons, Spark should be ultimately chosen as the
foundation for the future system’s development.
In the development phase of the system, Python [106] should be chosen as the primary
programming language. Python’s clean and concise syntax promotes rapid development
and easy maintenance. Its extensive collection of libraries and frameworks, such as
NumPy, Pandas, and SciPy, offers powerful tools for data manipulation and analysis,
which are essential for Big Data processing and other related tasks [107]. Additionally,
Python’s compatibility with Apache Spark, via PySpark [108] (Python’s API for Spark),
enhances its suitability for parallel and distributed computing. This integration allows
developers to utilize Spark’s powerful capabilities for large-scale data processing, while
benefiting from Python’s simplicity and ease of use. Consequently, Python should be
selected as the preferred language for the development of the proposed system.
3.2 The Scalable Data Management Software Infrastructure
3.2.1 Data Processing and Virtualization
Once the desired data have entered the organization’s database, the next step is to
make these data accessible as a service. This capability should allow developers to build
cognitive and data-driven applications. To support this functionality, a data processing
and virtualization middleware is required. This middleware should serve as an inter-
mediary layer that connects data providers with data consumers, enabling efficient and
seamless interaction. This role is fulfilled by the scalable data management layer
proposed in this thesis.
Data virtualization is a method of data integration that provides access to information
through a virtual service layer, independent of the physical location of the underlying data
sources. This approach enables applications to retrieve data from multiple heterogeneous
systems through a single access point, offering a unified and abstracted view for querying.
In addition to unifying access, the virtualization layer also handles data transformation
and processing, ensuring that the data are appropriately prepared for use. One of the
key challenges in data virtualization involves managing the integration of diverse storage
systems, such as key-value stores, document databases, and relational databases. Each
system has distinct characteristics, making seamless integration complex. Furthermore,
data-intensive applications that rely on virtualized sources require guarantees regarding
performance, availability, and data quality, placing additional demands on the virtualiza-
tion system.
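The idea of a single access point over heterogeneous stores can be illustrated with a small facade. This is only a sketch under assumptions: the `VirtualLayer` class is hypothetical, a Python dictionary and a list stand in for a key-value store and a document database, and an in-memory SQLite table stands in for a relational database.

```python
import sqlite3


class VirtualLayer:
    """Single access point that hides where each dataset physically lives."""

    def __init__(self):
        self._readers = {}

    def register(self, name, reader):
        """`reader` is any zero-argument callable returning a list of row dicts."""
        self._readers[name] = reader

    def query(self, name, predicate=lambda row: True):
        """Return the rows of the named dataset that satisfy the predicate."""
        return [row for row in self._readers[name]() if predicate(row)]


# Three stand-in back ends: a key-value store, a document store, a relational table.
key_value = {"s1": 21.5, "s2": 19.0}
documents = [{"sensor": "s1", "site": "A"}, {"sensor": "s2", "site": "B"}]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alarms (sensor TEXT, level INTEGER)")
db.execute("INSERT INTO alarms VALUES ('s2', 3)")

layer = VirtualLayer()
layer.register("readings", lambda: [{"sensor": k, "value": v} for k, v in key_value.items()])
layer.register("sensors", lambda: list(documents))
layer.register(
    "alarms",
    lambda: [{"sensor": s, "level": lvl} for s, lvl in db.execute("SELECT sensor, level FROM alarms")],
)

warm = layer.query("readings", lambda row: row["value"] > 20)
alarms = layer.query("alarms")
print(warm, alarms)
```

The consumer issues the same `query` call regardless of the back end, which is the abstraction that a virtualization layer is meant to provide.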
The proposed scalable data management layer is designed to address these challenges
and support the platform’s data interoperability goals. Its primary responsibility is to sat-
isfy the data quality requirements specific to each source, whether a critical infrastructure
or a business organization. The layer ensures the proper preparation of data inputs from
various sources within the project’s architecture, maintains relevant metadata for all in-
coming data feeds, and makes the cleaned and processed datasets available to consumers.
Most of the layer’s workload originates from persistent data streams, which include pre-
viously collected and stored data. The complete architecture and data flow of the scalable
data management layer are illustrated in Figure 3.2.
Figure 3.2. Architectural view of the scalable data management layer, with its three
subcomponents [1].
For the data processing component of the layer, Apache Spark [47] was selected as
the core framework, in line with the initial planning. Spark supports a variety
of programming languages, including Java, Scala, Python, and R, and is backed by ex-
tensive documentation and a large user community. These characteristics contribute to
its popularity and ease of adoption. Moreover, Spark is capable of delivering advanced
analytical outcomes, making it suitable for complex data processing tasks. While other
frameworks offer distinct strengths and limitations, Spark distinguishes itself through its
flexibility, scalability, and efficient execution times [48], as previously stated.
The scalable data management layer is structured around three main subcomponents,
which work in coordination to fulfill its objectives: the Pre-Processing and Filtering Tool,
the Virtual Data Repository, and the Virtual Data Container.
3.2.2 Pre-Processing and Filtering Tool
The Pre-Processing and Filtering Tool is the first subcomponent of the scalable data
management layer, responsible for preparing datasets received from external sources.
Designed with a generic and adaptable structure, this tool can handle a wide range of
data types effectively. Upon receiving the complete dataset, it constructs a dataframe,
which serves as a tabular representation of all collected data. To maintain data consis-
tency, the tool examines the dataset’s metadata to identify the appropriate data types for
each column. When inconsistencies are detected, it performs data type corrections, an
essential step to ensure the reliability of subsequent data processing tasks that depend
on the structural integrity of the dataset. After aligning data types, the tool initiates the
cleaning and filtering process, applying a set of standard pre-processing techniques to
prepare the dataset for further use:
• Removal of whitespaces from all cells containing string-type values, ensuring
consistent formatting.
• Conversion of empty cells and instances with the literal string ’NULL’ to NaN (Not a
Number) values across all columns.
• Deletion of records (rows) that either lack datetime values or contain invalid datetime
entries.
• Transformation of valid datetime values to the UTC format, to ensure consistency
and support standardization.
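The four cleaning operations listed above can be sketched as follows. Plain Python rows stand in for the PySpark dataframe used in the actual tool, and several details are assumptions made for illustration: whitespace removal is interpreted as trimming leading and trailing whitespace, datetimes are expected in ISO 8601 form, and timestamps without an offset are treated as already being in UTC.

```python
from datetime import datetime, timezone


def clean_rows(rows, datetime_column):
    """Apply the four listed cleaning steps to a list of row dictionaries."""
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()          # trim surrounding whitespace
            if value in ("", "NULL", None):
                value = float("nan")           # empty cells and 'NULL' become NaN
            new_row[key] = value
        try:
            # Raises TypeError for NaN and ValueError for unparseable text.
            parsed = datetime.fromisoformat(new_row.get(datetime_column))
        except (TypeError, ValueError):
            continue                           # drop rows without a valid datetime
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        new_row[datetime_column] = parsed.astimezone(timezone.utc)
        cleaned.append(new_row)
    return cleaned


rows = [
    {"ts": " 2024-03-01T12:00:00+02:00 ", "site": " A ", "value": "NULL"},
    {"ts": "", "site": "B", "value": "3"},
    {"ts": "31/12/2024", "site": "C", "value": "4"},
]
cleaned = clean_rows(rows, "ts")
print(cleaned)
```

Only the first row survives: the second has no datetime and the third has one that cannot be parsed under the assumed format.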
Before completing its execution, the Pre-Processing and Filtering Tool performs two
additional key operations to enrich the dataset with analytical insights. First, it generates
a correlation matrix, which evaluates the statistical relationships among the dataset’s
columns. This correlation data is then stored alongside the cleaned dataset in the Virtual
Data Repository (described in a subsequent subsection), enabling further analysis at later
stages. In addition, the tool carries out outlier detection on all numerical columns. For
each column assessed, a new auxiliary column is created, containing "yes" or "no" values
to indicate whether the corresponding data point is classified as an outlier. The default
method for detecting outliers is based on the three standard deviations from the mean
rule, which serves as a threshold for identifying anomalous values:
A value $x$ is considered an outlier if $|x - \mu| > 3\sigma$
where:
$x$: the data point (observation)
$\mu$: the mean of the dataset (for the column)
$\sigma$: the standard deviation of the dataset (for the column)
However, this threshold can be modified to suit specific requirements defined by users
or data consumers. It is important to note that the primary implementation of the Pre-
Processing and Filtering Tool is developed in Python, as a PySpark job, allowing seamless
integration with the Apache Spark framework while leveraging Python’s data processing
capabilities.
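A minimal sketch of the outlier flagging described above is given below, using Python’s standard library instead of PySpark. The function name `flag_outliers` and the parameter `k` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which variant is applied.

```python
from statistics import mean, pstdev


def flag_outliers(values, k=3.0):
    """Return 'yes'/'no' flags: 'yes' when |x - mean| > k * standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)   # assumption: population standard deviation
    return ["yes" if abs(x - mu) > k * sigma else "no" for x in values]


readings = [10.0] * 20 + [100.0]
flags = flag_outliers(readings)          # default three-standard-deviation rule
loose = flag_outliers(readings, k=5.0)   # a looser, user-chosen threshold
print(flags[-1], loose[-1])
```

The returned list plays the role of the auxiliary column: it has one flag per data point, and changing `k` corresponds to modifying the threshold for a specific user or data consumer.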
In summary, the Pre-Processing and Filtering Tool carries out thorough pre-processing,
cleaning, and filtering operations for each incoming dataset. During the initial phase,
the entire dataset is ingested and transformed into a structure compatible with Python,
adopting a tabular column-row format suitable for further processing. The cleaning stage
focuses on detecting and resolving data inconsistencies, including NULL values, empty
fields, outliers, and other invalid entries. Following this, a comprehensive filtering pro-
cess is applied to address all identified issues, either by replacing erroneous values or by
removing the affected records from the dataset or dataframe entirely.
3.2.3 Virtual Data Repository
The second subcomponent of the scalable data management layer is the Virtual Data
Repository (VDR), which functions as the temporary storage unit for all datasets that
have been pre-processed, cleaned, and filtered by the Pre-Processing and Filtering Tool.
Once a dataset has completed these operations, it is stored in the VDR together with its
corresponding correlation matrix, which captures inter-column relationships.
To meet the performance and scalability demands of the layer, the VDR is imple-
mented using MongoDB [49], a widely adopted document-oriented database. MongoDB
was selected for its robust support of auto-scaling, sharding, and flexible configuration,
all of which align well with the virtualization goals of the system. Furthermore, Mon-
goDB’s compatibility with Kubernetes [50], a powerful container orchestration platform,
enhances its ability to operate in dynamic, distributed environments. As a result, the VDR
is deployed within a Kubernetes cluster, benefiting from the platform’s advanced load bal-
ancing, replication, scaling, and scheduling capabilities, thereby ensuring consistent and
efficient data management.
The choice of MongoDB as the database for the Virtual Data Repository (VDR) was
made with relative ease. A document-based NoSQL database was deemed the most suit-
able solution for implementing temporary data storage within the scalable data man-
agement layer, given its alignment with the inherent characteristics of the data being
handled. This type of database is particularly well-suited for semi-structured data that
does not adhere to a rigid schema, yet follows specific formatting conventions, such as
XML, JSON, and BSON. In contrast, a relational database would require a comprehensive
understanding of the dataset structures in advance, limiting its flexibility to accommo-
date datasets with varied schemas. Given that the data used for testing is formatted in
JSON or CSV, MongoDB was chosen for its superior performance, flexibility, and
scalability. Compared to key-value NoSQL databases, document-based databases
like MongoDB are more capable of supporting complex queries and diverse entity types,
critical features for the functionality of this study’s software infrastructure.
The resulting system is a modified MongoDB setup, configured with multiple replicas
to improve its resilience and ensure continuity in the face of system failures. The overall
robustness of the system depends on key factors, such as the number of replicas and
their distribution within the cluster. Since it utilizes more than one replica, the Virtual
Data Repository (VDR) can continue to function seamlessly even if one of the MongoDB
replicas fails. In such instances, the remaining replicas maintain uninterrupted opera-
tion, protecting the data and mitigating the risk of data loss or temporary unavailability.
All MongoDB replicas are treated as a unified database within the system.
The Kubernetes platform can play an essential role in this architecture, handling load
balancing and distributing data efficiently according to optimal configuration strategies.
As a result, from the user’s perspective, the exact location of the queried data, whether
stored on a specific node or replica, remains abstracted, ensuring a smooth and consis-
tent user experience. The implementation of VDR has been thoroughly examined in a
publication from 2022 [109].
3.2.4 Virtual Data Container
The third and final subcomponent is the Virtual Data Container (VDC), which plays
a pivotal role in enabling data recipients to access the stored data in the Virtual Data
Repository (VDR). The VDC is a flexible and general-purpose subcomponent, tasked with
further processing and filtering the data according to specific queries defined by data
consumers. These filtering rules serve two main functions. First, they enable the selection
of relevant data for a particular user, effectively creating a tailored "data pool" that aligns
with the user’s specific requirements. Second, the user-based queries help identify and
remove erroneous data that are known to the users from experience, such as extreme
outliers (e.g., outdoor temperatures of minus 100 degrees Celsius), which could indicate
sensor malfunctions.
When a data consumer submits a request to the VDC, they can specify both the
filtering criteria and the desired data format, allowing for customized data transformation.
In addition, the VDC is responsible for providing essential metadata for each dataset
stored in VDR. This metadata includes details such as the dataset’s size, the number of
rows and variables, the timestamp of the last update, and more. The metadata is made
available through a RESTful API, ensuring easy access and integration for data recipients.
Figure 3.3. The information flow of the Virtual Data Container [3].
Regarding the foundation of the Virtual Data Container (VDC), Apache NiFi [110] has
been chosen as the preferred platform for building data flows, allowing the scalable data
management layer to efficiently deliver the cleaned and processed data to data consumers.
The flow responsible for implementing the VDC Rules System (which manages data pro-
cessing and filtering according to user-defined queries) is also introduced to the user
through NiFi. The actual tasks of processing, filtering, and transforming the data are
carried out by Apache Spark.
When selecting the most suitable solution, several factors were considered. The three
primary tools examined were Apache Flume [111], Kafka [112], and NiFi, all of which are
known for their strong performance, capacity for horizontal scalability, and support for
plug-in frameworks that allow functionality extensions via custom elements. Ultimately,
the decision came down to Apache Flume and Apache NiFi. While Flume is configured
through configuration files, NiFi offers a more user-friendly experience with its graphical
interface for configuring and monitoring data flows. Given the advantage of having a
visual interface and NiFi’s proven efficiency and scalability, it emerged as the final choice
for the VDC foundation.
As for the VDC’s rule structure, it has been implemented with two different approaches.
The first, selected for the final version of the framework, leverages offline large
language models that interpret natural language quality queries and convert them to
executable code; it is analysed in the next subsection. The second is designed to be
simpler and more straightforward, and is presented below. It consists of three key
elements for each rule or query: a “subject column”, an “operator”, and the “object”.
The expected format of the rules list is a JSON array, containing rules represented as
JSON objects, each including these specific string values. The VDC interprets and
applies the rules from this list to the requested dataset. The architecture of the
incoming rules JSON file is based on the following:
• A JSON object, which includes:
  - A string field denoting the dataset’s name, and another one for the dataset’s ID.
  - A JSON array containing the rules as individual JSON objects.
• Each JSON object (rule) in the array contains:
  - A string field for its name.
  - Another JSON object representing the rule itself, comprising the “subject_column”,
    “operator”, and “object” fields.
  - If the “operator” is a disjunction (using the “or” expression), the “object” field
    should be a JSON array, containing two or more objects, each with single string
    “operator” and “object” fields.
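A rules file following this structure might look like the example embedded in the sketch below, which also checks that a document has the described shape. Only the field names `subject_column`, `operator`, and `object` come from the text; the remaining key names (`dataset_name`, `dataset_id`, `rules`, `name`, `rule`), the operator symbols, and the sample values are hypothetical and serve purely as illustration.

```python
import json

# Key names other than subject_column / operator / object are hypothetical,
# as are the operator symbols; the supported operators are those of Figure 3.4.
RULES_JSON = """
{
  "dataset_name": "weather_station",
  "dataset_id": "ds-001",
  "rules": [
    {"name": "plausible_temperature",
     "rule": {"subject_column": "temperature", "operator": ">", "object": "-60"}},
    {"name": "known_sites",
     "rule": {"subject_column": "site", "operator": "or",
              "object": [{"operator": "==", "object": "A"},
                         {"operator": "==", "object": "B"}]}}
  ]
}
"""


def validate(document):
    """Return a list of structural problems; an empty list means the shape is as described."""
    problems = []
    for key in ("dataset_name", "dataset_id"):
        if not isinstance(document.get(key), str):
            problems.append(f"missing string field: {key}")
    for entry in document.get("rules", []):
        rule = entry.get("rule", {})
        if not isinstance(entry.get("name"), str):
            problems.append("rule without a name")
        if not all(k in rule for k in ("subject_column", "operator", "object")):
            problems.append(f"incomplete rule: {entry.get('name')}")
        elif rule["operator"] == "or":
            alternatives = rule["object"]
            if not (isinstance(alternatives, list) and len(alternatives) >= 2):
                problems.append(f"'or' needs two or more alternatives: {entry.get('name')}")
    return problems


document = json.loads(RULES_JSON)
problems = validate(document)
print(problems)
```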
By following this procedure (as an alternative to the main, AI-based approach presented
later on), the Virtual Data Container is able to accurately interpret and
apply the specified rules to the dataset, ensuring efficient filtering and transformation of
the data according to each data consumer’s requirements. Figure 3.4 presents the list of
supported operators that rule authors (including data scientists, developers, and
end-users) can utilize. Any rule object containing unrecognized operators is automatically
excluded from the processing pipeline. The rules system is based on two core principles,
designed to guide users in understanding the intended structure and purpose of the rules:
1. Each rule is designed to filter a single subject (i.e., one dataset column) at a time,
without combining multiple columns within the same rule. If a rule targets more
than one column, it is better categorized as a pre-processing operation rather than
a straightforward filtering task. Likewise, low-level operations, such as altering
individual row values or removing rows based on specific value types, fall outside
the scope of this filtering-focused system.
2. To maintain the generic nature of the scalable data management layer, the rules
must conform to a standard subject–operator–object format. Introducing complex
rules that deviate from this structure would undermine the layer’s reusability across
diverse datasets. While dataset-specific pre-processing logic can be implemented
with minimal effort, such conditional handling (e.g., “if dataset X, apply logic Y”)
would compromise the layer’s core design principle of broad applicability and mod-
ularity.
Figure 3.4. Accepted operators by VDC’s Rules System [1].
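To make the rules format concrete, the following minimal sketch builds one rules document with the structure described above and applies it to a few in-memory rows. The operator names (`greater_than`, `less_than`, `equals`), the field names `dataset_name`, `dataset_id`, `rules`, `name`, and `rule`, and the sample data are illustrative assumptions. The operators actually accepted by the VDC are those listed in Figure 3.4, and the VDC applies rules to Spark datasets rather than Python lists.

```python
import json
import operator

# Hypothetical operator table; the real accepted operators are those in Figure 3.4.
OPERATORS = {
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "equals": operator.eq,
}

# Field names other than subject_column/operator/object are assumed for illustration.
RULES_JSON = """
{
  "dataset_name": "madrid_weather",
  "dataset_id": "ds-001",
  "rules": [
    {"name": "warm_days",
     "rule": {"subject_column": "max_temp", "operator": "greater_than", "object": "25"}},
    {"name": "dry_or_very_wet",
     "rule": {"subject_column": "rain_mm", "operator": "or",
              "object": [{"operator": "equals", "object": "0"},
                         {"operator": "greater_than", "object": "20"}]}},
    {"name": "unsupported",
     "rule": {"subject_column": "max_temp", "operator": "resembles", "object": "30"}}
  ]
}
"""


def build_predicate(rule):
    """Return a function value -> bool, or None if an operator is unrecognized."""
    op, obj = rule["operator"], rule["object"]
    if op == "or":
        parts = [build_predicate(sub) for sub in obj]
        if any(part is None for part in parts):
            return None
        return lambda value: any(part(value) for part in parts)
    if op not in OPERATORS:
        return None
    threshold = float(obj)  # objects arrive as strings; numeric columns assumed here
    return lambda value: OPERATORS[op](value, threshold)


def apply_rules(rows, rules_doc):
    """Filter rows by every recognized rule; rules with unknown operators are skipped."""
    applied = []
    for entry in rules_doc["rules"]:
        rule = entry["rule"]
        predicate = build_predicate(rule)
        if predicate is None:
            continue  # excluded from the processing pipeline
        column = rule["subject_column"]
        rows = [row for row in rows if predicate(row[column])]
        applied.append(entry["name"])
    return rows, applied


rows = [
    {"max_temp": 31.0, "rain_mm": 0.0},
    {"max_temp": 28.0, "rain_mm": 5.0},
    {"max_temp": 18.0, "rain_mm": 0.0},
    {"max_temp": 27.0, "rain_mm": 42.0},
]
filtered, applied = apply_rules(rows, json.loads(RULES_JSON))
```

Each rule touches exactly one column, which mirrors the first principle above, and the rule with the unknown operator is dropped without affecting the others.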
3.3 The Final System with the Utilization of Offline LLMs
As previously mentioned, one main objective is to design a scalable framework for data
profiling and analysis that can effectively handle datasets of various sizes. This thesis’s
final system implementation aims to allow end-users to submit analytical queries using
natural language. These queries are then interpreted and converted into executable code
by an offline large language model (LLM). The resulting code is passed to the aforemen-
tioned data management and processing layer for execution. This process enables users
to perform analytical tasks simply by describing them in plain language, enhancing ac-
cessibility and ease of use. While the underlying data platform can be securely deployed,
the integration of an LLM in this context requires further validation. Although tradi-
tional statistical techniques are not applied in this study, methodological rigor is ensured
through a well-structured experimental design.
In brief, the final part of this study begins by establishing its primary aim: to assess
the capability of an offline LLM in generating valid code for data analysis tasks. The next
step involves selecting suitable datasets for testing and evaluation. This is followed by the
formulation of five unique queries per dataset. An execution plan is then developed, which
outlines the number of test iterations and defines the communication process between
the LLM and the data processing system. Evaluation metrics are also specified, with
justification provided for their use. Following this, results are collected and analyzed to
evaluate the LLM’s performance. Finally, conclusions are drawn based on the findings,
and suggestions for future improvements are presented.
In detail, the methodology followed in this study is outlined as follows:
1. Objective Definition: This final and main part of this study aims to evaluate the
effectiveness and efficiency of an offline large language model in generating code for
data analysis tasks. The LLMs used in this evaluation are Codestral by Mistral AI
and Qwen 2.5 Coder by Alibaba. These models are tested on their ability to produce
Python code using the PySpark framework, focusing on data profiling and analytical
operations. The reasons behind the selection of Codestral, Qwen, and PySpark are
discussed in subsection 3.4.1.
2. Dataset Selection: Five publicly available datasets were chosen to test and validate
the proposed approach. These datasets cover a range of domains, including social
media analytics, weather data, and supermarket sales records. All selected datasets
are introduced in subsection 3.3.2. Using multiple datasets instead of a single
one helps evaluate the LLMs’ ability to generalize across different data types and
contexts, ensuring that their performance is not limited to one domain.
3. Query Design: For each of the five datasets, a set of five queries has been created,
resulting in a total of 25 queries used for testing. These queries are grouped into
three categories based on their complexity: “Basic”, “Intermediate”, and “Advanced”.
This classification reflects the level of difficulty in the code generation tasks required
from the LLM. Their definition is as follows:
Basic: These queries focus on simple tasks such as filtering data, counting
records, or extracting distinct values. They require fundamental operations
that help reveal the basic structure and contents of the dataset.
Intermediate: These queries involve moderately complex tasks, including group-
ing data, calculating aggregates, or performing basic arithmetic functions. Al-
though more involved than basic queries, they still rely on standard data pro-
cessing methods.
Advanced: These queries address more sophisticated operations, such as
multi-level grouping, column transformations (e.g., merging or exploding), and
performing statistical analyses. They are designed to assess the LLM’s capabil-
ity to generate code for advanced data manipulation, and to extract meaningful
insights from the data.
The full set of 25 authored queries is presented in subsection 3.4.3.
4. Execution Plan: Each of the 25 queries will be submitted to the offline LLM ten times.
During each iteration, the entire process will be carried out, and the outcomes will
be documented. This repetition
aims to evaluate the model’s consistency by observing variations in the generated
code, as well as the stability and reproducibility of the outputs. This approach is in
line with established experimental practices, where repeated testing enhances the
reliability of findings [51]. Overall, 250 tests will be conducted per model (ten for each
of the 25 queries), based on the datasets selected for this study.
5. Evaluation Metrics: The outcome of each test will be assessed using a set of defined
evaluation metrics. Once all 250 tests are completed, the results will be compiled
into a unified evaluation dataset. As the study is carried out independently for two
offline LLMs, the total number of tests will reach 500. Each LLM’s performance will
be analyzed based on the following evaluation criteria:
Functional Correctness: This metric evaluates whether the generated code
produces the expected output. In other words, it assesses if the code success-
fully delivers the result that the user intended, based on the original natural
language query.
Readability: This metric reflects how easily a human can read and understand
the generated code. The score is calculated using a custom function imple-
mented within the data processing platform, described in Appendix A.1. The
function considers three key aspects: line length, the depth of method call
chains, and the level of code nesting. Penalties are applied to code with lines
exceeding 80 characters, method chains with more than three calls, and nest-
ing deeper than two levels. The function returns a score from 1 to 3, where a
score of 3 indicates high readability.
Efficiency: This refers to how efficiently the code generation process utilizes
computational resources. Key performance indicators include response time,
GPU and CPU usage, and memory consumption for both processing units.
These are monitored using a Python-based resource tracking tool. This metric
helps evaluate the computational cost associated with each test.
Contextual Performance: This metric assesses how well the LLM handles
queries of varying complexity levels—Basic, Intermediate, and Advanced—as
previously defined. The results provide insight into the model’s ability to adapt
to different contextual demands, and allow for performance comparisons across
these categories.
Automation: This metric evaluates whether the code produced by the LLM
can be executed automatically or with minimal manual adjustments. In some
instances, the model may include natural language explanations outside of
Python comments, despite explicit instructions to avoid such outputs. Ad-
ditionally, the LLM may use different variable names than those specified in
the prompt. When such deviations occur, manual intervention is required to
modify or clarify the generated code before proceeding.
Error Handling: This metric checks whether the generated code runs without
errors. It identifies any issues encountered during execution, ranging from
minor warnings to critical errors that could interrupt or terminate the process.
6. Data Collection and Analysis: The final step of the methodology involves collecting
the results from all 250 tests of each LLM, combining them into a unified dataset, and
conducting an exploratory analysis. Each LLM will therefore have its own dataset
containing 250 rows. This analysis will offer valuable insights into the LLM’s
performance across the various evaluation metrics. Ultimately, the primary question of
this part of the thesis will be assessed: whether offline LLMs can effectively support
data science by generating code for data analysis tasks.
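The Readability metric above can be illustrated with a short sketch. This is not the function from Appendix A.1; it only mirrors the three stated checks (lines over 80 characters, method chains with more than three calls, nesting deeper than two levels). How the penalties map onto the 1 to 3 score is an assumption here: the sketch starts from 3 and subtracts one point per violated check, with a floor of 1. The chain and nesting estimates (counting `.name(` occurrences per line, and a four-space indentation unit) are likewise simplifications.

```python
import re

MAX_LINE_LENGTH = 80   # thresholds stated in the text
MAX_CHAIN_CALLS = 3
MAX_NESTING = 2
INDENT_WIDTH = 4       # assumed indentation unit


def readability_score(code: str) -> int:
    """Return 3 (most readable) down to 1, losing one point per violated check."""
    lines = [line for line in code.splitlines() if line.strip()]
    too_long = any(len(line) > MAX_LINE_LENGTH for line in lines)
    # Rough chain depth: number of ".method(" occurrences on a single line.
    too_chained = any(
        len(re.findall(r"\.\w+\(", line)) > MAX_CHAIN_CALLS for line in lines
    )
    # Rough nesting depth: leading spaces divided by the indentation unit.
    too_nested = any(
        (len(line) - len(line.lstrip(" "))) // INDENT_WIDTH > MAX_NESTING
        for line in lines
    )
    penalties = sum([too_long, too_chained, too_nested])
    return max(1, 3 - penalties)


simple = 'processeddf = df.filter(df["Temperature"] > 30)'
chained = 'processeddf = df.filter(df["a"] > 1).groupBy("b").count().orderBy("count")'
print(readability_score(simple), readability_score(chained))
```

The first command violates none of the checks, while the second chains four calls on one line and therefore loses one point under the assumed scoring.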
3.3.1 System Architecture
Figure 3.5. The proposed system’s final architecture overview [4].
Figure 3.5 provides a high-level overview of the final architecture utilized in this the-
sis’s study. The process begins when the end-user submits a query in natural language. A
relevant dataset is pre-selected, which determines the appropriate dataset
summary, as well as the main dataset file necessary for code execution within the
data processing platform. The user’s query is then combined with the dataset summary
into a single message, which is forwarded to the LLM during the prompt engineering
phase, as detailed in subsection 3.4.2. This prompt not only merges the query and the
summary, but also supplies crucial instructions to the Codestral and Qwen models.
Once the LLM receives the message, the response generation process begins. Dur-
ing this phase, a simple Python script monitors the computational resources of the LLM
server. After the model generates its response, it is sent back through the communication
pipeline, along with the monitoring data. The end-user then reviews the response to determine
whether the code can proceed directly to the next phase (full automation),
or whether it requires minor adjustments, such as the removal of extraneous text that the
LLM did not mark as a comment, in which case the process is classified as semi-automated.
Following this step, the generated code and monitoring results are consolidated into a
single JSON object (see Listing 3.1). The LLM communication pipeline then concludes by
transmitting the newly formed object to the data processing platform, which triggers the
execution of a new Python Spark job. Subsequently, the pre-defined dataset is retrieved
and processed accordingly, while the received object is parsed to extract the generated
code, along with the performance monitoring results.
Listing 3.1: Code from the LLM communication pipeline, preparing to forward the gener-
ated code and performance metrics to the data processing platform.
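# Forward the generated code and the LLM server's monitoring results
# to the data processing platform, which starts a new Spark job.
submit_spark_job_kpi(
    spark_master_ip="<spark ip here>",
    code_repetition_id=crid,
    dataset="netflix",
    user_query=user_message,
    generated_code=llm_response,
    llm_response_cpu=absolute_cpu,
    llm_response_memory=absolute_memory,
    llm_response_gpu=avg_gpu,
    llm_response_response_time=response_time,
    llm_response_gpu_mem=avg_gpu_mem,
    # One llm_version entry per model (both are commented out in the original listing):
    # llm_version="Codestral V0 1 22B Q6_K",
    # llm_version="Qwen 2.5 Coder 14B Q6_K",
    output_format="csv",
    output_directory="<provided output directory>" + str(repetition_number),
    # "true" for full automation, "false" when manual adjustment was needed:
    automated="true",
    # automated="false",
)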
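The `automated` flag in Listing 3.1 records whether the response could be used as-is. As a hedged illustration of what that decision involves, the sketch below checks two of the conditions described for the Automation metric: that the response parses as Python (so no natural language text appears outside `#` comments) and that it assigns the expected `processeddf` variable. The function name `looks_fully_automated` is hypothetical, and in the study this judgement is made by the end-user rather than by code.

```python
import ast


def looks_fully_automated(llm_response: str, result_name: str = "processeddf") -> bool:
    """Return True if the response is valid Python that assigns `result_name`."""
    try:
        tree = ast.parse(llm_response)  # parses only; nothing is executed
    except SyntaxError:
        return False  # e.g. prose outside comments
    assigned = {
        target.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Assign)
        for target in node.targets
        if isinstance(target, ast.Name)
    }
    return result_name in assigned


clean = '# keep hot days\nprocesseddf = df.filter(df["Temperature"] > 30)'
chatty = 'Here is the code you asked for:\nprocesseddf = df.filter(df["Temperature"] > 30)'
renamed = 'result = df.filter(df["Temperature"] > 30)'

for response in (clean, chatty, renamed):
    print(looks_fully_automated(response))
```

A check of this kind could only pre-screen responses; it says nothing about whether the code produces the intended result, which is what the Functional Correctness review covers.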
The next step involves the application of the generated data analysis code, which
executes the specified commands on the loaded dataset, in order to produce the desired
analysis results. Once this step is completed, the analysis results are extracted. A final
JSON object is then created as the result record for each test; it includes the LLM server’s
monitoring metrics, the generated code, and other relevant attributes of the process. This
object, along with the analysis results from the code application step, is stored locally.
Both files are reviewed by the end-user, who assesses the analysis results to determine
whether they are accurate and align with the intended outcome, based on the context of
the original query. In either case, a functional correctness flag is added
to the test result object: ’True’ for correct results, and ’False’ for incorrect
ones. This step concludes the test flow, as shown in Figure 3.5. As previously mentioned,
five queries will be tested for each of the five datasets, with ten iterations per query. This
process will be repeated twice, once for each offline LLM, resulting in a total of 500 tests:
250 for each model.
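The test count stated above follows directly from the design: 5 datasets, 5 queries per dataset, 10 iterations per query, and 2 models. The sketch below enumerates that plan. The dataset and model names come from the text; the query labels `Q1` to `Q5` and the record layout are assumptions made only for illustration.

```python
from itertools import product

MODELS = ["Codestral 22B", "Qwen 2.5 Coder 14B"]
DATASETS = [
    "Netflix: Movies and TV Shows",
    "COVID-19 Twitter",
    "Shared Cars Locations",
    "Madrid Daily Weather",
    "Supermarket Sales",
]
QUERIES_PER_DATASET = 5
ITERATIONS_PER_QUERY = 10

test_plan = [
    {
        "model": model,
        "dataset": dataset,
        "query": f"Q{query}",            # placeholder label, not an actual query
        "iteration": iteration,
        "functional_correctness": None,  # set to True/False after manual review
    }
    for model, dataset, query, iteration in product(
        MODELS,
        DATASETS,
        range(1, QUERIES_PER_DATASET + 1),
        range(1, ITERATIONS_PER_QUERY + 1),
    )
]

tests_per_model = {
    model: sum(1 for test in test_plan if test["model"] == model) for model in MODELS
}
print(len(test_plan), tests_per_model)
```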
3.3.2 Dataset Selection
In this study, five different datasets were selected for the evaluation process. Each
dataset is briefly described in the prompt message provided to the LLM, as discussed in
subsection 3.4.2. The code produced by the LLM is then executed through a PySpark job,
which loads the corresponding dataset and runs the generated code. Although Apache
Spark (the underlying framework of the proposed platform) is designed to manage large-scale
data, the size of the datasets was not a selection criterion for this part of the
thesis. This is because the focus here is on the LLM’s ability to generate
code for data analytics tasks, rather than on processing entire datasets. The datasets are
not loaded into the LLM itself, which makes their size largely irrelevant to the evaluation.
The datasets were sourced from Kaggle and Maven Analytics’ Data Playground, both of
which provide free and publicly accessible data.
The first dataset used is titled “Netflix: Movies and TV Shows”, available on Kaggle
[113]. It includes detailed records of nearly 9,000 titles available on Netflix, covering both
movies and television shows. Each entry is identified by a unique ID and categorized as
either a ‘Movie’ or a ‘TV Show’. The dataset includes the title, director, and cast, offering
insights into the creative contributors behind each work. It also specifies the country or
countries of production for each entry. The dataset provides the date each title was added
to the platform, as well as its original release year. A rating field classifies the content
based on age suitability. The duration is noted in minutes for movies and in seasons
for TV shows. Furthermore, a genre column groups titles by content type, and a short
description summarizes the plot or main theme of each entry. This dataset can support
the exploration of patterns and trends in Netflix’s content offerings.
The second dataset selected is titled “COVID-19 Twitter” and is available on Kaggle
[114]. It contains a large collection of tweets related to the COVID-19 pandemic, gathered
during three time periods: April–June 2020, August–October 2020, and April–June 2021.
The dataset includes nearly one million tweets, each identified by a unique ID and times-
tamp. A column labeled ‘source’ indicates whether the tweet was posted using an Android
or iPhone device. For each tweet, both the original and a cleaned version of the text are
provided, along with the language in which it was written. The dataset also includes
engagement metrics, such as the number of likes (favorites) and retweets. Additional
fields identify the author of the tweet, as well as any hashtags or user mentions. A ‘place’
column records the location associated with the tweet. Furthermore, the dataset includes
sentiment analysis scores (compound, negative, neutral, and positive values),
which contribute to an overall sentiment classification. This dataset is well-suited for
testing data analysis queries in the context of social media content.
The third dataset, named “Shared Cars Locations,” is also sourced from Kaggle [115].
It contains over 20 million records related to the AutoTel shared car project, which was
implemented in Tel Aviv [116]. Each entry includes geographic coordinates (latitude and
longitude) that show the exact location of shared vehicles across the city. The dataset also
indicates how many cars are available at each location, and lists the specific car identifiers
present. A timestamp is included for every entry, showing when the data was collected.
This dataset enables location-based analysis, and can be used to examine patterns in car
availability and usage density across different urban areas.
The fourth dataset, titled “Madrid Daily Weather”, was obtained from Maven Analyt-
ics’ Data Playground [117]. It contains nearly 7,000 records of daily weather data for
the city of Madrid, covering the years 1997 to 2015. Each entry corresponds to a specific
date, recorded in Central European Time (CET), and includes various meteorological mea-
surements. These include the maximum, mean, and minimum temperatures in degrees
Celsius, as well as dew point values and their fluctuations. Humidity levels—maximum,
mean, and minimum—are also recorded. Additional variables include sea level pressure,
visibility, and wind speed data, with maximum gust speeds noted as well. The dataset
also reports precipitation amounts, cloud cover, and the presence of weather events such
as rain, snow, or fog. Wind direction is provided in degrees. This dataset offers a valuable
resource for studying weather patterns and trends in a specific urban region, over an
extended period.
The fifth dataset, named “Supermarket Sales”, is available on Kaggle [118]. It contains
1,000 records of individual sales transactions from a supermarket chain, collected over
a three-month period across three different branches. Each transaction is identified by
a unique invoice ID, and includes information about the branch and the city where the
sale took place. The dataset also records the customer type (‘Member’ or ‘Normal’) and
the gender of the customer. Products are grouped by product line, and each sale includes
the unit price, quantity purchased, and the amount of tax applied. The total amount of
each transaction is also calculated. Additionally, the dataset provides the date and time
of each purchase, the payment method used, and financial indicators such as cost of
goods sold (COGS), gross margin percentage, and gross income. Customer satisfaction is
represented through a rating score for each transaction. This dataset supports analysis
of consumer behavior, sales trends, and financial performance within the retail sector.
To strengthen the diversity and reliability of the evaluation process, the five datasets
were selected based on two main criteria. First, they needed to be open and
publicly accessible, to support reproducibility. Second, they were chosen to represent
different domains, ensuring a varied dataset collection. These domains cover a wide
range of fields: entertainment
(Netflix: Movies and TV Shows), social media (COVID-19 Twitter), transportation (Shared
Cars Locations), weather (Madrid Daily Weather), and retail (Supermarket Sales). This
diverse set of datasets reflects the complexity of real-world data analysis tasks, and aims
to help reduce potential bias that may arise from using a narrow or homogeneous data
selection [119]. Such an approach is essential for fairly evaluating the performance of
offline LLMs across different application areas, while maintaining the reproducibility and
objectivity of the study.
These datasets form the foundation of the evaluation process for the proposed sys-
tem. From each dataset, five natural language queries will be selected. These queries,
presented in subsection 3.4.3, will be used to assess the ability of the offline LLM to
generate code for data profiling and analysis tasks.
3.4 Technical Components
As briefly discussed in section 3.3, the study presented in this part of the thesis
involves the use of an offline large language model (LLM), selected for its ability to
generate code. Alongside the model, a data processing platform has been developed to
manage datasets effectively and execute the code produced by the LLM. This platform
is built on the scalable data management layer, which was presented in
section 3.2. The interaction between the language model and the processing platform is
facilitated through a carefully constructed prompt, designed as part of the study’s prompt
engineering process. The evaluation includes a set of dataset-specific queries, which are
organized by dataset and assessed based on their level of complexity.
3.4.1 LLMs and Data Processing Platform
The offline LLMs chosen for this study are Codestral by Mistral AI [52] and Qwen 2.5
Coder by Alibaba [53]. Codestral is a 22-billion parameter model developed specifically
for generating code. It has been fine-tuned on more than 80 programming languages,
ranging from widely used languages such as Python, Java, and C++, to more specialized
ones like Fortran and Swift. This broad training enables it to handle a wide range of
programming tasks across various environments. Codestral has demonstrated strong
performance on several standard benchmarks, including HumanEval [120], RepoBench
[121], and CruxEval [122], excelling in both code completion and error reduction. At the
time of writing, it outperforms other major models, such as CodeLlama 70B [123], DeepSeek
Coder 33B [124], and LLaMA 3 70B [125], on most evaluation tasks [126]. Additionally,
Codestral offers a large context window of 32,000 tokens, which allows it to process longer
and more complex codebases. This feature is especially valuable for tasks involving full-
code repositories and long-range dependencies.
Qwen 2.5 Coder, developed by Alibaba, is a specialized model family for code-related
tasks. The 14-billion parameter version used in this study delivers competitive perfor-
mance across many code generation benchmarks, even outperforming larger models like
this study’s Codestral 22B and DeepSeek Coder 33B [124] in several areas [127]. Based
on the Qwen 2.5 architecture, this model has been trained on over 5.5 trillion tokens, col-
lected from public code repositories and web sources. It supports various programming
languages, and shows strong capabilities in both general reasoning and mathematical
problem-solving. Qwen 2.5 Coder 14B has achieved notable results in evaluations such
as HumanEval, MBPP, and BigCodeBench, demonstrating strengths in code generation,
completion, and repair. Its extensive training and instruction tuning make it suitable for
use in both coding assistants and practical applications. It is also worth mentioning that
the larger 32B version of Qwen 2.5 Coder has outperformed even GPT-4o in some benchmark
tests [128]. However, the 14B variant was selected for this study because of the hardware
limitations of the system used.
Codestral and Qwen exhibit strong performance characteristics, establishing a high
standard in code generation, while maintaining low response times. This is an essential
factor for developers who rely on quick feedback during the development process.
Benchmark results highlight the effectiveness of both models in tasks such as Python
and SQL code generation, as well as fill-in-the-middle (FIM) operations, which are crucial
for efficiently completing partial code segments. Their open-weight availability, under the
Mistral AI Non-Production and Apache 2.0 licenses (respectively), enables straightforward
offline deployment, making them suitable for research and non-commercial applications
without requiring constant internet access. This is particularly advantageous for devel-
opers and institutions with data privacy concerns, or limited online connectivity. The
models’ combination of speed, precision, multilingual support, and offline accessibility,
makes them well-suited for this study’s focus on offline code-oriented language models.
Continuing from this analysis, despite having fewer parameters compared to some
larger models, both Codestral and Qwen consistently achieve strong results among offline
LLMs. Specifically, the Qwen 2.5 Coder 7B model frequently surpasses the performance
of CodeLlama 70B and, in some cases, outperforms Codestral 22B. Given these outcomes,
comparing them with smaller versions of already outperformed models (such as
lower-parameter variants of CodeLlama) would offer limited scientific value and could
introduce methodological inaccuracies. Furthermore, the hardware limitations in this
study restrict the evaluation of models exceeding 30 billion parameters. These factors
collectively justify the selection of Codestral 22B and Qwen 2.5 Coder 14B, as their pa-
rameter sizes and demonstrated capabilities enable a thorough and balanced evaluation
within the scope of the available computational resources.
It is important to acknowledge that, given the rapid advancements in LLM research,
newer models may eventually outperform Codestral and Qwen. As a result, ongoing
evaluation of emerging architectures remains essential. This thesis’s study investigates
the broader capabilities of offline LLMs in the context of data science, using Codestral
and Qwen as representative examples. Since the methodology is not tied to any specific
model, the findings and approaches presented here can be extended to future models as
well. The decision to focus on offline LLMs, as opposed to online alternatives, is driven
by several factors. Primarily, offline models allow local data processing, which enhances
privacy and security, while supporting compliance with data protection regulations. This
setup also reduces the risks associated with transmitting sensitive information to external
servers.
In addition, offline LLMs offer greater flexibility and control, as they can be fine-tuned
for specific tasks and seamlessly integrated into tailored software systems. These capabil-
ities are often restricted in commercial, cloud-based solutions. From a cost perspective,
offline models can also be more economical over time, as they do not incur recurring
subscription fees or usage-based charges. Their independence from internet access im-
proves reliability and ensures low latency in time-sensitive applications, while also insu-
lating users from potential disruptions caused by changes in third-party service offerings.
Moreover, offline deployment supports ethical and legal alignment, allowing institutions
to adapt models to comply with regional laws and standards. Finally, offline LLMs create
an open environment for experimentation and innovation, enabling researchers to explore
new ideas without the constraints imposed by proprietary platforms.
Although online LLMs are often recognized for their high accuracy and quick response
times (benefits largely due to their cloud-based, dynamically updated infrastructure),
these aspects were not the primary concern of this research. As previously discussed, a
key objective of this study is to ensure that both data and associated metadata remain
within a locally controlled environment. This approach prioritizes privacy, regulatory
compliance, and full control over the data processing pipeline. While cloud-hosted models
may offer performance benefits in certain scenarios, their dependence on external servers
and stable internet connectivity introduces uncontrolled variables that are beyond the
intended scope of this research.
As outlined earlier, the offline LLM in this framework is designed to return code-based
responses, which are then executed within a data processing platform, based on the
already presented scalable data management layer. For the purposes of this study,
Apache Spark [47] is selected as the most suitable tool for managing and executing data
operations. As justified earlier, Apache Spark is a powerful
platform optimized for high-performance data processing. It supports in-memory compu-
tation, significantly improving execution speed, particularly for tasks involving repeated
iterations. Spark is highly scalable, capable of efficiently handling large datasets across
distributed computing clusters. Its flexibility is another major advantage, as it supports
a broad range of processing tasks, including batch processing, real-time data streaming,
machine learning, and graph analytics, making it a comprehensive solution for scalable
data workflows.
When integrated with Python through the PySpark interface, Apache Spark becomes
more user-friendly. PySpark allows users to leverage the simplicity and flexibility of
Python, while utilizing Spark’s powerful distributed computing capabilities. This integration
enables efficient execution of a variety of tasks, including complex data transformations,
machine learning workflows, and the processing of large-scale datasets, all within
the familiar Python ecosystem, which benefits from extensive libraries and strong community
support. In the context of this study, the code generated by the language
model is executed through PySpark jobs, running directly within the Spark environment.
Apache Spark has been widely adopted in numerous big data management frameworks,
due to its scalability, efficient resource allocation, and straightforward deployment pro-
cess. These features make it a reliable choice for enabling effective data management and
analysis across different domains [129] [1].
3.4.2 Prompt Formulation
In the offline setting of this study, the main optimization is achieved through the
careful design of prompts, which provide clear and detailed instructions to the language
model. Since the models used are Instruct variants, they are specifically built to respond
more effectively to well-structured prompts, which helps improve their performance and
the accuracy of the generated commands. By clearly stating the task and offering
the necessary context, this approach ensures that the model’s output is both relevant
and suited to the dataset being used. As such, the emphasis on prompt design is a
strategic choice for enhancing performance in an offline environment, where conventional
optimization methods are limited.
Beyond improving performance, the structured prompt design also helps reduce the
likelihood of biased outputs. This is achieved by supplying explicit guidance and sufficient
context to the model. Combined with the use of varied datasets covering a range
of domains, this approach reduces ambiguity and lowers the chances of unintended
results. Additionally, the current study’s repeated testing process allows for consistent
evaluation of the outputs, making it easier to detect and correct any deviations or potential
biases.
Based on this context, in order to improve the code generation capabilities of the
Codestral and Qwen language models, a specifically designed prompt message has been
developed to initiate interaction with each model, as shown in Listing 3.2. Each time
a user submits a natural language query, it is combined with this prompt message to
guide the model in producing accurate and relevant code. This method follows the ‘few-shot’
prompt engineering strategy: the prompt supplies a worked input/output example, together
with key details about the dataset being analyzed. The prompt includes a concise overview of the dataset’s
structure, format, and columns, along with descriptions of each column. This ensures
that the code generated is effectively tailored to the specific characteristics of the dataset.
The full prompt message is presented below:
Listing 3.2: The prompt message authored for this study’s interaction with each LLM.
# Inner double quotes are escaped so that the literal is valid Python.
system_message = (
    "<s>[INST] You are a virtual assistant specialized in generating Python Spark "
    "commands based on the context provided below. Your task is to output only "
    "Python Spark commands. Do not add additional text or explanation. Strictly use "
    "only commands that can be applied to a Spark Dataframe. Refer to the Spark "
    "DataFrame as 'df', and store the results in the Spark Dataframe named "
    "'processeddf'. Context: [/INST]"
    + str(dataset_summary)
    + "[INST] Example input: 'filter the data where temperature is in the bottom "
    "20% of the total temperature values'. Example output: 'processeddf = "
    "df.filter(df[\"Temperature\"] <= df.approxQuantile(\"Temperature\", [0.2], "
    "0.0)[0])'. Ensure that no backslashes or escape characters are included in "
    "the generated commands. Generate efficient commands, using as few steps as "
    "possible. Separate commands with a semicolon or newline. If you provide text "
    "or explanation, do it using python commenting, with '#'. If the input is "
    "completely incomprehensible, respond with 'LLMERROR:'. This should be rare. "
    "Otherwise, generate only the required Python Spark commands. [/INST]</s>"
)
The prompt message presented includes both a series of instructions directed at the
language model, as well as contextual information (clearly specified as ‘context’) related
to the dataset for which the generated code is intended. The main components of the
prompt’s structure can be summarized as follows:
Objective Definition: The prompt should explicitly state the intended goal for the
language model. In this case, the objective is to generate Python Spark commands
without any explanatory text. This helps the model recognize the specific scope and
purpose of its task. It is important to highlight that generating PySpark code intro-
duces additional complexity, as PySpark DataFrame operations differ in subtle ways
from those used in Pandas. The model is also assessed on its ability to distinguish
between the two, and produce valid PySpark DataFrame code exclusively.
Context Provision: The prompt must supply the language model with adequate con-
text to perform the task accurately. This involves summarizing the dataset’s struc-
ture and content that the model will use. Through the inclusion of this background
information, the prompt equips the model to generate suitable outputs, specifically
targeting a Spark DataFrame named ‘df’.
Instruction Clarity: The prompt must include clear and specific instructions, out-
lining both what the model is expected to do and what it should avoid. For instance,
it is instructed to generate only Python Spark commands, without any accompany-
ing explanations. The prompt also directs the model to separate commands using
either semicolons or newlines. Additionally, it provides guidance on how to handle
potential errors and emphasizes the need for concise and efficient code generation.
Examples Inclusion: Including illustrative examples within the prompt helps demon-
strate the expected input-output format. These examples assist the model in under-
standing the style, structure, and type of commands required. With the inclusion
of representative samples, the prompt helps align the model’s responses with the
intended output format and content.
Error Handling: The prompt should include clear instructions for situations where
the input cannot be understood. This provides the model with a predefined strat-
egy for managing unexpected or ambiguous input. By addressing error handling
within the prompt, the model is better equipped to respond appropriately when it
encounters input it cannot interpret, thus improving the reliability of its responses.
Message Formatting: The prompt should comply with any formatting rules required
by the specific language model or system in use. In this case, the prompt includes
formatting elements, such as the ‘[INST]’ tag, to ensure proper interpretation by the
model. Adhering to these conventions supports accurate parsing of the instructions,
and contributes to more consistent and reliable code generation.
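The components above can be illustrated with a minimal sketch of how a user query might be attached to the prompt message, and how a model response might be checked statically before it is executed. The helper names `build_prompt` and `check_response`, the way the query is wrapped in `[INST]` tags, and the short list of Pandas-only attributes are assumptions made for illustration; the study's actual integration code is not shown in this section.

```python
import ast

ERROR_TOKEN = "LLMERROR:"  # error marker as it appears in Listing 3.2
# Illustrative, non-exhaustive attributes found on Pandas DataFrames
# but not on a plain PySpark DataFrame (an assumption of this sketch).
PANDAS_ONLY = {"iloc", "loc", "sort_values", "value_counts", "reset_index"}


def build_prompt(system_message, user_query):
    """Hypothetical helper: append the user's query to the fixed prompt."""
    return f"{system_message}[INST] {user_query.strip()} [/INST]"


def check_response(response):
    """Return a list of rule violations; an empty list means none were found."""
    text = response.strip()
    if text.startswith(ERROR_TOKEN):
        return ["model reported incomprehensible input"]
    problems = []
    if "\\" in text:
        problems.append("contains a backslash")
    try:
        tree = ast.parse(text)  # handles ';' and newline separators alike
    except SyntaxError:
        return problems + ["not valid Python"]
    nodes = list(ast.walk(tree))
    names = {n.id for n in nodes if isinstance(n, ast.Name)}
    assigned = {
        t.id
        for n in nodes if isinstance(n, ast.Assign)
        for t in n.targets if isinstance(t, ast.Name)
    }
    attrs = {n.attr for n in nodes if isinstance(n, ast.Attribute)}
    if "df" not in names:
        problems.append("does not reference 'df'")
    if "processeddf" not in assigned:
        problems.append("does not assign 'processeddf'")
    for attr in sorted(attrs & PANDAS_ONLY):
        problems.append(f"uses Pandas-only attribute '{attr}'")
    return problems
```

A check of this kind only inspects the syntax of the response; it cannot confirm that the command is functionally correct, which still requires executing it against the Spark DataFrame.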
As shown in the middle section of Listing 3.2, the prompt message also includes a
variable named ‘dataset_summary’. This variable holds a summary of the dataset being
used at each instance. To support this, five dataset summaries have been carefully
prepared, ensuring they accurately capture the key characteristics of each dataset. Each
summary follows a uniform format, beginning with an introductory statement about the
dataset, followed by a detailed description of each column. This includes the column
name, data type, a brief explanation, and representative sample values. Additionally,
a note is included to clarify that the sample data is synthetic and provided solely for
illustrative purposes. This structured approach is intended to help the language models
better understand the dataset’s format and content, facilitating more accurate analysis.
An example of a column from the “Shared Cars Locations” dataset is provided in Listing
3.3 below:
Listing 3.3: The ’timestamp’ column of the "Shared Cars Locations" dataset, part of the
dataset’s summary.
{
    "name": "timestamp",
    "type": "object",
    "simplified_type": "datetime",
    "description": "Timestamp of the data record in UTC.",
    "sample_values": [
        "2019-12-06 21:51:02 UTC",
        "2019-11-29 14:00:02 UTC",
        "2019-09-20 13:21:03 UTC"
    ],
    "datetime_format": "%Y-%m-%d %H:%M:%S %Z"
}
In the example presented, the ‘timestamp’ column records the exact date and time
at which each data entry was captured. It is stored as an object type, with a simplified
type label of ‘datetime’ to indicate that it contains timestamp values. The description
notes that all timestamps follow Coordinated Universal Time (UTC), and the format used
is “%Y-%m-%d %H:%M:%S %Z” (year-month-day hour:minute:second timezone). The full
dataset summary for the “Shared Cars Locations” dataset is available in Appendix A.2.
This level of detail is particularly important in an offline setting, as it allows the language
model to interpret the dataset structure accurately. When used as part of the overall
prompt, these summaries help the model generate code that is appropriately adapted to
the specific attributes and data formats involved.
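To make the role of the `datetime_format` field concrete, the following minimal sketch checks that the sample values of a column entry parse under the declared format. It relies only on Python's standard library; the function name `validate_samples` is hypothetical, and the thesis does not state that such a check formed part of the pipeline.

```python
from datetime import datetime

# Column entry mirroring the 'timestamp' column of the dataset summary.
timestamp_column = {
    "name": "timestamp",
    "type": "object",
    "simplified_type": "datetime",
    "description": "Timestamp of the data record in UTC.",
    "sample_values": [
        "2019-12-06 21:51:02 UTC",
        "2019-11-29 14:00:02 UTC",
        "2019-09-20 13:21:03 UTC",
    ],
    "datetime_format": "%Y-%m-%d %H:%M:%S %Z",
}


def validate_samples(column):
    """Parse every sample value with the column's declared format.

    Raises ValueError if a sample does not match the format.
    """
    fmt = column["datetime_format"]
    return [datetime.strptime(value, fmt) for value in column["sample_values"]]


parsed = validate_samples(timestamp_column)
```

A summary whose samples fail this check would hand the model a format string that does not describe the data, which is the kind of mismatch the uniform summary format is meant to avoid.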
In summary, the prompt is designed to guide the model toward producing only Python
Spark commands, ensuring that the output is directly executable within the Spark en-
vironment for data processing tasks. The focus is placed on clarity, precision, and the
exclusion of unnecessary text. Additional instructions are included to help the model
handle errors or ambiguous input effectively. This strategy enhances the quality of inter-
action with the language model, contributing to more efficient and reliable code genera-
tion. The described prompt message functions as a key component in the communication
flow between the language model and the Spark-based data management and analysis
platform.
3.4.3 Dataset Queries
As described in section 3.3, the queries developed for this study are grouped into
three categories: Basic, Intermediate, and Advanced. A total of 25 queries were created,
five for each of the five datasets. Each dataset includes one Basic, two Intermediate, and
two Advanced queries. This classification is based on the complexity of the operations
and data transformations required to answer each query. Due to the lack of a universally
accepted standard for defining query difficulty, existing research was reviewed to support
the development of a safe and practical three-level classification system [130] [131].
While this categorization involves some degree of subjectivity, it was refined through an
iterative process to ensure that each level accurately reflects a distinct degree of challenge
for the language model. As noted earlier, to mitigate potential bias, each query is executed
ten times, and its performance is assessed using a set of evaluation metrics. These
include functional correctness, readability, efficiency, contextual relevance, automation
capabilities, and error handling. This structured methodology aligns with established
experimental practices in the field, where tasks are grouped by complexity to allow for a
more comprehensive evaluation of system performance [132] [133]. The complete list of
queries, organized by dataset, is presented below:
For the “Netflix: Movies and TV Shows” dataset, the following queries were designed
to evaluate different aspects of the LLM’s code generation capabilities:
1. “Count the Number of TV Shows per Country.” (Basic): This query evaluates whether
the LLM can correctly produce code for group-by operations and basic aggregation.
2. “Find the Average Duration of only the Movies, in minutes.” (Intermediate): This
task checks the LLM’s ability to perform arithmetic calculations and manage mixed
data types, such as extracting numerical values from text.
3. “Return the Top 5 Most Frequent Genres for Movies only, Released After 2010.”
(Intermediate): This query involves filtering, counting, and sorting operations, along
with managing multiple values stored within a single column, adding moderate
complexity.
4. “Identify the top five directors who have worked with the greatest number of different
actors.” (Advanced): This query tests the LLM’s ability to handle more complex data
processing, including data transformation, aggregation, and sorting, to find the
most collaborative directors.
5. “Provide the top 10 most busy actors in solely American Movies, from 1995 on-
wards.” (Advanced): This query involves filtering American movies from 1995 on-
wards, transforming the cast column to separate actor names, and aggregating the
data to identify the ten most frequently appearing actors.
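As an illustration of what the second (Intermediate) query demands, the sketch below reproduces its logic in plain Python on a few made-up rows: keep movies only, extract the leading number from a textual duration, and average it. The field names `type` and `duration` and the sample values are assumptions; in PySpark, the extraction step would typically rely on a function such as `regexp_extract`.

```python
import re

# Made-up rows; the field names are assumed for illustration.
rows = [
    {"type": "Movie", "duration": "90 min"},
    {"type": "Movie", "duration": "120 min"},
    {"type": "TV Show", "duration": "2 Seasons"},
    {"type": "Movie", "duration": None},
]


def average_movie_minutes(records):
    """Average duration in minutes over movies with a parseable duration."""
    minutes = []
    for record in records:
        if record["type"] != "Movie" or not record["duration"]:
            continue
        match = re.match(r"\s*(\d+)\s*min", record["duration"])
        if match:
            minutes.append(int(match.group(1)))
    return sum(minutes) / len(minutes) if minutes else None


average = average_movie_minutes(rows)
```

The mixed-type handling is what places this query above the Basic level: a model that averages the raw text column, or forgets to exclude TV shows measured in seasons, produces code that runs but answers a different question.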
As for the “COVID-19 Twitter” dataset, the following queries were formulated to assess
different capabilities of the LLM in code generation:
1. “Determine the Number of Tweets Containing User Mentions.” (Basic): This query
evaluates the LLM’s ability to generate code that filters rows based on non-null
values in the ‘user_mentions’ column, and performs a count.
2. “Calculate the top 7 authors with the highest retweet count.” (Intermediate): This
query involves grouping data by the original author, summing retweet counts, and
identifying the top 7 authors with the highest totals.
3. “Analyze the Daily Tweet Volume Over Time.” (Intermediate): This query checks
whether the LLM can produce code that groups tweets by date, and counts the
number of tweets per day (thus performing temporal analysis).
4. “Provide the names of the top 5 users mentioned in tweets, every month (also include
the corresponding month and year).” (Advanced): This query requires code that
converts timestamps to extract month and year, expands the list of user mentions,
groups by month and mentioned user, counts the mentions, and selects the top 5
users for each month.
5. “Provide the top 5 weeks (and their years) with the most dense tweets posted, in
terms of total clean words included” (Advanced): This query tests the LLM’s ability
to extract week and year from timestamps, compute word counts for each tweet,
group the data accordingly, and identify the top 5 weeks with the highest total word
count.
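The fourth (Advanced) query combines list expansion with a per-group ranking. The plain-Python sketch below shows the intended computation on made-up tweets; the tuple layout and the user names are assumptions. In PySpark, the same steps would usually be expressed with `explode` and a window function such as `row_number`.

```python
from collections import Counter, defaultdict

# Made-up tweets: (year, month, list of mentioned users).
tweets = [
    (2020, 3, ["who", "cdc"]),
    (2020, 3, ["who"]),
    (2020, 4, ["cdc"]),
    (2020, 4, []),
]


def top_mentions_per_month(rows, top_n=5):
    """Map each (year, month) to its most mentioned users and their counts."""
    counts = defaultdict(Counter)
    for year, month, mentions in rows:
        # One count per mention: the equivalent of expanding the list column.
        counts[(year, month)].update(mentions)
    return {
        period: counter.most_common(top_n)
        for period, counter in sorted(counts.items())
        if counter
    }


monthly_top = top_mentions_per_month(tweets)
```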
Regarding the “Shared Cars Locations” dataset, the queries crafted are the following:
1. “Filter Locations with More Than 3 Cars.” (Basic): This query evaluates the LLM’s
ability to generate code for simple filtering, based on numerical thresholds.
2. “Find the Top 5 Locations with the Most Cars Recorded.” (Intermediate): This query
involves computing the total number of cars per location and ranking them to iden-
tify the top five.
3. “List 100 Records at most, from December 2019 to January 2020.” (Intermediate):
This query checks the LLM’s handling of date-based filtering using a defined time
range, along with limiting the output to 100 entries.
4. “Provide the number of the most dense week, and year, in terms of total cars parked,
along with the number of total cars.” (Advanced): This query assesses the LLM’s
ability to process timestamps, extract week and year values, perform weekly aggre-
gations of parked cars, and identify the week with the highest total.
5. “Find the total number of (unique) cars that visited each location.” (Advanced): This
query tests the LLM’s capacity to preprocess data, expand lists of car identifiers,
group data by location, and calculate the number of unique cars associated with
each one.
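For the fourth (Advanced) query, the essential steps are timestamp parsing, extraction of week and year, weekly aggregation, and selection of the maximum. A minimal plain-Python sketch on made-up records follows; the record layout is an assumption, and ISO week numbering is used here as one possible convention. PySpark offers `weekofyear` and `year` for the extraction step.

```python
from collections import Counter
from datetime import datetime

# Made-up records: (timestamp string, number of cars parked).
records = [
    ("2019-12-06 21:51:02 UTC", 3),
    ("2019-12-07 09:00:00 UTC", 2),
    ("2019-11-29 14:00:02 UTC", 4),
]
FMT = "%Y-%m-%d %H:%M:%S %Z"


def densest_week(rows):
    """Return (ISO year, ISO week, total cars) for the week with most cars."""
    totals = Counter()
    for stamp, cars in rows:
        iso = datetime.strptime(stamp, FMT).isocalendar()
        totals[(iso[0], iso[1])] += cars  # key: (ISO year, ISO week)
    (year, week), total = max(totals.items(), key=lambda item: item[1])
    return year, week, total


result = densest_week(records)
```

Grouping on the pair of year and week, rather than on the week number alone, matters for a dataset that spans a year boundary, as the December 2019 to January 2020 range in the third query suggests this one does.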
The “Madrid Daily Weather” dataset has been evaluated using the following
queries:
1. “Filter Days of 2006 with Max Temperature Above 30 °C.” (Basic): This query tests
the LLM’s ability to generate code for basic filtering, based on numerical temperature
values.
2. “Count the Number of Foggy Days per Year” (Intermediate): This query evaluates
the LLM’s capacity to extract the year from a date column, identify foggy days by
checking for the term “Fog” in the events column, and group the data by year to
count the occurrences.
3. “Calculate the monthly average of mean wind speed and mean sea level pressure,
per year” (Intermediate): This query involves extracting month and year from the
date, grouping the data accordingly, and calculating average values for wind speed
and sea level pressure.
4. “Analyze the Yearly Variation in Average Humidity and Identify the Top 5 Years
with the Highest Increase in Average Humidity Compared to the Previous Year.”
(Advanced): This query requires code that extracts the year, computes the yearly
average humidity, calculates the change from one year to the next, and identifies
the five years with the greatest increase.
5. “Analyze the Monthly Variation in Temperature Range and Identify the Top 3 Months
with the Highest Average Range.” (Advanced): This query requires code that
calculates the daily temperature range, groups the data by month, computes the
monthly averages, and determines the three months with the highest average range.
It thus tests the LLM’s ability to produce code for handling complex calculations,
group-by operations, and sorting.
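The fourth (Advanced) query above chains three steps: a yearly aggregation, a comparison of each year with its predecessor, and a ranking. The sketch below expresses them in plain Python on made-up values; the record layout is an assumption, and adjacent years in the data are treated as consecutive. In PySpark, the comparison step is commonly written with the `lag` window function.

```python
from collections import defaultdict

# Made-up daily records: (year, mean humidity in %).
daily = [
    (2000, 60), (2000, 64),
    (2001, 70), (2001, 74),
    (2002, 65), (2002, 67),
    (2003, 80), (2003, 82),
]


def top_humidity_increases(records, top_n=5):
    """Return (year, increase) pairs ranked by rise in yearly average humidity."""
    by_year = defaultdict(list)
    for year, humidity in records:
        by_year[year].append(humidity)
    yearly_avg = {y: sum(v) / len(v) for y, v in by_year.items()}
    years = sorted(yearly_avg)
    # Pair each year with the previous available year and take the difference.
    changes = [
        (curr, yearly_avg[curr] - yearly_avg[prev])
        for prev, curr in zip(years, years[1:])
    ]
    return sorted(changes, key=lambda item: item[1], reverse=True)[:top_n]


ranking = top_humidity_increases(daily)
```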
When it comes to the “Supermarket Sales” dataset, the five queries crafted are pre-
sented below:
1. “Count the Number of Sales per Product Line” (Basic): This query tests the LLM’s
ability to generate code for grouping by product line, and applying an aggregate
function to count the total number of sales.
2. “Find the Average Rating for Each Payment Method” (Intermediate): This query
involves grouping the data by payment method and calculating the average rating
for each group, introducing slightly more complexity.
3. “Find the average quantity of Electronic accessories purchased by card, for each city”
(Intermediate): This query assesses the LLM’s capability to filter data for electronic
accessories bought with a credit card, group it by city, and compute the average
purchase quantity per group.
4. “Analyze the Sales Performance by Branch and Payment Method” (Advanced): This
query requires the generation of code that groups data by both branch and payment
method, calculates total sales, and computes average ratings, thereby testing multi-
level grouping and multiple aggregation operations.
5. “Calculate the Correlation Between Unit Price and Quantity Sold for Each Prod-
uct Line” (Advanced): This query evaluates the LLM’s ability to perform statistical
analysis within grouped data, by calculating the correlation between unit price and
quantity for each product line.
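To clarify what the fifth (Advanced) query computes, the following plain-Python sketch calculates the Pearson correlation between unit price and quantity within each product line. The sample rows are made up and the tuple layout is an assumption; PySpark exposes the same statistic through the `corr` aggregate function.

```python
from collections import defaultdict
from math import sqrt

# Made-up sales: (product line, unit price, quantity).
sales = [
    ("Electronic accessories", 10.0, 1),
    ("Electronic accessories", 20.0, 2),
    ("Electronic accessories", 30.0, 3),
    ("Food and beverages", 10.0, 3),
    ("Food and beverages", 20.0, 2),
    ("Food and beverages", 30.0, 1),
]


def pearson(xs, ys):
    """Pearson correlation coefficient; None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_by_line(records):
    """Group prices and quantities by product line, then correlate each group."""
    groups = defaultdict(lambda: ([], []))
    for line, price, quantity in records:
        groups[line][0].append(price)
        groups[line][1].append(quantity)
    return {line: pearson(xs, ys) for line, (xs, ys) in groups.items()}


result = correlation_by_line(sales)
```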
By reviewing the outlined set of queries, it becomes evident that task complexity
increases progressively, moving from basic data profiling to more advanced analytical
operations. The initial queries require the LLMs to perform simple code generation tasks,
while later ones demand more intricate logic and processing. Consequently, Codestral
and Qwen will be assessed based on their capability to generate code for a range of
data analysis tasks, all provided in natural language. Each model will be evaluated
independently. This approach aims to examine not only their effectiveness in handling
various levels of task complexity, but also their ability to adapt to different analytical
requirements.
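The size of the resulting experiment can be made explicit with a short enumeration. The sketch below assumes that every query listed above is executed ten times for each of the two models; the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

DATASETS = [
    "Netflix: Movies and TV Shows",
    "COVID-19 Twitter",
    "Shared Cars Locations",
    "Madrid Daily Weather",
    "Supermarket Sales",
]
# Difficulty of queries 1-5 within each dataset, as listed above.
LEVELS = ["Basic", "Intermediate", "Intermediate", "Advanced", "Advanced"]
MODELS = ["Codestral", "Qwen"]
RUNS_PER_QUERY = 10

plan = [
    {"model": m, "dataset": d, "query_no": i + 1, "level": LEVELS[i], "run": r + 1}
    for m, d, i, r in product(MODELS, DATASETS, range(5), range(RUNS_PER_QUERY))
]

runs_per_model = sum(1 for entry in plan if entry["model"] == "Codestral")
```

Under these assumptions each model produces 250 responses (25 queries, ten runs each), of which 50 belong to Basic queries and 100 each to Intermediate and Advanced queries.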
Chapter 4
Testing and Results: Assessing the Study’s Performance
4.1 Validating the Scalable Capabilities of the Underlying Infrastructure
This thesis’s scalable data management layer has been structured to complete the
entire workflow within a short timeframe, which depends on the size of the incoming
datasets. The framework has been validated using the architecture described in
section 3.2. Three main datasets were employed for
the evaluation process. The first is a small tabular dataset consisting of approximately
31,200 records, selected to demonstrate the layer’s efficiency in terms of processing speed
and data storage. It is important to note, however, that this dataset does not qualify as
“big data”. To address this, a second main dataset was utilized, comprising 64.1 million
JSON objects and totaling 55.4 GB in size. This dataset, titled “urn-ngsi-ld-ITI-Customs”,
contains records of items processed through the customs office at the port of Valencia,
Spain over specific time periods. In essence, each JSON object in the dataset represents
information about a particular item that passed through customs. The structure of a
typical JSON object from this dataset is illustrated in Figure 4.1 below:
Figure 4.1. A JSON object from the “urn-ngsi-ld-ITI-Customs” dataset, containing informa-
tion for a given port’s customs item [1].
To thoroughly assess the performance of the scalable data management layer, exper-
iments are carried out using both the full Customs dataset and three of its subsets. The
first subset includes data only from November 2022, and is approximately 1.4 GB in size.
The second subset consists of 6 million records, totaling 5.2 GB, while the third contains
11 million records with a size of 9.5 GB. Finally, the layer is evaluated using the complete
Customs dataset, which comprises 64.1 million records and has a total size of 55.4 GB
(as previously outlined). An overview of these Customs datasets used in the evaluation is
presented in the screenshots below:
To further evaluate the scalable data management layer, an additional main dataset
from the critical infrastructure Communications sector was used. Specifically, the telecom-
munications provider OTE Group [134] supplied three subsets, each covering a full year
from 2019 to 2021, related to user mobility within the cellular network in the region
of Thessaloniki, Greece. As a result, the proposed framework is tested across both the
Transportation sector (ports) and the Communications sector (telecommunications). All
three subsets underwent thorough anonymization by OTE’s data engineering team. Each
JSON document in the subsets represents a specific instance of a user’s movement from
one network cell to another. Along with mobility data, the records include the number of
incoming and outgoing text messages and phone calls. They also include contextual de-
tails, such as whether the timestamp falls on a public holiday, and the number of distinct
users present in the current cell at that moment. An example JSON document is shown
in Figure 4.2.
Each of the three original subsets is labeled as “urn-ngsi-ld-OTE-MobilityData-XXXX”,
where “XXXX” corresponds to the respective year (2019, 2020, and 2021). The 2019
dataset includes 6.9 million documents and occupies 2.8 GB of disk space. The 2020
dataset contains 10.1 million documents and has a size of 4.1 GB, while the 2021 dataset
consists of 5.7 million documents totaling 2.4 GB. The main dataset has been created
by merging the three annual subsets. This combined dataset is used for performance
evaluation, allowing the scalable data management layer to be tested on a larger data
volume. It contains 22.8 million documents and requires 9.2 GB of disk space. Details
regarding the size and structure of each dataset are illustrated in the screenshots below,
following Figure 4.2.
Figure 4.2. A JSON object that contains information for a random, anonymized OTE cellular
network user in the area of Thessaloniki, Greece [1].
The testing process began with the first main dataset, consisting of 31,200 documents.
Since the layer operates within an Apache Spark Cluster environment, the computing
resources it utilizes and the corresponding results can be monitored through the
cluster’s WebUI. Figure 4.3 presents an overview of the Spark driver’s ID, the execution
timestamp, and the resources allocated to the driver during the run of the Pre-Processing
and Filtering Tool. As explained in section 3.2, this tool serves as the initial subcomponent
of the layer, responsible for processing, filtering, cleaning, and storing the dataset.
Figure 4.3. Screenshot from the scalable data management layer’s Apache Spark Cluster
WebUI [1].
As shown in Figure 4.3, the driver identified as “driver-20230301132908-0011” was
responsible for loading and executing the Pre-Processing and Filtering Tool, which han-
dled the processing, filtering, cleaning, and storing of the dataset. The driver is automat-
ically initiated through an API call, which can be triggered either by a software engineer,
or by an automated software agent, enabling a seamless and continuous workflow. Each
subcomponent or software module subsequently activates the next step in the pipeline.
For this specific task, the Spark cluster driver utilized 6 CPU cores and 5 GB of RAM. The
total execution time is available in the driver’s logs, as shown in Figure 4.4.
Figure 4.4. Screenshot from Spark cluster driver’s logs [1].
The log file’s title confirms that the information pertains to the relevant Spark driver.
Near the end of the log, an entry indicates the total execution time required by the Pre-
Processing and Filtering Tool to complete its tasks (processing, filtering, cleaning, and
storing the dataset). The entire process was completed in 0.76 minutes, approximately
45 seconds. This brief duration is notable given the range of operations performed by the
tool. Following this process, the dataset is securely stored and fully cleaned within the
Virtual Data Repository, making it readily accessible for any authorized data consumer.
Subsequently, testing continued with the Customs datasets, starting with the smaller
subset containing data from November 2022. As in the previous case, the Spark cluster’s
WebUI is used to monitor the operation of the layer during execution. As illustrated
in Figure 4.5, the system automatically generated a driver with the ID “driver-
20230901183325-0000” to load the Pre-Processing and Filtering Tool and carry out the
processing, filtering, cleaning, and storage of the Customs dataset’s subset. According
to Figure 4.6, the tool successfully completed the entire procedure and stored the subset
in the Virtual Data Repository within 2.1 minutes. Considering the dataset’s size (1.4
GB) and approximately 1.4 million records, this indicates that the layer’s subcomponent
performed the task efficiently and reliably.
Figure 4.5. A screenshot from the Apache Spark Cluster’s WebUI, for the November 2022
subset [1].
Figure 4.6. Screenshot from the logs of the Spark driver “driver-20230901183325-0000”,
for the November 2022 subset [1].
The next evaluation involves the subset containing 6 million records, from the com-
plete 64 million-object Customs dataset. As before, the process is monitored via the
Spark cluster WebUI. In this case, the Pre-Processing and Filtering Tool is tasked with
applying all previously described operations to a subset of 5.2 GB in size. The Spark
driver responsible for executing this job is identified as “driver-20230901184317-0001”,
as shown in Figure 4.7 below.
Figure 4.7. A screenshot from the Spark Cluster’s WebUI, for the Customs subset contain-
ing 6 million objects [1].
The entire process, executed by the driver responsible for the operation of the Pre-
Processing and Filtering Tool, was completed in 5.24 minutes, with the subset success-
fully stored in the Virtual Data Repository, as illustrated in Figure 4.8. Similar to the
previous test involving the November 2022 subset, the framework demonstrates a no-
tably short execution time, highlighting its efficiency in handling larger datasets, given
the computational resource constraints.
Figure 4.8. A screenshot from “driver-20230901184317-0001” Spark cluster driver’s logs,
for the 6 million subset [1].
Before proceeding to the full Customs dataset, the final subset comprising 11 mil-
lion JSON objects with a total size of 9.5 GB is evaluated. To process this subset,
the Spark cluster automatically generated the driver “driver-20230901185350-0002”, as
shown in Figure 4.9. Assuming the system operates without performance bottlenecks or
disruptions, the Pre-Processing and Filtering Tool is expected to complete the task within
a relatively short time frame.
Figure 4.9. A screenshot from Spark Cluster’s WebUI, for the Customs subset comprising
11 million objects [1].
As shown in the Spark driver’s logs in Figure 4.10, the Pre-Processing and Filtering
Tool successfully completed all required operations on the subset and stored the results
within approximately 10.5 minutes. This outcome further supports the observation that
the framework’s execution time increases in a linear manner relative to the size and
volume of the input dataset.
Figure 4.10. A screenshot from the logs of the Spark driver “driver-20230901185350-
0002”, for the 11 million subset [1].
The next evaluation of the scalable data management layer involved the complete
Customs dataset, consisting of 64.1 million JSON objects and totaling 55.4 GB in size.
Upon activation through the previously described API call, the Spark cluster initiated
the process by generating the driver “driver-20230901191216-0003”, as shown in Figure
4.11. The WebUI confirms that the processing task was successfully launched and is
actively running.
Figure 4.11. A screenshot from the Spark Cluster’s WebUI, for the complete Customs
dataset [1].
As in previous cases, the Pre-Processing and Filtering Tool is responsible for process-
ing, filtering, cleaning, and ultimately storing the dataset in the Virtual Data Repository.
The progress of this operation, along with the final completion time, is available in the
driver’s logs, as illustrated in Figure 4.12.
Figure 4.12. A screenshot from the logs of the Spark driver “driver-20230901191216-0003”,
for the complete Customs dataset [1].
The Pre-Processing and Filtering Tool successfully completed the entire process in
under one hour, requiring 58.48 minutes to receive, process, and store the full Customs
dataset in the Virtual Data Repository. Once again, the results support the observation
that processing time scales linearly with the size of the input data.
Following this, the layer was evaluated using the OTE Group Mobility datasets from
2019 to 2021, covering the region of Thessaloniki, Greece. The evaluation began with the
2019 dataset. As shown in Figure 4.13, the driver “driver-20231004114021-0000” was
responsible for executing the Pre-Processing and Filtering Tool on this dataset.
Figure 4.13. A screenshot from the Spark Cluster’s WebUI, for the Mobility subset of 2019
with 6.9 million objects [1].
The layer completed the processing, filtering, cleaning, and storage of the dataset in
approximately 3.7 minutes, as indicated in the driver’s logs (Figure 4.14). This means that
2.8 GB of data, corresponding to 6.9 million documents, was successfully parsed, pro-
cessed, and stored in under four minutes, which is a notably short duration considering
the complexity and number of tasks involved.
Figure 4.14. A screenshot from the logs of the Spark driver “driver-20231004114021-
0000”, regarding the Mobility 2019 subset [1].
Next, the layer was tested using the 2020 Mobility subset. The driver assigned to
execute the entire process was “driver-20231004114951-0001”. As in previous cases,
the driver was automatically launched via an API call, which triggers the Spark
job and initiates the framework’s operation. Figure 4.15 presents the corresponding status
of the Spark cluster’s WebUI.
Figure 4.15. The Spark Cluster’s WebUI, depicting the driver “driver-20231004114951-
0001” in running state, for the 2020 Mobility subset [1].
According to the driver’s logs (Figure 4.16), the 2020 Mobility subset comprising
10.1 million documents and totaling 4.1 GB was successfully preprocessed, filtered,
cleaned, and stored within 5.3 minutes. This is considered a highly reasonable processing
time. The results suggest that the scalable data management layer maintains consistent
efficiency regardless of the dataset’s specific content. However, it is important to note
that this observation currently applies to tabular-style datasets, which are commonly
produced across various critical infrastructure sectors and business organizations.
Figure 4.16. A screenshot from the “driver-20231004114951-0001” Spark driver’s logs,
for the Mobility 2020 subset [1].
Regarding the 2021 Mobility subset, it is the smallest among the Mobility sets, com-
prising approximately 5.7 million records and occupying 2.4 GB of storage space, as
previously mentioned. As usual, the set was processed using the Pre-Processing and
Filtering Tool through the execution of a specific Spark driver, identified as “driver-
20231004115659-0002”, which is illustrated in Figure 4.17.
Figure 4.17. A screenshot from the Spark Cluster’s WebUI, where the driver “driver-
20231004115659-0002” is depicted in active state, for the Mobility 2021 subset [1].
The process was completed in approximately 3.5 minutes, once again demonstrating
the efficiency of the scalable data management layer (Figure 4.18). The Mobility
subsets were tested in chronological order, rather than in order of increasing size
as was done with the Customs subsets. This sequence was chosen
to reflect the order in which the datasets were provided by OTE Group.
Figure 4.18. A screenshot from the “driver-20231004115659-0002” Spark driver’s logs,
which handled the Mobility subset of 2021 [1].
Finally, the scalable data management layer was evaluated using the custom fused
Mobility dataset, which combines all three previously utilized datasets. As a reminder,
this combined dataset contains 22.8 million records and occupies 9.2 GB of disk space.
The Pre-Processing and Filtering Tool was applied to the dataset via a Spark driver named
“driver-20231004120403-0003”, as shown in Figure 4.19.
Figure 4.19. A screenshot from the Spark Cluster’s WebUI, where the driver “driver-
20231004120403-0003” is depicted in active state, for the complete Mobility dataset [1].
According to the driver’s logs shown in Figure 4.20, the entire process was completed
in less than 11 minutes (specifically, 10.78 minutes). This outcome further underscores
the operational efficiency of the proposed scalable layer, and suggests that the framework
remains effective regardless of the source of the critical infrastructure or business domain
dataset.
Figure 4.20. A screenshot from the “driver-20231004120403-0003” Spark driver’s logs,
after it completed its operation on the complete Mobility dataset [1].
With this, the performance evaluation phase is concluded. The results clearly indicate
that processing time increases linearly with the size of the input data. This trend is
illustrated in Figures 4.21, 4.22, and 4.23, which present the findings using three scatter
plots. The first plot displays the processing time (in minutes) for each of the ITI Customs
and OTE Mobility datasets, represented by separate lines, in relation to their respective
sizes (in GBs). The second plot consolidates all tests into a single line. It is important to
note that the final Customs dataset, which is 55.4 GB in size, is excluded from the first
and second plots, since its inclusion would drastically affect the depiction of the other
sets’ smaller values. The third and final plot includes the final Customs dataset as well.
All three visualizations are provided below.
Figure 4.21. Scatter plot with the tests of ITI Customs and OTE Mobility subsets, in
different plot lines [1].
Figure 4.22. Scatter plot including all tests, in the same plot line [1].
Figure 4.23. Scatter plot including the final ITI Customs dataset [1].
The linear relationship between completion time and input size is evident in all three
plots. More specifically, approximately 1 GB of data is processed, filtered, cleaned, and
stored per minute: as the dataset size (in GB) increases, the completion time (in minutes)
increases proportionally. This rate of “1 GB per minute” had also been set as the
“acceptance threshold” for a successful implementation, a goal that was achieved.
Furthermore, the scalable data management layer experienced no performance degradation
and completed all tasks and tests; regardless of the input size, the system remained
operational. Additionally, each Spark driver utilized only 5 GB of RAM and 6 CPU cores
from the system, underscoring the efficiency of the proposed architecture.
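As a rough illustration of the rate discussed above, the throughput can be recomputed from the two Mobility runs whose sizes and completion times are quoted in this section. The sketch below is only a back-of-the-envelope check; the helper name is hypothetical, and the 3.5-minute figure is itself approximate in the text.

```python
# Sizes (GB) and completion times (minutes) quoted in the text for two Mobility runs.
runs = {
    "Mobility 2021": (2.4, 3.5),       # completion time reported as "approximately 3.5 minutes"
    "Mobility (fused)": (9.2, 10.78),
}


def throughput_gb_per_min(size_gb, minutes):
    """Average amount of data handled per minute (hypothetical helper)."""
    return size_gb / minutes


for name, (size_gb, minutes) in runs.items():
    rate = throughput_gb_per_min(size_gb, minutes)
    print(f"{name}: {rate:.2f} GB/min")
```

For these two runs the rate comes out somewhat below 1 GB per minute, which is in line with the "approximately 1 GB per minute" reading of the plots rather than an exact match.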
4.2 Evaluating the Final System
Testing of the complete architecture was performed on a single physical machine to
ensure consistency and eliminate any potential hardware variability. Each of the 250
tests was closely monitored, and its generated results were carefully evaluated. Each
query was executed exactly 10 times, with a few exceptions caused by minor human errors
that led to the premature termination of some tests. For instance, there were occasions
when the author of this thesis, who was responsible for the experiments, inadvertently
failed to change the dataset in the test’s initial settings, applying a natural language
query intended for a different dataset. Additionally, some tests were repeated without
updating the test iteration number, resulting in new results overwriting the previous ones.
The generated code was executed on the retrieved dataset using Python’s exec()
function, with the Processing Platform’s code specifically looking for a “processeddf”
variable, where the LLM was instructed to store the final analysis results. Intervention
in the LLM-generated code occurred only when the model’s response included text not
formatted as Python comments, or when the model failed to store the final analysis results
in the designated “processeddf” variable, despite the rest of the code being correct and
functional.
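The execution step described above can be sketched as follows. This is a minimal illustration, not the Processing Platform's actual code: the function name and the error-message format are assumptions, and a plain Python list stands in for the retrieved dataset so that the example runs without Spark.

```python
def run_generated_code(code, df):
    """Execute LLM-generated code and return (result, error_message).

    Hypothetical helper: the generated code is expected to read `df` and to
    store its final output in a variable named `processeddf`.
    """
    namespace = {"df": df}
    try:
        exec(code, namespace)
    except Exception as exc:  # record the failure instead of aborting the test run
        return None, f"{type(exc).__name__}: {exc}"
    if "processeddf" not in namespace:
        return None, "generated code did not define 'processeddf'"
    return namespace["processeddf"], "None"


# A list stands in for the dataset in this sketch.
generated = "# keep values above 1\nprocesseddf = [x for x in df if x > 1]"
result, error = run_generated_code(generated, [1, 2, 3])
print(result, error)
```

Free text that is not formatted as a Python comment makes exec() raise a SyntaxError, and a missing "processeddf" variable is reported separately; these are the two situations in which manual intervention is described above.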
4.2.1 Physical Machine Specifications
The complete system specifications of the physical machine used to run the LLM
server are provided below. While the machine is powerful, it has certain limitations
when it comes to offline LLM testing, particularly with models like Codestral and Qwen.
Specifically, these models cannot be fully loaded onto the GPU, meaning that both the
GPU and CPU are utilized during the code generation process. Although these hardware
limitations affect speed and efficiency, the primary goal of this section of the thesis is to
assess the capabilities of Codestral and Qwen in code generation as offline LLMs. While
upgrading to more powerful hardware could improve the speed of the code generation
process, such a consideration is beyond the scope of this thesis, and could be addressed
in future research.
Server Hardware Specifications
System Information:
    Motherboard: TUF GAMING X570-PLUS
    Processor: AMD Ryzen 7 5800X 8-Core Processor
    Memory: 32 GiB DDR4
    Storage:
        * 1TB NVMe SSD (Samsung SSD 970 EVO Plus)
        * 4TB HDD (ST4000DM004-2U91)
    GPU: NVIDIA GeForce RTX 3080 with 10 GiB of dedicated memory
Software Information:
    LLM Server: LM Studio 0.2.21
    Large Language Models:
        * Codestral v01 22B Q6_K
        * Qwen 2.5 Coder Instruct 14B Q6_K
    Partial GPU Offload:
        * 17 layers for Codestral
        * 25 layers for Qwen
    LLM Response Temperature: 0.7
The Codestral v01 22B Q6_K and Qwen 2.5 Coder Instruct 14B Q6_K models were selected
from the Hugging Face hub [135] for use in this study. Both models were deployed using
the LM Studio Server [136], which supports the deployment of multiple offline LLMs.
This capability, along with the fact that the architecture of this study is not tied to a
specific model, supports the view that the proposed methodology is model-agnostic and
can be generalized across various offline LLM platforms. This was demonstrated through
evaluations using both Codestral and Qwen (see 4.2.3). The choice of Q6_K model versions
was driven by their efficient performance, enabled by 6-bit quantization. These quantized
models reduce memory usage and computational demands without substantially compromising
accuracy, making them well-suited for offline environments where computational
resources are often limited. The Q6_K versions’ ability to maintain high performance
while being resource-efficient makes them a suitable choice for the development context
of this study.
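A back-of-the-envelope estimate helps explain why neither model fits entirely into the 10 GiB of GPU memory listed above. The sketch below rests on stated assumptions: the nominal parameter counts (22B and 14B) are used as-is, the figure of 6.5625 bits per weight is the value commonly quoted for the Q6_K format (including its block scales), and memory for the context cache and activations is ignored.

```python
GIB = 1024 ** 3
BITS_PER_WEIGHT_Q6_K = 6.5625   # assumption: commonly quoted Q6_K cost, incl. block scales
GPU_MEMORY_GIB = 10             # RTX 3080, from the specification above


def weights_size_gib(n_params, bits_per_weight=BITS_PER_WEIGHT_Q6_K):
    """Approximate size of the quantized weights alone, in GiB (hypothetical helper)."""
    return n_params * bits_per_weight / 8 / GIB


# Nominal parameter counts taken from the model names.
for name, n_params in {"Codestral 22B": 22e9, "Qwen 2.5 Coder 14B": 14e9}.items():
    size = weights_size_gib(n_params)
    print(f"{name}: ~{size:.1f} GiB of weights vs {GPU_MEMORY_GIB} GiB of GPU memory")
```

Under these assumptions both sets of weights exceed the available GPU memory, with the smaller model much closer to fitting, which is consistent with the partial GPU offload of 17 and 25 layers reported in the specification.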
The LLMs’ response temperature was set to 0.7. This choice was made to strike a
balance between creativity and reliability, which is crucial when assessing
a model’s code generation capabilities. Moderate temperature settings, such as 0.7,
allow the model to produce diverse outputs, while still maintaining sufficient coherence to
ensure the generated code is functional and relevant. This setting avoids overly
deterministic responses that could limit the exploration of alternative coding solutions,
enabling more flexibility in the generated results [137].
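The effect of the temperature setting can be illustrated with the standard temperature-scaled softmax used in token sampling. The sketch below is generic and does not reproduce LM Studio's internals; the three logit values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by the sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                     # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token (more deterministic output), whereas values closer to 1.0 leave more room for alternatives; 0.7 sits between the two.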
4.2.2 Testing Results Collection
Each test produced a set of data, which was structured into a single result object,
as shown in Listing 4.1. For each of the two LLMs, 250 of these objects were created
and compiled into a dataset with 250 entries, where each entry represents one test case.
Every entry includes detailed information about the test, covering aspects such as the
accuracy, clarity, and execution performance of the generated code, along with system
resource usage metrics. The following bullet points describe each attribute in detail,
providing a thorough summary of the collected data during testing.
* “correctness”: This column records either ‘True’ or ‘False’, indicating whether the
  generated code successfully produced the intended output based on the given natural
  language query.
* “readability”: This column presents numerical values that reflect how easily a human
  can understand the generated code. The scores are calculated using a custom
  function, ranging from 1 to 3, where 3 represents the highest readability.
* “code_execution_errors”: This column captures any errors encountered during the
  execution of the generated code. If no error occurred, the value is ‘None’. If an
  error is present, the column includes the corresponding error message to explain
  the issue.
* “executed_command”: This column contains the complete code that was executed
  for each test case. It may also include Python comments.
* “code_repetition_id”: This column indicates the repetition number for each test case.
  Since each query was tested ten times, this field takes values from 1 to 10, showing
  which iteration the entry corresponds to.
* “dataset”: This column lists one of the five datasets used in the study. Each entry
  names a single dataset: ‘supermarket’, ‘netflix’, ‘shared-cars-locations’,
  ‘covid19-twitter’, or ‘madrid-daily-weather’. The dataset named in each entry identifies
  which dataset was used for the corresponding test.
* “user_query”: This column holds the exact query entered by the user and forwarded
  to the offline LLM for code generation. The corresponding generated code for each
  query is available in the “executed_command” column.
* “llm_response_cpu”: This column shows the CPU usage percentages recorded during
  the LLM’s code generation process.
* “llm_response_memory”: This column reports the memory usage of the LLM server
  while generating code. The values are expressed as percentages.
* “llm_response_gpu”: This column reflects the GPU usage percentages observed on
  the LLM server during code generation.
* “llm_response_gpu_mem”: This column displays the percentage of GPU memory
  used by the LLM server during the code generation phase.
* “llm_response_response_time”: This column records the response times of the LLM
  server, measured from when the query was received to when the generated code
  was returned. The times are given in seconds.
* “automated”: This column indicates whether the code for each test was executed
  automatically or with minimal human intervention. A value of ‘True’ means full
  automation, while ‘False’ signifies semi-automated execution.
* “query_no”: This column identifies the query number for each test. Since five
  queries were created for each dataset, the possible values are ‘q1’, ‘q2’, ‘q3’, ‘q4’,
  and ‘q5’.
* “query_level”: This column reflects the complexity level of the query in each test.
  The values are ‘basic’, ‘intermediate’, or ‘advanced’, based on the query’s difficulty.
Listing 4.1: A testing result (JSON object) from the ‘Netflix’ dataset, and the basic query
‘Count the Number of TV Shows per Country’.
{
    "correctness": true,
    "readability": 3,
    "code_execution_errors": "None",
    "executed_command": "# Count the number of TV Shows per country\n\nprocesseddf = df.filter(df['type'] == \"TV Show\").groupBy('country').count()",
    "code_repetition_id": "7",
    "dataset": "netflix",
    "user_query": "Count the Number of TV Shows per Country",
    "llm_response_cpu": 34.45,
    "llm_response_memory": 0.01,
    "llm_response_gpu": 20.25,
    "llm_response_gpu_mem": 89.03,
    "llm_response_response_time": 19.41,
    "automated": "true"
}
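Since two failure modes mentioned earlier (a wrong dataset setting and overwritten iterations) concern the metadata of these result objects, a light consistency check on each object can be useful. The sketch below is illustrative only: the field names and dataset labels come from the attribute list above, while the function name and the checks themselves are assumptions, not part of the thesis's tooling.

```python
import json

# Field names as they appear in Listing 4.1; the checks are illustrative.
EXPECTED_FIELDS = {
    "correctness", "readability", "code_execution_errors", "executed_command",
    "code_repetition_id", "dataset", "user_query", "llm_response_cpu",
    "llm_response_memory", "llm_response_gpu", "llm_response_gpu_mem",
    "llm_response_response_time", "automated",
}
DATASETS = {"supermarket", "netflix", "shared-cars-locations",
            "covid19-twitter", "madrid-daily-weather"}


def check_result(raw_json):
    """Parse one result object and return a list of problems (hypothetical helper)."""
    record = json.loads(raw_json)
    problems = [f"missing field: {name}"
                for name in sorted(EXPECTED_FIELDS - set(record))]
    if record.get("readability") not in (1, 2, 3):
        problems.append("readability must be 1, 2 or 3")
    if record.get("dataset") not in DATASETS:
        problems.append("unknown dataset")
    return problems


sample = json.dumps({
    "correctness": True, "readability": 3, "code_execution_errors": "None",
    "executed_command": "processeddf = df.filter(df['type'] == 'TV Show')"
                        ".groupBy('country').count()",
    "code_repetition_id": "7", "dataset": "netflix",
    "user_query": "Count the Number of TV Shows per Country",
    "llm_response_cpu": 34.45, "llm_response_memory": 0.01,
    "llm_response_gpu": 20.25, "llm_response_gpu_mem": 89.03,
    "llm_response_response_time": 19.41, "automated": "true",
})
print(check_result(sample))
```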
4.2.3 Evaluation
The evaluation was carried out using the collected results and the dataset constructed
from them, based on the main criteria described in subsection 3.3. The analysis produced
a set of key insights and recommendations. A summary of these findings, along with a
discussion on potential future directions, is provided in section ??. It is worth noting
that the ‘Contextual Complexity’ metric was examined in relation to all other evaluation
metrics. This combined analysis offered a more complete understanding of query com-
plexity in relation to other metrics, along with its impact, leading to more meaningful
conclusions. Therefore, a separate subsection dedicated to Contextual Complexity was
not included.
Functional Correctness
This metric aims to evaluate how effectively the tested LLMs generate correct code in
response to user queries. It measures whether the output code produces the expected
results. The first step involves calculating the overall success rate, by comparing the
number of correct and incorrect outputs. Next, the analysis focuses on correctness across
different datasets and query complexity levels. These two additional analyses provide a
more detailed view of correctness, offering insight into how success rates vary depending
on specific aspects of the study. The results of all three tasks are presented below.
Figure 4.24. Functional correctness of the code generated by the LLMs [4].
Figure 4.24 shows that the number of correct outputs from the code generation and
execution process clearly exceeds the number of incorrect ones. In particular, Codestral
successfully produced correct results in 228 out of 250 tests, with only 22 tests failing to
meet expectations. This is an indication of its strong performance in generating accurate
code. Similarly, Qwen 2.5 Coder produced correct outputs in 223 tests, while 27 tests
resulted in incorrect results. This comparison indicates that both LLMs can effectively
interpret user queries and generate functioning code, with Codestral achieving a slightly
higher success rate. To ensure the reliability of the results, manual validation was
performed throughout the process. After each test, the generated output was checked
against the intended result of the original query. For
each case, Python scripts using the Pandas library were written to reproduce the expected
results, which were then compared with the outputs generated by the LLMs.
Figure 4.25 presents a comparison of correctness scores across all five datasets. The
models showed consistent performance, suggesting that they can produce accurate code
regardless of the dataset context. For Codestral, the lowest correctness rate was observed
with the ‘Shared Cars Locations’ dataset at 0.86 (43 out of 50 tests), followed by the
‘COVID19-Twitter’ dataset at 0.88 (44 out of 50). The ‘Madrid Daily Weather’, ‘Netflix’,
and ‘Supermarket Sales’ datasets each achieved a rate of 0.94 (47 correct outputs). In
contrast, Qwen 2.5 Coder displayed a slightly different distribution: the ‘Netflix’ dataset
had the lowest rate at 0.82 (41 out of 50 correct tests), followed by ‘COVID19-Twitter’ at
0.84 (42 correct). The ‘Madrid Daily Weather’ dataset reached a score of 0.92 (46 out
of 50 correct), while both the ‘Shared Cars Locations’ and ‘Supermarket Sales’ datasets
recorded the highest rate of 0.94 (47 correct tests each). These findings suggest that
both LLMs can consistently generate correct code across different datasets, with some
variation in performance depending on the dataset.
Figure 4.25. Plot depicting the functional correctness, by dataset [4].
Figure 4.26 presents the relationship between correctness and query complexity. For
Codestral, the lowest correctness rate is observed with advanced queries at 0.84 (84
out of 100 correct outputs), while intermediate queries reach a rate of 0.95 (95 out of
100), and basic queries achieve the highest performance at 0.98 (with only one incorrect
output in 50 tests). Qwen 2.5 Coder follows a similar pattern: correctness for advanced
queries stands at 0.83 (83 out of 100), rises to 0.91 for intermediate queries (91 correct
outputs, slightly below Codestral’s 0.95), and reaches 0.98 for basic queries (49 out of
50 correct). It is important to highlight that the number of tests for basic queries is half
the amount of those for advanced and intermediate queries. Nevertheless, both models
demonstrate a consistent trend: As query complexity increases, the likelihood of incorrect
outputs also rises. This indicates that both LLMs face more challenges when handling
advanced queries, which is an expected outcome, given the higher level of contextual and
logical complexity involved.
Figure 4.26. Plot for the functional correctness scores by the queries’ complexity levels [4].
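The per-level counts reported above can be aggregated to confirm that they add up to the overall figures (228 and 223 correct outputs out of 250). The short sketch below only restates numbers from the text; the helper names are hypothetical.

```python
# (correct, total) per query complexity level, as reported in the text.
results = {
    "Codestral": {"basic": (49, 50), "intermediate": (95, 100), "advanced": (84, 100)},
    "Qwen 2.5 Coder": {"basic": (49, 50), "intermediate": (91, 100), "advanced": (83, 100)},
}


def overall(per_level):
    """Sum per-level counts and return (correct, total, success rate)."""
    correct = sum(c for c, _ in per_level.values())
    total = sum(t for _, t in per_level.values())
    return correct, total, correct / total


for model, per_level in results.items():
    correct, total, rate = overall(per_level)
    print(f"{model}: {correct}/{total} correct ({rate:.1%})")
```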
Readability
The objective of this evaluation metric is to examine the readability of code generated
by each LLM, as this can influence the ease with which a human can understand the
code. The analysis consists of three tasks, starting with the distribution of readability
scores across all tests. Also, the average readability scores are provided, categorized
by dataset and query complexity. The approach to this analysis is analogous to the
one used in assessing Functional Correctness, as previously examined. As described in
subsection 3.3, the readability score is derived from a custom function, which produces
scores ranging from ‘1’ to ‘3’, where a score of ‘3’ indicates highly readable code. This
function evaluates the code based on factors such as line length, method call chains,
and nested structures, assigning penalties for lines exceeding 80 characters, method call
chains longer than three calls, and nesting depths greater than two levels.
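The three criteria named above can be sketched with Python's ast module. This is not the thesis's custom function: the thresholds are taken from the text, but the scoring rule (one point deducted per violated criterion, with a floor of 1), the set of statements counted as nesting, and all names are assumptions made for illustration.

```python
import ast

MAX_LINE_LENGTH = 80     # thresholds as stated in the text
MAX_CHAIN_CALLS = 3
MAX_NESTING_DEPTH = 2

# Assumption: these statement types count as one nesting level each.
_NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try, ast.FunctionDef)


def _chain_length(node):
    """Count consecutive method calls in a chain such as a.b().c().d()."""
    length = 0
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        length += 1
        node = node.func.value
    return length


def _max_depth(node, depth=0):
    """Return the deepest level of nested block statements below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, _NESTING_NODES) else depth
        deepest = max(deepest, _max_depth(child, child_depth))
    return deepest


def readability_score(code):
    """Score from 1 to 3; assumed rule: minus one point per violated criterion."""
    tree = ast.parse(code)
    penalties = 0
    if any(len(line) > MAX_LINE_LENGTH for line in code.splitlines()):
        penalties += 1
    if any(_chain_length(n) > MAX_CHAIN_CALLS for n in ast.walk(tree)):
        penalties += 1
    if _max_depth(tree) > MAX_NESTING_DEPTH:
        penalties += 1
    return max(1, 3 - penalties)


snippet = "processeddf = df.filter(df['type'] == 'TV Show').groupBy('country').count()"
print(readability_score(snippet))
```

Only parsing is involved, so the snippet does not need a real DataFrame; code that cannot be parsed would raise a SyntaxError, which a fuller implementation would have to handle.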
The results shown in Fig. 4.27 indicate that the majority of tests using Codestral
received a readability score of ‘2’. Specifically, 176 out of 250 tests were assigned this
score, while the remaining 74 tests earned the highest score of ‘3’. No tests were rated with
a score of ‘1’. In a similar evaluation, Qwen 2.5 Coder displayed a comparable pattern,
with 177 out of 250 tests scoring ‘2’ and 71 tests scoring ‘3’. However, two tests generated
by Qwen were assigned a score of ‘1’. While this is a small and potentially insignificant
number, it suggests that, on rare occasions, the code produced by Qwen may exhibit more
107
Chapter 4. Testing and Results: Assessing the Study’s Performance
noticeable readability issues. For both models, the primary reason for not achieving the
highest score was the presence of long lines of code. This indicates that the generated code
often exceeds the recommended 80-character limit, a factor penalized by the readability
function, as such lines are considered harder to read and maintain. While longer lines
can sometimes improve code efficiency by consolidating commands, they often detract
from the readability for humans, which is the main reason most generated code did not
attain the highest readability score of ‘3’.
Figure 4.27. Plot for the distribution of readability scores across the tests conducted [4].
Fig. 4.28 presents the average readability scores for each dataset. For Codestral,
the ‘Madrid Daily Weather’ dataset shows the lowest average readability score at 2.08,
with 46 out of 50 tests receiving a score of ‘2’. In comparison, Qwen 2.5 Coder achieves
a slightly higher average of 2.16 on the same dataset, with 42 out of 50 tests scoring
‘2’, suggesting a marginally better format optimization. Similarly, for the ‘Shared Cars
Locations’ dataset, Codestral’s average score is 2.28, with 36 out of 50 tests rated ‘2’,
whereas Qwen records a slightly higher average of 2.38, with 31 out of 50 tests receiving
a ‘2’ rating. In the case of the ‘Netflix’ dataset, Codestral attains an average score of 2.30,
with 35 out of 50 tests rated ‘2’, while Qwen has a lower average of 2.16, with 38 out of
50 tests rated ‘2’, but also includes 2 tests with a score of ‘1’. The ‘Supermarket Sales’
dataset yields similar averages for both models, with Codestral scoring 2.38 and Qwen
slightly ahead at 2.40. For the ‘COVID19-Twitter’ dataset, Codestral surpasses Qwen
with an average score of 2.44, compared to Qwen’s 2.28. While the results do not lead to
definitive conclusions or recommendations, they may suggest that certain datasets allow
for more efficient operations (in terms of the number of commands used), potentially
reflecting the specific nature and context of the data in these datasets.
Figure 4.28. Plot depicting the average readability scores by dataset [4].
Fig. 4.29 displays the average readability scores categorized by query complexity
levels. For Codestral, basic queries achieve the highest average readability score of 2.8,
with 40 out of 50 tests generating code that received a readability score of ‘3’. Intermediate
queries have an average score of 2.26, with 74 out of 100 tests scoring ‘2’, while advanced
queries yield the lowest average of 2.08, with 92 out of 100 tests rated ‘2’, and only 8
tests scoring ‘3’. In contrast, for Qwen 2.5 Coder, basic queries achieve a slightly lower
average readability score of 2.58, with 29 out of 50 tests receiving a ‘3’ rating. Intermediate
queries have an average of 2.29, with 69 out of 100 tests rated ‘2’, and one test scoring ‘1’.
Advanced queries average 2.11, with 87 out of 100 tests rated ‘2’, 12 scoring ‘3’, and one
test rated ‘1’. These results suggest that both models generally produce less readable code
as query complexity increases, which is expected. It is worth noting that, while Codestral
tends to generate more concise, single-line code for basic queries, Qwen’s approach for
simpler queries appears to be slightly less efficient. As for intermediate and advanced
queries, both models predominantly generate code with longer lines of commands, which
contributes to the higher occurrence of readability scores of ‘2’.
Figure 4.29. Plot displaying the average readability by query complexity level [4].
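The averages quoted above follow directly from the score counts. The sketch below recomputes two of them as a worked example; counts not stated explicitly in the text (10 tests scoring ‘2’ for Codestral's basic queries, 30 tests scoring ‘3’ for Qwen's intermediate queries) are derived by subtraction from the stated totals, and the helper name is hypothetical.

```python
def mean_score(counts):
    """Weighted mean of readability scores, given {score: number_of_tests}."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


# Codestral, basic queries: 40 of 50 scored 3; the other 10 scored 2 (derived).
codestral_basic = {3: 40, 2: 10}
# Qwen, intermediate queries: 69 scored 2, one scored 1; the other 30 scored 3 (derived).
qwen_intermediate = {3: 30, 2: 69, 1: 1}

print(round(mean_score(codestral_basic), 2))
print(round(mean_score(qwen_intermediate), 2))
```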
Efficiency
This metric is designed to evaluate the performance of the LLMs in terms of their
computational resource usage during the code generation process. As detailed in subsec-
tion 4.2.1, the Codestral and Qwen 2.5 Coder models are not fully loaded onto the GPU.
As a result, both the GPU and the CPU are utilized during the code generation process,
with GPU memory in particular being heavily used. The first task in the analysis is to
examine the distribution of GPU usage, CPU usage, GPU memory, system memory (RAM),
and response time
across the tests conducted for each LLM. The second task is to analyze computational
resource usage based on query complexity and readability, in order to understand how
performance varies across different query complexity levels and readability scores. It is
important to note that readability scores of ‘2’ generally correspond to longer lines of code,
which tend to require longer periods for code generation.
Fig. 4.30 shows the distribution of response times (in seconds) during the code gen-
eration process for both LLMs. Response time is measured from the moment the natural
language query is sent to the LLM, until the complete response is received. As depicted
in the figure, most code generation processes had response times under 100 seconds
for both models. For Codestral, the average response time was 79.64 seconds, with a
median of 67.90 seconds, and some instances exceeding 200 seconds, reaching a maxi-
mum of 319.53 seconds. On the other hand, Qwen 2.5 Coder demonstrated significantly
better efficiency, with an average response time of 27.66 seconds, a median of 22.98
seconds, and a maximum of 81.38 seconds. This considerable difference suggests that
Codestral’s slower performance is largely due to the partial offloading of operations to the
GPU, whereas Qwen benefits from a larger proportion of operations being offloaded to the
GPU. Since Qwen has fewer parameters (14B compared to Codestral’s 22B), its smaller
total size allows it to achieve response times closer to real-time standards.
Figure 4.30. Distribution of the LLM server’s response time during code generation [4].
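The response-time measurement described above (from sending the query until the complete response is received) can be sketched with a wall-clock timer around a blocking call. The function names are hypothetical, and a small stand-in replaces the actual request to the LLM server.

```python
import statistics
import time


def timed_generation(generate, query):
    """Return the model's response and the wall-clock seconds it took."""
    start = time.perf_counter()
    response = generate(query)            # blocks until the full response arrives
    return response, time.perf_counter() - start


def fake_llm(query):
    """Hypothetical stand-in for the call to the LLM server."""
    time.sleep(0.01)
    return f"# code for: {query}"


times = []
for _ in range(5):
    _, seconds = timed_generation(fake_llm, "Count the Number of TV Shows per Country")
    times.append(seconds)

print(f"mean={statistics.mean(times):.3f}s "
      f"median={statistics.median(times):.3f}s max={max(times):.3f}s")
```

The mean, median, and maximum computed at the end correspond to the three summary statistics reported for each model in the text.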
Figures 4.31 and 4.32 illustrate the distribution of CPU and memory usage (as per-
centages) for the LLM server, during the code generation process. For Codestral, most
queries resulted in an average CPU usage of approximately 28.5%. Only a small number
of processes exceeded 50% usage, indicating that the CPU demand during code generation
was generally moderate. In contrast, Qwen 2.5 Coder displayed an average CPU usage
of around 37.6%, with its interquartile range indicating that many processes consumed
between 27% and 47.8% of the available CPU resources. Regarding system memory,
both models showed minimal RAM usage. Codestral’s average memory usage was about
0.44%, with nearly all queries falling within the 0–2% range. Qwen’s average was slightly
higher at around 0.45%, though its distribution was somewhat narrower, with a median
of 0.48% and a maximum of 4.65%. These findings suggest that, alongside the moderate
CPU usage (somewhat higher for Qwen), both models also drew on resources other than
the CPU and system RAM, namely the GPU and its memory. This is expected,
since both LLMs were deployed across both GPU and CPU.
Figure 4.31. Distribution of the LLM server’s CPU usage during code generation [4].
Figure 4.32. Distribution of the LLM server’s memory usage during code generation [4].
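The summary statistics used in this discussion (mean, median, interquartile range, maximum) can be computed from the per-test usage columns with Python's statistics module. The sketch below is illustrative: the sample values are placeholders, not measurements from the thesis, and the helper name is hypothetical.

```python
import statistics


def summarize(samples):
    """Mean, quartiles, and maximum of a list of usage percentages."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {
        "mean": statistics.mean(samples),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(samples),
    }


# Placeholder percentages for illustration only (not thesis data).
cpu_usage = [22.0, 25.5, 27.0, 31.5, 36.0, 41.0, 47.8, 55.0]
for key, value in summarize(cpu_usage).items():
    print(f"{key}: {value:.2f}")
```

The interquartile range is the interval between q1 and q3; note that different quantile conventions can shift these two values slightly for small samples.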
Figures 4.33 and 4.34 offer additional insights into resource usage, by displaying the
overall GPU utilization (as percentages) during the code generation processes across all
tests. In terms of GPU usage, most processes utilized approximately 20% of the GPU’s
capacity. For Codestral, the average GPU usage was around 21.3%, with the interquartile
range spanning from roughly 18.7% to 21.8%. Qwen 2.5 Coder demonstrated a slightly
lower average of 18.0%, with most tests falling between 14.4% and 19.3%. These mod-
erate GPU usage levels are consistent with the observed CPU utilization, reflecting the
shared workload between CPU and GPU for both models. In contrast, GPU memory us-
age was consistently high for both LLMs. Codestral used an average of approximately
88.8% of the available GPU memory, while Qwen averaged around 90.1%. This suggests
that a significant portion of each model’s parameters was loaded into GPU memory. The
combination of moderate GPU and CPU usage with high GPU memory consumption im-
plies that the code generation tasks likely involved short bursts of computation, rather
than sustained, high-load processing. The system appears to have maximized the use of
GPU memory, while distributing the computational workload between the GPU and CPU.
Furthermore, the relatively moderate levels of GPU usage may indicate that the generated
code was not particularly complex for the models to handle, resulting in lower overall
computational demands.
Figure 4.33. Distribution of the LLM server’s GPU usage during code generation [4].
Figure 4.34. Distribution of the LLM server’s GPU memory usage during code generation
[4].
Figures 4.35 and 4.36 show the average response times based on the readability of
the generated code and the query complexity level. In terms of readability, Codestral
required significantly more time to generate code rated with a readability score of ‘2’,
averaging around 93.7 seconds per query. In contrast, code with a readability score of
‘3’ was generated in approximately 46.2 seconds. A similar pattern is observed for Qwen
2.5 Coder, though with generally lower response times. Specifically, tests rated ‘2’ took
about 31.6 seconds on average, while those rated ‘3’ completed in roughly 18.1 seconds.
This difference is expected, as a score of ‘2’ typically reflects longer lines of code, which
require more processing, whereas a score of ‘3’ indicates shorter, more concise code that
is quicker to generate. Qwen’s faster overall response times are consistent with its smaller
size (14B parameters) and greater capability to offload operations to the GPU, compared
to Codestral.
Figure 4.35. The LLM server’s average response time, by code readability [4].
Figure 4.36. The LLM server’s average response time, by query complexity level [4].
The response time results based on query complexity follow a predictable pattern.
For Codestral, basic queries had the shortest average response time at approximately
26.5 seconds. Intermediate queries took around 67.5 seconds, while advanced queries
required close to 118.3 seconds per query. Qwen 2.5 Coder showed a similar progression,
with basic queries averaging 10.3 seconds, intermediate ones at about 23.6 seconds,
and advanced queries taking roughly 40.5 seconds. These results are consistent with
expectations, as simpler queries typically lead to shorter, less complex code, while more
advanced queries involve greater processing and longer generation times.
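As a consistency check, the per-level averages above can be combined into an overall average, weighting each level by its number of tests (50 basic, 100 intermediate, 100 advanced). The result should land close to the overall means reported earlier (79.64 s for Codestral and 27.66 s for Qwen); small gaps are expected because the per-level values are rounded. The helper name is hypothetical.

```python
tests_per_level = {"basic": 50, "intermediate": 100, "advanced": 100}

# Average response times (seconds) per query level, as reported in the text.
mean_seconds = {
    "Codestral": {"basic": 26.5, "intermediate": 67.5, "advanced": 118.3},
    "Qwen 2.5 Coder": {"basic": 10.3, "intermediate": 23.6, "advanced": 40.5},
}


def weighted_mean(per_level):
    """Overall mean implied by per-level means and the number of tests per level."""
    total_tests = sum(tests_per_level.values())
    weighted_sum = sum(per_level[level] * n for level, n in tests_per_level.items())
    return weighted_sum / total_tests


for model, per_level in mean_seconds.items():
    print(f"{model}: {weighted_mean(per_level):.2f} s")
```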
Figures 4.37 and 4.38 present the average CPU and GPU usage levels based on code
readability and query complexity. A notable observation from both figures is that CPU and
GPU usage levels are relatively similar between the two readability scores. For CPU usage,
Codestral’s values for readability score ‘2’ fall between approximately 27.5% and 32.3%,
while those for score ‘3’ range from around 23.3% to 32.6%. A similar pattern is seen
with Qwen 2.5 Coder: For readability score ‘2’, CPU usage ranges from roughly 34.0% to
37.6%, whereas for score ‘3’, it ranges from 41.6% to 42.6%. The two Qwen cases with
a readability score of ‘1’ were excluded from analysis, due to their limited representation.
As for GPU usage, Codestral’s values for readability score ‘2’ range from about 20.2% to
21.5%, and for score ‘3’ from 20.3% to 22.4%. Qwen’s GPU usage under readability score
‘2’ ranges between 15.9% and 18.4%, while under score ‘3’, it increases slightly, ranging
from 17.4% to 20.98%. Overall, the CPU and GPU utilization appear consistent across
readability levels.
Figure 4.37. Average CPU usage of the LLM server, by readability and query complexity
[4].
Figure 4.38. The LLM server’s average GPU usage, by readability and query complexity
[4].
These results can be explained by the system’s operational behavior during code gen-
eration. As the process progresses, the system tends to reach a steady state in terms of
CPU and GPU usage. Once this stable condition is established, the level of resource con-
sumption generally remains consistent until the response is fully generated, regardless
of the code’s complexity or length. This behavior may also help explain another finding
related to query complexity: In certain cases, basic queries show higher average CPU
and GPU usage than intermediate or advanced queries, and occasionally even record the
highest usage among the three.
Figure 4.39. Average response times of the LLM server, by readability and query complex-
ity [4].
Since basic queries typically require much shorter generation times (as shown in Fig.
4.39), their CPU and GPU usage is more affected by the system’s initial performance
surge. In the case of intermediate queries, this initial spike tends to have less impact, as
the longer duration of the process allows these early values to be smoothed out. The same
reasoning applies to advanced queries. As time passes, the system enters a more stable
phase where CPU and GPU usage levels off, while GPU memory usage remains consistently
high throughout. This interpretation is supported by the average GPU memory usage
shown in Fig. 4.40, where all readability scores and query complexity levels exhibit
similarly high and steady memory demand. Nonetheless, it is important to highlight that
the differences in usage values across all cases are relatively minor, so these observations
should be considered with caution.
Figure 4.40. Average GPU memory usage of the LLM server, by readability and query
complexity [4].
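The smoothing argument made above can be illustrated with a simple two-phase model: a short initial surge followed by a steady state. In the sketch below, the generation times are Codestral's per-level averages from the text, whereas the surge duration and the two usage levels are hypothetical values chosen only to show the direction of the effect.

```python
def average_usage(total_seconds, spike_seconds, spike_level, steady_level):
    """Time-weighted average usage for a run with an initial surge."""
    spike_seconds = min(spike_seconds, total_seconds)
    steady_seconds = total_seconds - spike_seconds
    return (spike_level * spike_seconds + steady_level * steady_seconds) / total_seconds


# Hypothetical surge: 5 s at 60% usage, then a steady 25% (not thesis data).
SPIKE_SECONDS, SPIKE_LEVEL, STEADY_LEVEL = 5.0, 60.0, 25.0

# Codestral's average generation times per query level, from the text.
durations = {"basic": 26.5, "intermediate": 67.5, "advanced": 118.3}
for level, seconds in durations.items():
    usage = average_usage(seconds, SPIKE_SECONDS, SPIKE_LEVEL, STEADY_LEVEL)
    print(f"{level}: {usage:.1f}% average usage")
```

Under this simplified model the shortest runs show the highest average usage, because the same surge is averaged over less time, which matches the qualitative explanation given for basic queries.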
Automation
This evaluation metric aims to measure how effectively the generated code can be
executed automatically, without the need for manual adjustments. After each response
is generated by the LLM, it is reviewed by the end-user. If the code is immediately usable
(executable) for data processing, it is labeled as True for automation. Conversely, if minor
manual modifications are required, such as slight adjustments or isolating a functional
part of it, the code is labeled as False, even if it ultimately produces the correct result.
The analysis begins by examining how many tests fall into each of the two categories.
Subsequently, the automation status is evaluated in relation to functional correctness
and query complexity. These steps are intended to assess each LLM’s ability to deliver
code that directly fulfills the prompt instructions, and could also be used as part of an
automated data analysis pipeline.
Figure 4.41 shows the automation categorization across all 250 tests. For Codestral,
202 responses were marked as True, indicating that the generated code could be used
without any manual intervention. The remaining 48 tests were labeled as False, sug-
gesting that some human involvement was necessary before execution. This corresponds
to an automation success rate of roughly 81%, meaning that about 1 in 5 responses still required
some adjustment. In contrast, Qwen 2.5 Coder produced 239 fully automated responses,
achieving a success rate of roughly 95.6%, with only 11 tests requiring minor edits. These
results suggest that Qwen may currently be more suitable for use in a fully automated
environment, while Codestral might benefit from further refinement or optimization of its
initial, pre-trained configuration.
Figure 4.41. The amount of automated and semi-automated tests, based on human inter-
vention to their code [4].
As previously mentioned, there are two main reasons why human intervention was
occasionally required for executing the generated code. First, despite explicit instructions
in the prompt to avoid it, the LLMs occasionally included additional explanatory text not
formatted as Python comments. In such cases, this extraneous text was manually re-
moved, retaining only Python-compliant comments where applicable. If this adjustment
had not been made, the code application step would have failed, as the exec() function
cannot interpret plain text that is not syntactically valid Python code. Second, the work-
flow expected the final output of the generated code to be assigned to a variable named
“processeddf”. If the LLM omitted this, or assigned the result to a different variable, the
final output was manually redirected to the “processeddf” variable. Without this variable
assignment, the system would be unable to continue with the subsequent result extrac-
tion step. Although the prompt explicitly instructed the LLMs to assign the final result to
“processeddf”, this requirement was not always met.
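The two conditions described above can be checked programmatically before the code application step. The following is a minimal sketch, not the implementation used in this study: the helper names automation_status and run_generated_code are hypothetical, the result variable is spelled as it appears in the text, and a plain Python list stands in for the Spark DataFrame so that the example runs without a Spark installation.

```python
import ast

RESULT_VAR = "processeddf"  # variable name as it appears in the text


def automation_status(response: str) -> tuple[bool, str]:
    """Classify an LLM response as directly executable (True) or not (False)."""
    try:
        tree = ast.parse(response)
    except SyntaxError:
        # Plain-text explanations outside comments make the response invalid Python.
        return False, "response is not valid Python"

    # Collect every simple name that is assigned somewhere in the code.
    assigned = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    if RESULT_VAR not in assigned:
        return False, f"final result is not assigned to '{RESULT_VAR}'"
    return True, "ok"


def run_generated_code(response: str, namespace: dict):
    """Execute the code only if it passes both checks, then return the result."""
    ok, reason = automation_status(response)
    if not ok:
        raise ValueError(reason)
    exec(response, namespace)  # only appropriate in a trusted, isolated environment
    return namespace[RESULT_VAR]


# Toy stand-ins for LLM responses; a list replaces the Spark DataFrame here.
clean = "processeddf = [row for row in df if row > 1]"
chatty = "Here is the code you asked for:\nprocesseddf = df"
wrong_var = "result = [row for row in df if row > 1]"

for name, response in [("clean", clean), ("chatty", chatty), ("wrong_var", wrong_var)]:
    print(name, automation_status(response))

print(run_generated_code(clean, {"df": [1, 2, 3]}))
```

A check of this kind only detects the two failure modes. Whether a failing response is then repaired manually, as in this study, or rejected and regenerated is a separate design decision.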
Figures 4.42 and 4.43 examine the relationship between automation outcomes and
two additional metrics: functional correctness and query complexity. Concerning func-
tional correctness, the results demonstrate that automation status did not significantly
affect the correctness of the generated code. In other words, whether or not manual re-
finement was needed, the likelihood of producing a functionally correct result remained
relatively stable. For Codestral, 75% of all tests (187 out of 250) produced correct outputs
when the code was fully automated, while another 16.4% (41 out of 250) were correct
despite requiring minor human intervention. Qwen 2.5 Coder exhibited a similar trend,
achieving 213 correct outcomes from its automated responses, corresponding to 85%,
and an additional 4% (10 out of 250) correct outcomes after slight manual adjustments.
These findings suggest that human intervention was generally limited to formatting or
output variable compliance, rather than corrections to the underlying logic of the code.
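The counts above can also be read as conditional rates, that is, the share of correct results within each automation category. The short sketch below derives these rates from the reported counts. The conditional percentages are computed here for illustration and are not figures reported in the study itself.

```python
# Counts reported above: category totals and correct results within each category.
counts = {
    "Codestral": {"auto": 202, "auto_correct": 187, "manual": 48, "manual_correct": 41},
    "Qwen 2.5 Coder": {"auto": 239, "auto_correct": 213, "manual": 11, "manual_correct": 10},
}

conditional = {}
for model, c in counts.items():
    # Correctness rate among fully automated responses.
    auto_rate = 100 * c["auto_correct"] / c["auto"]
    # Correctness rate among responses that needed manual edits.
    manual_rate = 100 * c["manual_correct"] / c["manual"]
    # Overall correctness, regardless of automation status.
    overall = 100 * (c["auto_correct"] + c["manual_correct"]) / (c["auto"] + c["manual"])
    conditional[model] = (round(auto_rate, 1), round(manual_rate, 1), round(overall, 1))
    print(model, conditional[model])
```

Because the semi-automated groups are small (48 and 11 tests), the conditional rates for those groups should be interpreted with caution.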
Figure 4.42. Comparing Functional Correctness results with Automation occurrences [4].
Figure 4.43. Presenting the automation occurrences by the query complexity levels [4].
As for the latter figure (4.43), the automation rate for Codestral was 86% for both
basic and intermediate queries (43 out of 50, and 86 out of 100 tests, respectively),
while advanced queries demonstrated a slightly lower automation rate of 73% (73 out
of 100 tests). In comparison, Qwen 2.5 Coder exhibited consistently higher automation
performance across all levels of query complexity, achieving 100% automation for basic
queries (50 out of 50), 91% for intermediate queries (91 out of 100), and 98% for advanced
queries (98 out of 100). While the differences in Codestral’s automation percentages
are relatively minor, they may indicate a tendency for the model to introduce additional
explanatory content in more complex scenarios, which occasionally necessitates minor
human intervention. Nonetheless, these variations are not pronounced enough to support
strong conclusions. Overall, both LLMs demonstrated high levels of automation across
all complexity levels, reinforcing their potential to generate executable code with little to
no manual refinement.
Error Handling
This evaluation metric aims to assess the robustness of the generated code, specif-
ically focusing on the frequency and severity of errors, as well as their impact on the
functional correctness of the results. The three plots presented in this section illustrate
the distribution of errors across all tests, segmented by dataset, query complexity level,
and the potential influence of errors on output accuracy. As shown in Figure 4.44, only
20 out of 250 tests conducted with Codestral contained errors in the generated code,
yielding a success rate of 92%, which stands as a strong indication of the model’s relia-
bility. Furthermore, when considering both the presence of errors and the correctness of
results, 218 out of 250 tests featured error-free code that produced accurate outcomes,
corresponding to an overall success rate of 87%. These findings highlight Codestral’s ro-
bustness, and its capacity to consistently generate syntactically and functionally sound
code.
Figure 4.44. Comparing Functional Correctness results with error counts [4].
In comparison, Qwen 2.5 Coder demonstrated a somewhat higher incidence of errors,
with 41 out of 250 tests containing erroneous code, corresponding to a success rate of
roughly 84%. When functional correctness is considered, 200 out of 250 tests involved error-free
code that produced the correct result, leading to an 80% success rate. These results
remain encouraging for both LLMs, particularly given the complexity of the task: Generating
PySpark code using Spark DataFrames. The structural and functional differences
between Spark and the more commonly used pandas DataFrames could have introduced
considerable challenges. Consequently, the observed low error rates represent a notewor-
thy achievement. However, it is important to note that, in Qwen’s case, the prompt had
to be supplemented with an additional instruction to guide the model toward generating
valid PySpark commands, as it occasionally intermingled them with invalid pure Python
syntax. This point will be elaborated upon in subsection 5.3.1.
Figures 4.45 and 4.46 illustrate the distribution of error occurrences and their cor-
responding impact on functional correctness, across the five datasets used in testing, as
well as across the three levels of query complexity. While the relatively small number of
errors makes it premature to draw definitive conclusions, some preliminary observations
can be made to guide future research efforts. Regarding the dataset-wise distribution,
Codestral exhibited the highest number of error-labeled tests in the ‘Shared Cars Lo-
cations’ dataset, with a total of 13 errors. This is markedly higher than in the other
datasets, where the ‘COVID-19 Twitter’ dataset yielded 3 errors, the ‘Supermarket Sales’
dataset 2 errors, and both the ‘Madrid Daily Weather’ and ‘Netflix’ datasets only 1 error
each. Similarly, Qwen 2.5 Coder recorded the highest error count in the ‘Shared Cars
Locations’ dataset, with 22 errors. As for the remaining ones, Qwen produced 7 errors in
the ‘Netflix’ dataset, 6 in the ‘COVID-19 Twitter’ dataset, 4 in the ‘Madrid Daily Weather’
dataset, and 2 in the ‘Supermarket Sales’ dataset.
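As a consistency check, the per-dataset counts listed above can be summed and compared with the overall totals. The sketch below uses only the numbers stated in the text. The derived shares (for example, the proportion of each model's errors that fall in a single dataset) are computed here for illustration.

```python
TOTAL_TESTS = 250

# Error counts per dataset, as listed above.
errors = {
    "Codestral": {
        "Shared Cars Locations": 13,
        "COVID-19 Twitter": 3,
        "Supermarket Sales": 2,
        "Madrid Daily Weather": 1,
        "Netflix": 1,
    },
    "Qwen 2.5 Coder": {
        "Shared Cars Locations": 22,
        "Netflix": 7,
        "COVID-19 Twitter": 6,
        "Madrid Daily Weather": 4,
        "Supermarket Sales": 2,
    },
}

summary = {}
for model, per_dataset in errors.items():
    total = sum(per_dataset.values())
    worst = max(per_dataset, key=per_dataset.get)
    summary[model] = {
        "total_errors": total,
        "error_free_rate": round(100 * (TOTAL_TESTS - total) / TOTAL_TESTS, 1),
        "worst_dataset": worst,
        "worst_share": round(100 * per_dataset[worst] / total, 1),
    }
    print(model, summary[model])
```

For both models, more than half of all errors originate from a single dataset, which is why the dataset-specific explanation that follows matters more than the aggregate error rates.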
Figure 4.45. The number of errors found in the tests, per dataset [4].
Figure 4.46. The total count of errors, grouped by query complexity levels [4].
Regarding the distribution of functional correctness, it was observed that in most
cases, the presence of errors in the generated code led to incorrect outcomes, indicating
that such errors adversely affected the desired output. However, an exception was noted
in the case of the ‘Shared Cars Locations’ dataset, where 10 tests from Codestral and 20
tests from Qwen 2.5 Coder included code that contained errors, but still produced correct
results. As shown in Fig. 4.46, the majority of these cases corresponded to the dataset’s
two intermediate queries. Nevertheless, Qwen also exhibited a noticeable number of such
instances in basic queries. The primary reason for this phenomenon lies in how both
LLMs interpret and handle the dataset’s timestamp fields. Although the models occa-
sionally failed to correctly modify, or explicitly define these fields as instructed, PySpark’s
internal handling of datetime data enabled the system to operate correctly, despite these
shortcomings. Consequently, the functional correctness of the results remained intact,
even when the code deviated from the expected format. For Codestral, these 10 instances
account for 20% of the total tests conducted on the ‘Shared Cars Locations’ dataset, while
for Qwen, the 20 successful (yet erroneous) tests represent 40% of the same dataset.
An early interpretation of this behavior suggests that both Codestral and Qwen may
require further refinement in understanding the contextual and relational dynamics be-
tween columns in datasets involving location information, particularly in aligning times-
tamp data with their corresponding spatial records. To validate and expand upon this
observation, additional studies should be conducted using alternative forms of location-
based datasets, such as those involving indoor positioning data [5]. An indoor positioning
use case for the future application of this thesis’s study is presented in Chapter 6.
Chapter 5
Thesis Findings: Critical Assessment of the Pro-
posed System
5.1 General Note
As a concept, this thesis’s scalable data management and AI-powered analytics pro-
posal can offer a promising alternative approach in the domain of data analysis, profiling,
and quality assessment. Its capability to analyze entire datasets, support user-defined
quality rules, and operate independently of data volume, can make it a useful framework
for both data analysts and engineers. In addition, the insights gained from its use are ex-
pected to highlight the benefits of analyzing complete datasets with ease, making queries
in natural language and having an offline LLM translate them into executable analysis code.
As previously noted, the development of this framework consists of an implementation
introduced in a publication in 2023 [1], as well as a comprehensive study in 2025 [4].
5.2 Validating Scalable Data Management Frameworks
Regarding the thesis’s underlying software infrastructure, further research could be
conducted to enhance the capabilities of the scalable data management layer and demon-
strate its applicability across multiple business sectors, beyond its current tested environ-
ments. Preliminary testing with telecommunications data from the OTE Group supports
this broader potential. Evaluating the layer within industrial environments would pro-
vide valuable insights into its versatility. Future development efforts should prioritize the
long-term integration of assets like the scalable data management layer in CI sectors and
business organizations, assessing their overall impact.
Continued support from the European Union in academic institutions and companies
(through participation in research initiatives) can prove to be important, particularly with
a shift towards solutions that are closer to real-world deployment. This recommendation
also extends to global research teams interested in efficient and scalable data manage-
ment. Future frameworks should undergo comprehensive testing across diverse domains
to ensure they can handle large volumes of heterogeneous data that vary in structure,
type, and format. A current limitation of the proposed scalable data management layer
is its focus on tabular data, which, as of now, restricts its ability to process other data
types. This issue should be addressed by adapting the architecture to accommodate a
wider range of data formats. Lastly, while the layer has proven to be effective within a set
of domain-specific datasets, its broader potential remains to be fully explored.
5.3 Entrusting Offline LLMs for Future Data Analytics
5.3.1 Strengths and Limitations
As the main and final component of this thesis’s framework, the AI-powered data ana-
lytics asset has demonstrated solid signs of operational efficiency. The results highlighted
the strong performance of the two models tested, mainly in generating functional code
scripts aligned with the given objectives. In the case of Codestral, 87% of the 250 test
cases were fully successful, indicating both functional correctness and error-free execu-
tion. Similarly, Qwen 2.5 Coder achieved a full success rate of 80% across the same test
set. Moreover, the generated code from both models generally received readability scores
between ‘2’ and ‘3’, suggesting it was easy for humans to interpret. Only 2 of Qwen’s
250 tests received the lowest readability score of ‘1’. In terms of automation, Codestral
successfully automated about 81% of its outputs, while Qwen surpassed this with a 95.6%
automation rate, reducing the need for human intervention. Overall, when considering
correctness irrespective of minor errors or manual edits, Codestral reached a 91% success
rate, and Qwen achieved approximately 89%. These findings demonstrate the reliable code-generation
capabilities of both models, even in the context of generating PySpark code, which is
typically more complex than standard Python.
Overall, both offline LLMs proved capable of supporting data analysis workflows, with-
out requiring data to be uploaded into the models themselves. Instead, they generated
tailored code that was executed within the dedicated, scalable data management platform.
This outcome reinforces one of this thesis’s original aims: To enable secure and effective
data analysis through end-user queries expressed in natural language. By facilitating
code generation in secure, on-premises environments, this offline LLM-powered approach
addresses critical concerns related to data security, privacy, and integrity. The results
suggest that offline LLMs can be further explored as reliable tools for code generation in
data analytics tasks, offering a compelling alternative to online models, and potentially
replacing conventional data analysis frameworks.
Nonetheless, this thesis’s study also revealed certain limitations. Despite the robust
hardware specifications of the machine hosting the language models, performance was
not sufficient for fast, real-time code generation. On average, Codestral required approx-
imately 60 seconds to produce a response, while Qwen 2.5 Coder responded in about 25
seconds. These response times, particularly in the case of Codestral, are not ideal for
practical applications. While both CPU and GPU usage remained moderate, GPU memory
was consistently under heavy load. This indicates an early and clear need for a more
powerful GPU to achieve faster performance. As such, hardware limitations currently
represent the primary obstacle to scaling this approach further, based on the present
capabilities of offline LLMs.
The field of large language models is advancing rapidly. Under current technological
conditions, a major performance improvement would likely require a significant GPU
upgrade, ideally one with sufficient dedicated memory to fully accommodate the LLM.
Achieving near-real-time performance would necessitate dedicating a high-performance
server exclusively to the model itself. However, the cost associated with such a system
would be substantial, potentially making it prohibitive for many organizations. Whether
such an investment is justified depends on each organization’s priorities and available
resources. Nevertheless, the ability to conduct data analysis securely without exposing
sensitive data to external environments remains a strong incentive for adopting offline
LLMs in on-premise applications.
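Whether a model can be fully accommodated in dedicated GPU memory can be estimated with a back-of-the-envelope calculation. The sketch below is a rough approximation under stated assumptions: it counts the weights only, adds an assumed 20% overhead for caches and runtime buffers, and uses a hypothetical 22-billion-parameter model and a hypothetical 24 GiB GPU. None of these values are taken from the study's actual setup.

```python
def weights_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(params_billion, bits_per_weight, vram_gib, overhead=0.2):
    """True if weights plus an assumed overhead fraction fit in GPU memory.

    The 20% overhead (cache, activations, runtime buffers) is a rough assumption.
    """
    return weights_gib(params_billion, bits_per_weight) * (1 + overhead) <= vram_gib


# Hypothetical model size and GPU capacity, for illustration only.
for bits in (16, 8, 4):
    size = weights_gib(22, bits)
    print(f"{bits}-bit: {size:.1f} GiB, fits in 24 GiB: {fits_in_vram(22, bits, 24)}")
```

An estimate of this kind indicates only whether full GPU offloading is plausible. Actual memory demand also depends on context length and on the serving runtime.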
In addition to performance limitations, improvements in process automation, particularly
for Codestral, are necessary. In roughly 19% of its test cases, human intervention was
required to make minor adjustments to the generated code, such as removing extraneous
text or assigning the final output to a specified variable, even when these instructions
were clearly stated in the prompt. Achieving full automation is essential for practical,
real-world use. By contrast, Qwen 2.5 Coder required manual edits in only 4.4% of its
test cases, indicating a higher level of readiness for deployment in automated environ-
ments.
Finally, while both Codestral and Qwen 2.5 Coder consistently followed the instruc-
tions provided in the main prompt, it was noted that Qwen initially tended to generate
partial solutions using standard Python, rather than valid PySpark operations. To miti-
gate this, the prompt for Qwen was modified to include the instruction: ‘Ensure the code
runs as valid PySpark DataFrame operations, not standard Python, and verify its execu-
tion within Spark.’ This adjustment highlights the importance of model-specific prompt
tuning. It should not be viewed as a shortcoming, but rather as a necessary refinement
within the broader evaluation process.
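Model-specific prompt tuning of this kind can be kept separate from the shared prompt, so that each model receives only the extra instructions it needs. The sketch below illustrates one possible arrangement. The base prompt wording, the model keys, and the build_prompt helper are hypothetical, and only the additional Qwen instruction is quoted from the text above.

```python
# Hypothetical base prompt; the study's actual prompt is not reproduced here.
BASE_PROMPT = (
    "Write PySpark code that answers the user's query on the DataFrame 'df'. "
    "Return only code, and assign the final result to 'processeddf'."
)

# Extra instructions keyed by (hypothetical) model identifiers.
MODEL_SPECIFIC = {
    "qwen2.5-coder": (
        "Ensure the code runs as valid PySpark DataFrame operations, not standard "
        "Python, and verify its execution within Spark."
    ),
}


def build_prompt(model: str, user_query: str) -> str:
    """Combine the shared prompt, any model-specific addition, and the query."""
    parts = [BASE_PROMPT, MODEL_SPECIFIC.get(model, ""), f"Query: {user_query}"]
    return "\n".join(part for part in parts if part)


print(build_prompt("qwen2.5-coder", "Count the rows per city."))
print(build_prompt("codestral", "Count the rows per city."))
```

Keeping the additions in a lookup table makes it straightforward to evaluate a new model with the unmodified base prompt first, and to add an instruction only if a systematic issue appears.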
5.3.2 Future Work
Future work will focus on enhancing the code generation process, which lies at the
core of this thesis’s study, as a key to enabling effortless end-user query submission.
An essential priority will be the fine-tuning of Codestral and Qwen 2.5 Coder, with the
aim of achieving even higher performance. Should new large language models emerge
that demonstrate comparable or superior efficiency, they may also be considered for
evaluation. The primary objective of the fine-tuning process will be to improve automation
levels by minimizing the generation of unnecessary text, and ensuring that final results are
consistently stored in a designated variable recognized by the data processing platform,
as implemented in this study.
Although both models demonstrated high accuracy in producing error-free outputs
and correct final results, fine-tuning is expected to further improve these outcomes. Ul-
timately, a fine-tuned offline model, fully offloaded to a more powerful GPU, could yield
better performance across nearly all evaluation metrics in a follow-up study of this thesis.
However, it is essential that appropriate safeguards are put in place during fine-tuning to
prevent the introduction of bias, thereby preserving the model’s ability to generate reliable
and generalizable outputs across various data contexts.
Beyond the proposed improvements, future work may also investigate hybrid infras-
tructure configurations for the framework. Hybrid models combine offline processing with
private cloud environments, allowing the system to take advantage of scalable computa-
tional resources while maintaining the privacy advantages inherent to offline setups. This
approach could support more extensive model optimization and faster development iter-
ations. However, adopting such a hybrid architecture would introduce additional costs,
as it requires investment in both high-performance offline hardware and scalable cloud
infrastructure, alongside the added complexity of ongoing system maintenance. This,
along with the necessity of constant integration with a scalable data management tool,
would make the final system even more complex.
In parallel, expanding the proposed framework to include a dedicated user interface
(UI) could significantly enhance its practical value. Currently, the research framework
(illustrated in Fig. 3.5) is designed primarily to evaluate model capabilities within a
controlled environment. Transitioning toward a production-ready solution will require the
development of a user-friendly interface that supports both technical and non-technical
users. Such an interface would promote real-world deployment and usability. Apart from
the development of a UI, future work should consider integrating additional features and
evaluation components needed to position this study as a comprehensive enterprise-ready
solution. The model-agnostic structure and extensible design of the framework can offer
a clear path for evolving the current prototype into a fully functional platform, which will
finally be suitable for enterprise applications.
Chapter 6
Future Use Case: Real-Time Indoor Localization
Frameworks
The following research has been published in [5]. It is included here with occasional
adaptations to maintain consistency within the context of this dissertation. It serves
as a potential future use case scenario, for the application of this thesis’s scalable and
AI-powered data management and analysis framework.
6.1 Indoor Localization in Today’s Applications
6.1.1 Location Information
Location information is a highly valuable asset, widely utilized across various busi-
nesses, organizations, and applications. Typically, a user’s location dataset is composed
of their geographic positions, commonly referred to as ‘coordinates’, which include latitude
and longitude values. These individual records can be aggregated into larger datasets,
containing the positions of multiple individuals, the coordinates of structures (such as
buildings), or notable locations (such as monuments) [138]. In addition, location datasets
may also include elevation and altitude details, offering further context to end-users.
The sources of location data vary depending on the type and specific requirements
of each application. A primary source is the Global Positioning System (GPS), which
determines latitude and longitude by facilitating communication between satellites and
user devices. Other sources include Beacons, devices that transmit low-energy Bluetooth
signals, as well as Wi-Fi networks, where devices emit probes while searching for access
points. Beacons and Wi-Fi are particularly useful for determining location in indoor
environments, whereas GPS is predominantly employed for positioning across larger,
primarily outdoor areas.
Both types of location data, outdoor and indoor, are widely utilized by a range of ser-
vices. GPS, in particular, is well known for providing positioning information to millions
of users daily, assisting them in reaching destinations they are unfamiliar with. Modern
applications on smart devices frequently request user permission to collect location data,
and increasingly, websites also prompt users to share their location information. These
trends highlight the growing value and significance of location data across multiple indus-
tries. Nevertheless, indoor localization and nearby environment positioning technologies
are not as widely recognized among consumers. Products such as Tracker Bluetooth
Beacons [139], including examples like Tiles and AirTags, have helped raise awareness
about Bluetooth technology’s role in generating and providing location data.
Despite this, a large portion of end-users remain unfamiliar with the potential benefits
of indoor positioning data. The advantages of GPS coordinates are readily understood
through the widespread use of navigation applications and web-based maps. In contrast,
similar, easily accessible examples showcasing the benefits of indoor geolocation data
are lacking from the consumer perspective. This gap can be attributed to the absence
of mainstream applications, comparable to navigation tools, that leverage indoor location
information to deliver practical services. Developing such applications would represent a
major advancement in location-based services and could offer considerable value across
various sectors of modern society.
As the adoption of indoor positioning applications continues to expand, the volume of
generated data is also expected to increase. Effectively retrieving and managing indoor
localization information is essential for conducting in-depth analysis and understand-
ing how this data can be utilized. Indoor positioning can assist critical infrastructures
and businesses in gaining insights into customer behavior, thereby enhancing marketing
strategies and optimizing resource allocation. For instance, the healthcare sector can
benefit significantly by improving the tracking of medical equipment and patients within
facilities, ultimately reducing delays in urgent situations. Indoor location services hold
the potential to transform a wide range of industries [140]. Previous research has fur-
ther explored the diverse applications of indoor positioning systems [141], emphasizing
their transformative impact across various sectors and the corresponding growth in data
generation.
6.1.2 Indoor Localization as an Area Network
To encourage broader adoption of indoor localization applications in today’s market,
and consequently enhance data generation and availability, the research community
should focus on promoting the concept of a simplified indoor localization architecture.
Specifically, indoor positioning can be viewed as a domain of location data exchange
between different entities, operating similarly to area networks. Various definitions exist
for area networks, but generally, they are considered telecommunications networks that
connect devices within a specific geographic boundary. The scale of an area network can
vary widely, from a single room to an entire city, depending on its type and intended
purpose. Common types include Local Area Networks (LAN), Wide Area Networks (WAN),
Metropolitan Area Networks (MAN), and Personal Area Networks (PAN) [142, 143, 144].
Each type of area network is designed to optimize communication and resource shar-
ing within its respective scope, addressing the particular needs of connected devices and
users. This principle can similarly be applied to the fields of indoor localization and
positioning. Consequently, indoor localization can be conceptualized as a new form of
area network. This article proposes the introduction of the ‘Transactional Area Network’
(TAN). The term ‘Transactional’ has been selected to emphasize that data exchanges between
users can be interpreted as forms of transactions. By defining TAN and outlining its
key characteristics, future indoor localization systems could adopt a unified architectural
model. This, in turn, could accelerate the adoption of such applications and significantly
boost data generation efforts.
The term ‘Transactional Area Network’ does not currently have a definition in the
global scientific literature. It is introduced and conceptually framed within the context
of this article. The idea of Transactional Area Networks emphasizes the need to expand
localized networking solutions that address the unique challenges of indoor environments.
This approach seeks to encourage the development of new technologies and methodologies
aimed at achieving simple yet effective implementations, with a particular focus on user
data exchange and generation. Through this proposal, the article aspires to stimulate
further research and innovation in the field of indoor localization, ultimately enhancing
user experiences within indoor spaces.
As previously outlined, the proposed concept of a Transactional Area Network could
provide a foundation for future indoor localization frameworks and applications. The
following sections of the article review existing research in indoor positioning and lo-
calization (‘Literature Review’), introduce the conceptual framework of the Transactional
Area Network (‘TAN Conceptualization’), present a basic implementation of the TAN model
(‘Proof-of-Concept Implementation’), and conclude with a discussion of findings and key
insights (‘Conclusions’).
6.2 Literature Review
6.2.1 RSS, PDR and Filtering Techniques
To date, a substantial number of studies and publications have focused on indoor ge-
olocation, localization, and positioning systems. Numerous implementations have been
explored, and a variety of frameworks have been introduced, each addressing different
aspects of proximity-based positioning services. Before delving into the specific contribu-
tions of this work, it is essential to review some of the most prominent existing solutions
and frameworks in the domain of indoor localization.
A widely recognized approach was presented by Zhuang et al., 2016 [140], which intro-
duced a smartphone-based indoor localization system supported by a Bluetooth Beacon
network deployed in enclosed environments. This method leveraged the features of Blue-
tooth Low Energy (BLE) and Received Signal Strength (RSS) measurements for location
detection, relying on a centralized architecture composed of a smart device, along with
multiple beacons placed throughout the area. The study proposed a hybrid localization
algorithm that integrated a channel-specific Polynomial Regression Model (PRM), channel-
specific Fingerprinting (FP), outlier detection techniques, and Extended Kalman Filtering
(EKF).
The outlined techniques are frequently employed in similar indoor localization ap-
proaches. Polynomial Regression [145] and Fingerprinting [146] are commonly combined
to estimate both the position of a user, and the distances between the user and surrounding
Bluetooth Low Energy Beacons. Additionally, Outlier Detection and the Extended
Kalman Filter [147] are used to enhance the quality of localization data, particularly in
terms of distance and position accuracy. Although the proposed algorithm demonstrated
significant improvements in localization precision, especially in environments equipped
with tracker BLE beacons, its reliance on hardware beyond standard consumer-grade
smart devices (i.e., requiring BLE beacon deployment) posed limitations for broad adop-
tion. Furthermore, the solution did not address the possibility of enabling users to access
or view the presence and information of others within the same system.
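To give an intuition for how filtering improves noisy signal-strength data, the sketch below applies a scalar (one-dimensional) Kalman filter to a short series of RSS readings. This is a deliberately simplified stand-in and not the Extended Kalman Filter of the cited study: it assumes a nearly constant true signal level, and the noise variances and the sample readings are illustrative assumptions.

```python
def kalman_smooth(measurements, process_var=1e-3, meas_var=4.0):
    """Smooth a sequence of readings with a scalar Kalman filter.

    Assumes a nearly constant underlying value; both variances are
    illustrative tuning parameters, not values from the cited work.
    """
    estimate, error = measurements[0], 1.0
    smoothed = [estimate]
    for z in measurements[1:]:
        error += process_var                # predict: uncertainty grows slightly
        gain = error / (error + meas_var)   # update: weight given to the new reading
        estimate += gain * (z - estimate)
        error *= 1 - gain
        smoothed.append(estimate)
    return smoothed


# Hypothetical RSS readings in dBm, including one outlier (-85).
rss = [-70.0, -72.0, -69.0, -85.0, -71.0, -70.0, -68.0, -71.0]
smoothed = kalman_smooth(rss)

for raw, value in zip(rss, smoothed):
    print(f"raw {raw:6.1f}  smoothed {value:6.2f}")
```

The filtered series reacts only partially to the outlier, which is the property that makes such filters useful before RSS values are converted into distance or position estimates.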
Another notable contribution in the field is presented by Chen et al. [148], who
developed a smartphone-based indoor localization and tracking system, utilizing iBea-
cons. The study aimed to enhance several key aspects of Pedestrian Dead Reckoning
(PDR) on smartphones, including step detection, walking direction estimation, and initial
position determination. By combining smartphone-integrated sensors with BLE beacon
infrastructure, the proposed method achieved high-precision object localization. The sys-
tem was built upon Apple’s iBeacon technology and applied filtering techniques (such
as Kalman Filtering) to optimize the accuracy of tracking data. User positions were up-
dated dynamically whenever they entered a defined calibration zone with a three-meter
range. This work effectively demonstrated the feasibility of using smartphones as central
components in indoor positioning systems. However, it is important to note that its core
functionality still depended on external BLE beacon hardware.
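The core idea of Pedestrian Dead Reckoning with beacon-based calibration can be sketched in a few lines. The example below is a simplified illustration and not the method of the cited work: the fixed step length of 0.7 m, the heading convention (clockwise from north), and the correction rule that snaps the position to the beacon's known coordinates are all assumptions. Only the three-meter calibration range is taken from the description above.

```python
import math


def pdr_step(position, heading_rad, step_length=0.7):
    """Advance an (x, y) position by one detected step along the heading.

    The heading is measured clockwise from north (the +y axis);
    the 0.7 m step length is an illustrative assumption.
    """
    x, y = position
    return (
        x + step_length * math.sin(heading_rad),
        y + step_length * math.cos(heading_rad),
    )


def calibrate(position, beacon_position, distance_to_beacon, zone_radius=3.0):
    """Reset the position to the beacon's known location inside its calibration zone."""
    if distance_to_beacon <= zone_radius:
        return beacon_position
    return position


# Four steps north, then one step east, starting from the origin.
position = (0.0, 0.0)
for heading in [0.0, 0.0, 0.0, 0.0, math.pi / 2]:
    position = pdr_step(position, heading)

# Hypothetical beacon at (1.0, 3.0), estimated to be 0.4 m away.
corrected = calibrate(position, (1.0, 3.0), distance_to_beacon=0.4)
print(position, corrected)
```

Because step-by-step estimates accumulate drift, periodic calibration against fixed reference points is what keeps the tracked position usable over longer walks.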
In a study conducted by Zou et al. [149] in 2017, the authors proposed an Indoor
Positioning System (IPS) designed for navigation and tracking, utilizing the built-in sensors
of smartphones, such as accelerometers, gyroscopes, and magnetometers, to
estimate users’ movement and direction. The system was based on the Pedestrian Dead
Reckoning (PDR) method and incorporated a particle filter-based fusion technique. In
areas with limited Wi-Fi coverage, the weight of the particles was determined using iBea-
con measurements, while in areas with stronger Wi-Fi signals, the position estimates
were derived from Wi-Fi-based data. The proposed system demonstrated high accuracy
in estimating user movement and direction. However, it required external infrastructure,
specifically a modified Wi-Fi setup, beyond the use of standard consumer smartphones.
It is also worth noting that the study focused on tracking user movement within indoor
environments rather than enabling direct peer-to-peer positioning among users.
Another significant contribution was made by Yadav et al. [150] in 2019, who ex-
plored the implementation of an IPS using a combination of iBeacon technology, Inertial
Measurement Units (IMUs), and Fingerprinting techniques. Instead of relying on the
commonly used k-Nearest Neighbors (kNN) algorithm in their map matching process, the
authors adopted Bayesian estimation to probabilistically determine the most likely ref-
erence points corresponding to the user’s position. The proposed framework integrated
both BLE beacon data and PDR to enhance localization accuracy. Additionally, the incor-
poration of a fuzzy logic-based Kalman Filter further improved the quality of positioning
results. Although the approach produced promising results in terms of indoor localization
accuracy, it required access to an external database, thereby introducing a dependency
on additional hardware beyond the smart device itself.
6.2.1 RSS, PDR and Filtering Techniques
An important contribution was made by Dinh et al., 2020 [151], who proposed a
smartphone-based indoor positioning system that utilized BLE iBeacons and Fingerprint-
ing techniques. The system was built upon the principles of Pedestrian Dead Reckoning
[152], which - as already mentioned - estimates a user’s location by leveraging iner-
tial sensors such as accelerometers and gyroscopes. By continuously integrating step
counts and rotational movements, the system was able to compute the user’s displace-
ment and orientation from an initial reference point, enabling indoor navigation without
dependence on external signals like GPS. The authors structured their approach around
two core components. Firstly, they introduced a regression-based method for estimating
distances using Bluetooth Low Energy Received Signal Strength (BLE-RSS). This model
was designed to correlate the Received Signal Strength Indicator (RSSI) values with the
approximate distance between the beacon and the user’s device, providing a practical
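means of enhancing localization accuracy. Specifically, the model is based on the logarithmic
least squares method, where the RSSI value (RSSI) at a distance $\lambda$ is given by the
equation:
\[ \mathrm{RSSI} = \alpha + \beta \ln(\lambda) \tag{6.1} \]
Here, $\alpha$ and $\beta$ are coefficients that are determined through the least squares fitting
of the collected RSSI data under both line-of-sight (LOS) and non-line-of-sight (NLOS)
conditions. These coefficients are calculated as follows:
\[ \beta = \frac{n \sum_{i=1}^{n} (\mathrm{RSSI}_i \ln \lambda_i) - \sum_{i=1}^{n} \mathrm{RSSI}_i \sum_{i=1}^{n} \ln \lambda_i}{n \sum_{i=1}^{n} (\ln \lambda_i)^2 - \left( \sum_{i=1}^{n} \ln \lambda_i \right)^2} \tag{6.2} \]
\[ \alpha = \frac{\sum_{i=1}^{n} \mathrm{RSSI}_i - \beta \sum_{i=1}^{n} \ln \lambda_i}{n} \tag{6.3} \]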
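As an illustration of Equations 6.1 to 6.3, the following Swift sketch fits the two coefficients to a set of calibration samples and then inverts the model to obtain a distance from an RSSI reading. The function names and the synthetic sample values are assumptions made for this example and do not come from Dinh et al.

```swift
import Foundation

/// Fits RSSI = alpha + beta * ln(distance) by ordinary least squares (Eq. 6.2 and 6.3).
/// Returns nil for mismatched input, fewer than two samples, or identical distances.
func fitLogDistanceModel(distances: [Double], rssi: [Double]) -> (alpha: Double, beta: Double)? {
    guard distances.count == rssi.count, distances.count >= 2 else { return nil }
    let n = Double(distances.count)
    let logs = distances.map { log($0) }
    let sumLog = logs.reduce(0.0, +)
    let sumRssi = rssi.reduce(0.0, +)
    let sumLogRssi = zip(logs, rssi).map { $0.0 * $0.1 }.reduce(0.0, +)
    let sumLogSquared = logs.map { $0 * $0 }.reduce(0.0, +)
    let denominator = n * sumLogSquared - sumLog * sumLog
    guard abs(denominator) > 1e-12 else { return nil }
    let beta = (n * sumLogRssi - sumRssi * sumLog) / denominator
    let alpha = (sumRssi - beta * sumLog) / n
    return (alpha: alpha, beta: beta)
}

/// Inverts Eq. 6.1 to turn an RSSI reading into a distance estimate.
func estimateDistance(rssi: Double, alpha: Double, beta: Double) -> Double {
    return exp((rssi - alpha) / beta)
}

// Usage with synthetic, noise-free samples generated from alpha = -59, beta = -8.
let sampleDistances = [1.0, 2.0, 4.0, 8.0]
let sampleRssi = sampleDistances.map { -59.0 - 8.0 * log($0) }
if let model = fitLogDistanceModel(distances: sampleDistances, rssi: sampleRssi) {
    print(model.alpha, model.beta)
    print(estimateDistance(rssi: -70.0, alpha: model.alpha, beta: model.beta))
}
```

With real measurements the fit would be run separately for LOS and NLOS data, as the text describes, which yields one coefficient pair per condition.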
In the initial phase of position estimation, once the approximate distances are deter-
mined using the regression model, the system proceeds with a calibration step. During
this process, the user is asked to remain stationary for a few seconds. While the user
stays still, the system collects BLE signals from the three closest beacons. These sig-
nals are then used to estimate the user’s initial location by applying a combination of
Line Intersection-Based Trilateration and a median filtering technique. The trilateration
approach involves solving a set of equations based on the calculated distances from the
nearby beacons:
\[ (x - x_i)^2 + (y - y_i)^2 = \lambda_i^2 \tag{6.4} \]
where $(x_i, y_i)$ are the coordinates of the beacons and $\lambda_i$ are the estimated distances from
the regression model. Simplifying these equations leads to:
\[ x^2 + y^2 + a_i x + b_i y + c_i = 0 \tag{6.5} \]
where $a_i = -2x_i$, $b_i = -2y_i$ and $c_i = x_i^2 + y_i^2 - \lambda_i^2$. By solving these equations, the initial
position $(x, y)$ is determined.
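One common way to solve Equation 6.5 for three beacons is to subtract the equations pairwise, which cancels the quadratic terms and leaves two straight lines whose intersection is the position. The Swift sketch below follows that approach; it is an assumption about how the line intersection step can be realised, not a reproduction of the authors' code, and `BeaconRange` and `trilaterate` are hypothetical names.

```swift
import Foundation

/// A beacon with known coordinates and an estimated distance to the user (metres).
struct BeaconRange {
    let x: Double
    let y: Double
    let distance: Double
}

/// Estimates (x, y) from three beacons by intersecting the two lines obtained
/// when the circle equations (Eq. 6.5) are subtracted pairwise.
/// Returns nil when the beacons are (nearly) collinear.
func trilaterate(_ b1: BeaconRange, _ b2: BeaconRange, _ b3: BeaconRange) -> (x: Double, y: Double)? {
    // Coefficients a, b, c of x^2 + y^2 + a*x + b*y + c = 0 for one beacon.
    func coefficients(_ beacon: BeaconRange) -> (a: Double, b: Double, c: Double) {
        return (a: -2 * beacon.x,
                b: -2 * beacon.y,
                c: beacon.x * beacon.x + beacon.y * beacon.y - beacon.distance * beacon.distance)
    }
    let k1 = coefficients(b1)
    let k2 = coefficients(b2)
    let k3 = coefficients(b3)
    // Line 1 (beacon 1 minus beacon 2): lineA1*x + lineB1*y = lineC1; line 2 uses beacon 3.
    let lineA1 = k1.a - k2.a, lineB1 = k1.b - k2.b, lineC1 = k2.c - k1.c
    let lineA2 = k1.a - k3.a, lineB2 = k1.b - k3.b, lineC2 = k3.c - k1.c
    let determinant = lineA1 * lineB2 - lineA2 * lineB1
    guard abs(determinant) > 1e-9 else { return nil }
    // Cramer's rule for the 2x2 linear system.
    return (x: (lineC1 * lineB2 - lineC2 * lineB1) / determinant,
            y: (lineA1 * lineC2 - lineA2 * lineC1) / determinant)
}

// Usage: beacons at (0,0), (4,0) and (0,3); distances measured from the point (1,1).
let fix = trilaterate(BeaconRange(x: 0, y: 0, distance: 2.0.squareRoot()),
                      BeaconRange(x: 4, y: 0, distance: 10.0.squareRoot()),
                      BeaconRange(x: 0, y: 3, distance: 5.0.squareRoot()))
if let fix = fix { print(fix.x, fix.y) }
```

With noisy distances the three circles no longer meet in a single point, so the result is only an approximation, which motivates the additional filtering step.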
Finally, to improve the accuracy of the initial position estimate, a median filter is ap-
plied to minimize the impact of outlier values. Directional information is obtained from
the mobile device’s built-in magnetometer, which is accessed via the device’s operating
system Software Development Kit (SDK), such as those provided by iOS [153] and Android
[154]. This sensor data is used to determine the direction of the user’s movement. Since
precise distance estimation is essential for indoor positioning, the method used in this
study enabled accurate identification of the starting point. Additionally, the work by Dinh
introduced an effective and dependable radio map that uses fingerprint-based positioning
to correct deviations caused by Pedestrian Dead Reckoning. Instead of relying on high-
resolution grids with a large number of reference points, Dinh’s approach collects data
from a smaller set of points, thereby reducing the overall deployment time. Although it
presents a sophisticated solution to challenges such as efficiency and hardware require-
ments seen in earlier systems, one limitation of the method is its reliance on external
beacon devices.
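The median filtering step mentioned above can be sketched as follows. The source does not state which quantity the median is taken over; this example assumes it is applied coordinate-wise to repeated position fixes collected while the user stands still, and the function names are hypothetical.

```swift
/// Returns the median of the given values, or nil for an empty array.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    if sorted.count % 2 == 0 {
        return (sorted[mid - 1] + sorted[mid]) / 2
    }
    return sorted[mid]
}

/// Takes the median of the x and y coordinates independently, so that a single
/// outlying fix cannot pull the initial position away from the cluster.
func medianPosition(_ fixes: [(x: Double, y: Double)]) -> (x: Double, y: Double)? {
    guard let mx = median(fixes.map { $0.x }),
          let my = median(fixes.map { $0.y }) else { return nil }
    return (x: mx, y: my)
}

// Usage: five fixes around (1, 1), one of them a clear outlier.
let fixes: [(x: Double, y: Double)] = [
    (x: 1.0, y: 1.0), (x: 1.1, y: 0.9), (x: 9.0, y: -4.0), (x: 0.9, y: 1.1), (x: 1.0, y: 1.2)
]
if let start = medianPosition(fixes) { print(start.x, start.y) }
```

Unlike a mean, the median ignores the outlying fix at (9, -4) entirely, which is the behaviour the calibration step relies on.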
Another significant contribution was made by Tuan et al., 2021 [155], who proposed
a hybrid tracking framework that integrates PDR with Wi-Fi and iBeacon signal data.
Their approach is organized into three main components. First, the authors designed
conversion functions that translate Wi-Fi and iBeacon signals into estimated proximity
distances. To support indoor tracking, they also suggested a strategic deployment method
for placing iBeacons within designated areas. Second, an enhanced version of the PDR
method was introduced for use on smartphones, which leveraged signal data to estimate
an initial position and refine the user’s path during movement. Finally, the proposed
algorithm was implemented in an iOS application to showcase its functionality. Despite
its promising design, the system relies on external beacon infrastructure, while also not
addressing the detection of multiple users within the environment.
Efforts to further improve the precision of such positioning systems have also been
explored. For instance, Elgui et al., 2020 [156] focused on enhancing the reliability and
accuracy of BLE RSSI signal measurements. Additionally, Duong et al., 2021 [157] ex-
amined the performance of indoor localization systems based on BLE-RSS fingerprinting
under static conditions. Their study was conducted in two stages: an ‘offline’ phase,
during which RSSI data were collected, and an ‘online’ phase, which used real-time measurements
combined with the previously gathered data to estimate user positions. The
effectiveness of their system was assessed across multiple real-world settings, with particular
attention to parameters such as the ideal value of k in the k-Nearest Neighbors
(kNN) algorithm, and the optimal number of beacons required for accurate positioning.
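The two-phase fingerprinting scheme described above can be sketched in Swift as follows. This is a generic kNN matcher under stated assumptions (Euclidean distance in signal space, unweighted averaging of the k nearest reference points), not the exact procedure evaluated by Duong et al.; `Fingerprint` and `knnLocate` are hypothetical names.

```swift
import Foundation

/// One reference point recorded during the offline phase.
struct Fingerprint {
    let x: Double
    let y: Double
    let rssi: [Double]   // one RSSI value per beacon, in a fixed beacon order
}

/// Online phase: averages the coordinates of the k reference points whose stored
/// RSSI vectors are closest (Euclidean distance) to the live reading.
func knnLocate(reading: [Double], radioMap: [Fingerprint], k: Int) -> (x: Double, y: Double)? {
    let candidates = radioMap.filter { $0.rssi.count == reading.count }
    guard k > 0, !candidates.isEmpty else { return nil }
    let scored = candidates.map { fp -> (fp: Fingerprint, d: Double) in
        let squared = zip(fp.rssi, reading).map { ($0.0 - $0.1) * ($0.0 - $0.1) }
        return (fp: fp, d: sqrt(squared.reduce(0.0, +)))
    }
    let nearest = scored.sorted { $0.d < $1.d }.prefix(k)
    let count = Double(nearest.count)
    return (x: nearest.map { $0.fp.x }.reduce(0.0, +) / count,
            y: nearest.map { $0.fp.y }.reduce(0.0, +) / count)
}

// Usage: a three-point radio map along one corridor and a live reading from two beacons.
let radioMap = [
    Fingerprint(x: 0, y: 0, rssi: [-50, -70]),
    Fingerprint(x: 2, y: 0, rssi: [-60, -60]),
    Fingerprint(x: 4, y: 0, rssi: [-70, -50]),
]
if let estimate = knnLocate(reading: [-58, -62], radioMap: radioMap, k: 2) {
    print(estimate.x, estimate.y)
}
```

The choice of k trades noise robustness against spatial resolution, which is why the study treats it as a parameter to be tuned per environment.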
6.2.2 Machine Learning and Ultra Wideband
Another indoor localization system that employed Machine Learning and Deep Learn-
ing techniques was presented by Abbas et al., 2019 [158]. The researchers developed
a system aimed at delivering high accuracy and reliability, even in environments with
significant signal noise. Their approach combined a deep learning model with denoising
autoencoders, integrated within a probabilistic framework designed to manage noise in
received Wi-Fi signals. The objective was to capture the complex relationship between
signals from Wi-Fi access points, and the device’s actual location.
Further investigation into enhancing indoor localization through Machine Learning
was carried out by Njima et al., 2022 [159]. In their study, the researchers addressed
the challenge of limited datasets by applying Generative Adversarial Networks (GANs) for
data augmentation. Their work focused on scenarios involving missing indoor localization
data, distinguishing between cases with available unlabeled data and cases where no un-
labeled data were present. In the first scenario, they proposed a weighted semi-supervised
approach based on deep neural networks, incorporating pseudo-labeling techniques that
combined a small number of real labeled samples, with a larger set of low-cost pseudo-
labeled data. This method improved accuracy, while minimizing the need for additional
labeled data. In the second, more constrained scenario, where no unlabeled data were
available, they introduced a solution that generated synthetic fingerprints using GANs. A
deep neural network was then trained on a dataset composed of both real and generated
samples, applying a similar weighting strategy as in the first case, aiming to enhance
prediction performance and reduce overfitting.
Several survey studies have also been carried out to compile and analyze the majority
of indoor localization methods and use cases. One such example is the work by Yang
et al., 2021 [160], which explored key aspects such as accuracy, scalability, stability,
reliability, and algorithmic complexity across a wide range of existing and emerging lo-
calization techniques. The survey provides an in-depth comparison of these methods,
offering valuable insights into practical implementation models, and aiming to improve
localization accuracy while reducing system complexity. Other notable surveys address-
ing similar themes have been published by Zafari et al. [161], Jang et al. [162], Subedi et
al. [163], and Mallik et al. [164].
Ultra-Wideband (UWB) technology has also been explored in various indoor localiza-
tion applications. Notably, Apple has adopted UWB to deliver accurate distance estimation
between its devices. Specifically, the U1 chip developed by Apple enables high-precision
distance measurements between compatible devices [165]. However, a key limitation is
that only devices equipped with a ‘U’ series chip can support UWB-based localization.
Future advancements by Apple and other manufacturers may aim to overcome this restriction.
A notable contribution to the field was made by Prorok and Martinoli in 2013
[166], who proposed a UWB localization technique designed to address error patterns
through spatial and multimodal modeling. Their method employed tessellated maps and
incorporated relative positioning within a particle filter framework, in order to enhance
accuracy. Tests involving mobile robots equipped with UWB transmitters demonstrated
the system’s ability to achieve precise indoor localization.
A notable and frequently cited work in the Ultra-Wideband domain is the survey by
Alarifi et al., 2016 [167]. This study presented a comprehensive overview of the in-
door positioning technologies available at the time, with a particular focus on UWB. It
offered a detailed comparative analysis, including a ‘SWOT’ (Strengths, Weaknesses, Opportunities,
Threats) assessment of UWB, emphasizing its advantages and potential for
addressing indoor localization challenges. The survey also introduced taxonomies and
summarized recent advancements, while suggesting directions for future research.
More recent developments include a proposal by Mayer et al., 2023 [168], which in-
troduced a self-sustaining indoor localization system based on UWB technology for smart
mobile sensor nodes. These nodes were designed for compactness and long-term op-
eration, reducing the need for frequent battery replacement. The system employed an
event-driven sensing mechanism to manage energy consumption efficiently, and inte-
grated heterogeneous design elements with onboard processing. As a result, it achieved
high localization accuracy with minimal infrastructure requirements.
Additionally, Hapsari et al., 2024 [169], conducted a review focused on the evolution
of UWB-based indoor tag localization technologies from 2017 to 2022, covering appli-
cations in positioning, tracking, and navigation. Their study employed both Systematic
Mapping Study (SMS) and Systematic Literature Review (SLR) methodologies to explore
trends, challenges, and research opportunities. Key topics included the use of machine
learning, filtering techniques, and sensor fusion, along with performance metrics such as
accuracy, scalability, and energy efficiency. The paper also proposed a taxonomy of op-
timized metrics and outlined future research directions, positioning itself as an updated
continuation of Alarifi’s foundational survey of 2016.
The wide variety of indoor positioning solutions discussed in the outlined literature
offers several benefits to adopters. However, challenges such as hardware dependencies,
infrastructure demands, environmental limitations, and potential concerns regarding
energy consumption emphasize the ongoing need for further research in this domain.
A Transactional Area Network (TAN) could provide a standardized and unified framework,
capable of integrating diverse methods and technologies. TAN may serve as a founda-
tional policy for developing new solutions, experimenting with cutting-edge algorithms,
and advancing the broader objective of decentralized indoor positioning. Additionally, it
could promote increased data exchange, while also facilitating greater data generation
within indoor environments.
6.3 TAN Conceptualization
6.3.1 Scope and Characteristics
In order to support the development of future indoor localization systems within a
unified architectural framework, the Transactional Area Network should be built upon
a core concept and a defined set of key characteristics that shape its functionality and
purpose. Any new indoor positioning solution could align with the principles of TAN and,
as a result, be classified under this model. The rationale behind TAN has been briefly
introduced in the opening section of this chapter. First, there is currently no widely
adopted indoor localization standard. Second, many existing systems lack support for
peer-to-peer interactions, limiting users from sharing data, or exchanging relative location
information. This absence of interconnectivity contributes to lower user familiarity with
indoor localization tools, when compared to widely used GPS-based services. Third, these
factors result in a general scarcity of indoor positioning data. Future developers should
consider these challenges when designing new applications. The introduction of TAN
directly addresses these gaps, aiming to promote broader adoption and innovation in the
field.
Figure 6.1. High-level overview of the Transactional Area Network’s implementation within
an indoor environment [5].
As discussed in the Introduction section of this chapter, the term ‘Transactional Area
Network’ has not yet been formally defined within the global research community. The
word ‘Transactional’ is used in this context to reflect the nature of indoor localization,
which facilitates interactions, mutual influence, and the exchange of data between
entities. Indoor positioning systems inherently depend on information exchange among
multiple devices, rather than functioning as isolated units. This continuous interaction
among entities can be understood as a form of transaction, thereby justifying the use of
the word ‘Transactional’. Furthermore, indoor localization can be classified as an ‘Area
Network’ because it connects devices operating within a confined spatial region—namely,
indoor environments. Based on these considerations, this work adopts the term ‘Trans-
actional Area Network’ to describe the proposed framework. Accordingly, TAN can be
defined as follows: “A Transactional Area Network is a computer network that inter-
connects devices within indoor locations, aiming to enhance entity interaction, mutual
influence and data exchange”.
Figure 6.2. A focused overview of the user communication in a TAN, showcasing the data-
exchange pipelines between two users, as well as the existence of external hardware for
optional assistance [5].
Based on the proposed definition, a Transactional Area Network should support
streamlined and decentralized indoor positioning processes, allowing end-users to benefit
directly from its functionalities. Within this network, each exchange of location or other
data between two users is referred to as a ‘transaction’. Accordingly, every successful
transmission and reception of information constitutes a transaction, with each
new data exchange representing a distinct event. To promote widespread adoption, it is
important to limit the reliance on additional hardware. Although TAN may incorporate
external devices, such as Bluetooth beacons (refer to Figures 6.1 and 6.2), the network
should primarily operate through users’ personal devices. As TAN becomes more widely
adopted, the volume of generated data will naturally grow. Consequently, the network
must be capable of forwarding this data to an external database located beyond the local
environment.
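To make the notion of a transaction more concrete, the following Swift sketch models one exchange as a record that can be serialized for forwarding to an external database. The thesis does not prescribe a schema; every field name here, as well as the choice of JSON, is an assumption made for illustration.

```swift
import Foundation

/// A hypothetical record for one TAN transaction: a single successful exchange
/// of location or other data between two users.
struct Transaction: Codable, Equatable {
    let senderID: String
    let receiverID: String
    let estimatedDistance: Double   // metres
    let timestamp: Date
    let payload: String?            // optional message or profile data
}

/// Serializes a batch of transactions to JSON for forwarding to external storage.
func encodeBatch(_ batch: [Transaction]) throws -> Data {
    let encoder = JSONEncoder()
    encoder.dateEncodingStrategy = .secondsSince1970
    return try encoder.encode(batch)
}

/// Decodes a batch produced by `encodeBatch`.
func decodeBatch(_ data: Data) throws -> [Transaction] {
    let decoder = JSONDecoder()
    decoder.dateDecodingStrategy = .secondsSince1970
    return try decoder.decode([Transaction].self, from: data)
}

// Usage: encode one transaction and read it back.
let sample = Transaction(senderID: "user-a",
                         receiverID: "user-b",
                         estimatedDistance: 3.5,
                         timestamp: Date(timeIntervalSince1970: 1_700_000_000),
                         payload: nil)
let encoded = try encodeBatch([sample])
print(encoded.count, "bytes")
```

Keeping each exchange as a separate record matches the definition above, in which every new data exchange is a distinct event.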
Figure 6.1 illustrates a conceptual representation of a TAN’s operational structure,
while Figure 6.2 focuses on the communication flows among users within the network.
Drawing from the preceding discussion, the core characteristics of a Transactional Area
Network can be summarized as follows:
Peer-to-Peer Design: The network should emphasize direct data exchange between
users, supporting a decentralized architecture.
Minimal Hardware: To maintain a decentralized structure, the network should pri-
marily rely on users’ personal devices, such as smartphones and smartwatches.
External components, like Bluetooth beacons, should only be employed when addi-
tional positioning information is deemed necessary.
Data Collection: Positioning data generated by each user should be collected and
forwarded to an external storage system, located outside the local network environ-
ment.
Data Comprehensibility: As indoor positioning data holds value primarily within its
respective context, the gathered information should be interpretable and capable of
offering useful insights about the environment.
Before delving into the technical aspects of a user’s device operating within the Transactional
Area Network, it was essential to first establish the foundational criteria that define TAN’s
design. These core characteristics provide the groundwork for specifying how a user’s
device should operate in this context. Key functionalities include acting as a virtual
beacon for estimating distances, initiating peer-to-peer sessions for data exchange, and
pairing signal information with user identifiers to enable accurate and integrated data
handling.
6.3.2 Technical Details
Given the outlined criteria, the implementation of a TAN should primarily utilize the
personal devices of individual users, such as smartphones and smartwatches. The main
objective is to provide users with real-time indoor positioning information about them-
selves and others nearby. Therefore, TAN must support user-to-user localization without
relying on external hardware, like Bluetooth beacons or dedicated trackers. In this con-
text, each user’s device functions as their personal tracker. As previously discussed, the
use of additional hardware should be minimized. However, the TAN architecture may
still support optional external components (such as beacons) that offer supplementary
information, for example, identifying a specific building floor or indoor area. These ele-
ments, while potentially useful, must remain auxiliary and not essential to the network’s
operation. Maintaining TAN’s peer-to-peer design requires that the network primarily
depend on users’ own devices. Accordingly, participation in a TAN should only require
the installation of a lightweight software application. Ideally, such functionality would
be natively integrated into the device’s operating system, similar to existing GPS-based
services. Since the personal device is the main hardware component within a TAN, its
technical capabilities warrant closer examination.
The core indoor positioning functions of a TAN will rely on Bluetooth Low Energy and
Wi-Fi signals. Both iOS and Android, the dominant operating systems in today’s mobile
ecosystem, provide BLE-based solutions for enabling proximity-aware services. Apple’s
iBeacon framework [170] enables interaction between devices using BLE, a wireless
protocol introduced in Bluetooth 4.0 that supports energy-efficient, short-range communication.
iBeacon, launched as part of iOS 7, makes use of the ‘Core Location’ and ‘Core
Bluetooth’ frameworks to identify and interact with BLE signals. On the Android side,
Eddystone [171], developed by Google, provides a similar open-source BLE beacon format.
Eddystone supports broadcasting multiple frame types, allowing transmission of URLs,
telemetry data, and other contextual information to nearby devices. Although Google
officially discontinued support for Eddystone in 2018, the format remains available for
continued development. Notably, iOS devices can operate as beacons themselves [172],
mirroring Android’s ability to leverage Eddystone for creating location-based experiences,
without requiring dedicated beacon hardware.
Regarding the utilization of Wi-Fi signals, frameworks for both iOS and Android platforms
are available. ‘Multipeer Connectivity’ [173] is a framework introduced by Apple
for iOS and macOS with the release of iOS 7. It enables devices to communicate and
exchange data directly over Wi-Fi or Bluetooth, without requiring an internet connection.
The framework supports peer-to-peer communication, allowing nearby devices to form
ad-hoc or mesh networks dynamically, based on available network infrastructure. Mul-
tipeer Connectivity allows for seamless discovery, connection, and interaction between
devices, making it suitable for proximity-based applications. On the Android platform,
‘Nearby Connections API’ [174], developed by Google, provides similar functionality. It
allows Android devices to discover each other and establish connections over Wi-Fi and
Bluetooth, enabling real-time data exchange and collaboration. Nearby Connections API
abstracts much of the underlying complexity involved in device discovery and connection
management, offering a streamlined solution for developers. Additionally, it is worth noting
that, until December 2023, Google maintained support for the ‘Nearby Messages API’
[175], which also allowed Android devices to discover and communicate using a combi-
nation of BLE and Wi-Fi. However, the Nearby Messages API has since been deprecated,
and developers are encouraged to migrate to the Nearby Connections API for continued
functionality and support.
A Transactional Area Network should integrate these operating system-specific frame-
works, in order to broadcast and receive BLE and Wi-Fi signals effectively. The architec-
ture must combine their core functionalities to facilitate the transmission of both signal
types, along with accompanying user data, from one device to another. Consequently,
the indoor localization system within a TAN must encompass, at minimum, the following
key features:
Virtual Beacon Configuration: Each user’s device should operate as a virtual bea-
con, broadcasting key beacon parameters (such as UUID, Major/Minor values, or
Instance ID) via Bluetooth Low Energy (BLE) signals. Simultaneously, it should
scan for signals from nearby devices, and estimate their distance using the received
signal strength indicator (RSSI).
Peer-to-Peer Session Establishment: Devices should initiate ad-hoc, peer-to-peer
communication sessions, using frameworks like Multipeer Connectivity or Nearby
Connections, to enable direct data exchange among nearby devices. This commu-
nication should rely on Wi-Fi signals, with Bluetooth support, and function without
requiring internet access.
Signal Pairing and User Identification: Users within the TAN should be able
to access proximity data and additional information for other users, such
as names and profile images. Since BLE broadcasting and peer-to-peer communi-
cation represent two separate information channels, a robust method is required to
accurately associate signals with users. This ensures that each device can correctly
match (and present) the relevant information for each detected user.
External Database Connectivity: Devices should also support interaction with an
external database, allowing them to transmit all received data for storage or further
processing.
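The signal pairing requirement listed above can be illustrated with a small Swift sketch. It assumes that each peer announces its own Major/Minor pair over the peer-to-peer channel, so that the pair can serve as a join key between the two channels. The thesis text does not specify the pairing mechanism at this point, so this is one possible design, and all type and function names are hypothetical.

```swift
/// Identifier broadcast over BLE by a virtual beacon.
struct BeaconKey: Hashable {
    let major: UInt16
    let minor: UInt16
}

/// Profile data received over the peer-to-peer channel (hypothetical fields).
struct PeerProfile {
    let beaconKey: BeaconKey   // assumed: the peer announces its own Major/Minor pair
    let displayName: String
}

/// A nearby user as presented by the app: profile plus latest distance estimate.
struct NearbyUser {
    let displayName: String
    let distance: Double       // metres
}

/// Joins the BLE ranging results with the peer-to-peer profiles on the shared
/// Major/Minor pair and sorts the matches from nearest to farthest.
func pairSignals(ranged: [BeaconKey: Double], profiles: [PeerProfile]) -> [NearbyUser] {
    let matched = profiles.compactMap { profile -> NearbyUser? in
        guard let distance = ranged[profile.beaconKey] else { return nil }
        return NearbyUser(displayName: profile.displayName, distance: distance)
    }
    return matched.sorted { $0.distance < $1.distance }
}

// Usage: two ranged beacons and three known peers, one of which is out of BLE range.
let ranged: [BeaconKey: Double] = [
    BeaconKey(major: 1, minor: 2): 3.5,
    BeaconKey(major: 7, minor: 8): 1.2,
]
let profiles = [
    PeerProfile(beaconKey: BeaconKey(major: 1, minor: 2), displayName: "Alice"),
    PeerProfile(beaconKey: BeaconKey(major: 7, minor: 8), displayName: "Bob"),
    PeerProfile(beaconKey: BeaconKey(major: 9, minor: 9), displayName: "Carol"),
]
for user in pairSignals(ranged: ranged, profiles: profiles) {
    print(user.displayName, user.distance)
}
```

Peers without a matching beacon signal are dropped rather than shown with an unknown distance; an implementation could equally keep them and mark the distance as unavailable.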
If software deployed on users’ devices were to meet all the specified criteria, TAN
could be effectively implemented in indoor settings, supporting decentralized distance
measurement and data exchange between the entities (see Figure 6.3). Each device would
function as the host of its own peer-to-peer communication session, with other detected
devices acting as guest peers. At the same time, these guest devices would also serve
as hosts for their own sessions. As a result, a Transactional Area Network could allow
individuals to identify others nearby, and potentially foster new social connections. If
implemented, this network could be tested across various indoor applications. Moreover,
it may prove especially valuable in human rescue operations, particularly in life-saving
situations during disasters.
Figure 6.3. A Transactional Area Network operates across multiple devices in indoor
environments. Each device transmits and receives BLE beacon signals, and initiates
peer-to-peer sessions by also broadcasting and receiving Wi-Fi signals [5].
6.3.3 Use Cases
One practical use case of a Transactional Area Network is its ability to help users
determine their location relative to other entities within a confined space, such as indoor
environments or nearby facilities. In this scenario, users would be able to see other in-
dividuals in their vicinity, and track their real-time distance. Additionally, users could
communicate with nearby entities through messages, and identify their positions within
the area using TAN’s distance (and potentially direction) data. This network could en-
hance interpersonal interactions in indoor settings like restaurants, social gatherings,
cafeterias, and more. Thus, TAN would provide real-time proximity information and com-
munication features, fostering social engagement in indoor spaces.
Another important use case for a Transactional Area Network is its potential to as-
sist professionals, such as firefighters and rescue teams, in life-saving operations during
emergencies like earthquakes or floods. In such critical situations, TAN could help rescue
teams navigate debris from collapsed structures and locate trapped individuals. Using
their devices, rescuers could receive real-time data on the survivors’ relative distances
(and ideally directions), which would significantly improve the efficiency and speed of the
rescue process. Furthermore, the network could potentially enable communication be-
tween rescuers and survivors, allowing for the exchange of vital information regarding the
condition and needs of those trapped. This functionality would be especially useful when
traditional communication systems are unavailable, providing a decentralized method
for connecting rescuers and survivors, thereby enhancing the effectiveness of life-saving
efforts.
An additional use case for TAN is monitoring traffic within a company’s premises (such
as buildings or offices), in order to track the number of employees present on a daily
basis. Data engineers can derive valuable insights from this data, including identifying
trends in employee behavior, such as peak lunch and break times. For companies with
multiple locations, the indoor positioning data collected by each device within TAN can be
combined with GPS coordinates from the device. This would enable data engineers (with
access to the TAN database) to associate specific data points with particular premises.
However, additional considerations must be taken into account, such as determining
which floor each user is on at any given time. This could be validated through the use of
external hardware, such as beacons positioned on each floor. While TAN’s primary design
is decentralized, relying mainly on end-user devices, the limited use of external beacons
is acceptable for improving accuracy.
These three use case scenarios highlight the practical applications and value of the
Transactional Area Network. Whether enhancing social interactions in indoor spaces,
supporting emergency rescue operations, or optimizing workplace analytics, TAN presents
a viable solution for real-time, decentralized data exchange. By primarily depending
on users’ personal devices and minimizing reliance on external hardware, TAN has the
potential to enhance user experience and improve operational efficiency across various
settings.
6.4 Proof-of-Concept Implementation
6.4.1 Analysis
To demonstrate the core functionality of the Transactional Area Network, the research
team developed a basic prototype using the Swift programming language. Swift [176] is a
modern language mainly used for building applications on Apple platforms, including iOS,
macOS, and watchOS. It was introduced as a more user-friendly and secure alternative
to Objective-C, which had been the primary language for Apple development. The imple-
mented software makes use of Apple’s iBeacon and Multipeer Connectivity frameworks,
enabling Bluetooth Low Energy signal broadcasting and peer-to-peer communication over
Wi-Fi.
The initial step in the system involves setting up a virtual beacon. To achieve this, the
device begins by broadcasting a UUID along with specific ‘Major’ and ‘Minor’ values. These
values remain unchanged while the device is active in the network. A UUID is first defined
and shared among all participating devices in the session. To ensure uniqueness, each
device assigns itself random Major and Minor values within the range of 1 to 500,000.
Before broadcasting, the device scans for nearby devices and compares their Major/Minor
combinations. If a match is detected, the device generates a new pair of values and repeats
the check until it finds a unique combination. Once uniqueness is confirmed, the device
starts broadcasting and becomes part of the network (see Figure 6.4). While in typical
beacon implementations, the Major value is used to group devices under the same UUID,
this proof-of-concept assigns both Major and Minor values uniquely to each device, in
order to ensure clear identification within the network.
// Loop over the detected beacons; each item is called 'beacon'.
for beacon in beacons {
    let major = beacon.major.uint16Value
    let minor = beacon.minor.uint16Value
    // Regenerate until neither value collides with this beacon
    // and the device's own Major and Minor values differ from each other.
    while number1 == major || number2 == minor || number1 == number2 {
        if number1 == major {
            print("Common Major value found. Changing the value.")
            number1 = Int.random(in: 1...500000)
        }
        if number2 == minor {
            print("Common Minor value found. Changing the value.")
            number2 = Int.random(in: 1...500000)
        }
        if number1 == number2 {
            // Identical Major and Minor: redraw one so the loop can terminate.
            number2 = Int.random(in: 1...500000)
        }
    }
}
Figure 6.4. Swift code snippet that verifies and ensures the uniqueness of Major/Minor
values [5].
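The same uniqueness check can also be expressed against a set of Major/Minor pairs collected during scanning. The following is a minimal sketch under stated assumptions, not the thesis implementation: the `BeaconID` type and the `pickUniqueID` function are hypothetical names, the check is made per pair (which is looser than the per-field comparison in Figure 6.4), and values are drawn from 1...65535 because Core Location represents Major and Minor as 16-bit unsigned integers (`CLBeaconMajorValue` is `UInt16`).

```swift
// Hypothetical value type holding one Major/Minor pair.
struct BeaconID: Hashable {
    let major: UInt16
    let minor: UInt16
}

// Draw random pairs until one is found that no detected beacon already uses.
// Assumes `taken` does not already cover every possible pair.
func pickUniqueID(avoiding taken: Set<BeaconID>) -> BeaconID {
    while true {
        let candidate = BeaconID(major: UInt16.random(in: 1...UInt16.max),
                                 minor: UInt16.random(in: 1...UInt16.max))
        // Accept the pair only if it is unused; like Figure 6.4, also require Major != Minor.
        if !taken.contains(candidate) && candidate.major != candidate.minor {
            return candidate
        }
    }
}

// Pairs already observed while scanning (illustrative values).
let seen: Set<BeaconID> = [BeaconID(major: 172, minor: 8376),
                           BeaconID(major: 453, minor: 3221)]
let mine = pickUniqueID(avoiding: seen)
print("Selected Major/Minor pair: \(mine.major)/\(mine.minor)")
```

Using a set makes each membership test constant-time on average, so the cost of the check does not grow with the number of beacons already detected.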
To join the session, the application first checks whether Bluetooth is enabled on
the device, and whether it is capable of detecting nearby signals (see Figure 6.5). If
these conditions are satisfied, the application defines a beacon region which is known as
‘CLBeaconRegion’. Within this region, it begins scanning for other devices that share the
same UUID. This initiates the ‘ranging’ phase, during which the software identifies and
logs all nearby devices broadcasting as beacons with the shared UUID. At the same time,
it estimates the distance between devices. iBeacon performs this estimation, referred to
as ‘accuracy’, by using the received signal strength indicator (RSSI) from each beacon,
along with the beacon’s known transmit power (TxPower) [177]. The estimated distance
is calculated using a logarithmic path loss model, represented by the following equation:
\[ \text{Distance} = 10^{\frac{\text{TxPower} - \text{RSSI}}{10 \times n}} \tag{6.6} \]
where:
• RSSI: The measured strength of the signal received from the iBeacon. It
is typically reported in decibels relative to one milliwatt (dBm). The RSSI value
decreases as the distance between the iBeacon and the receiver increases.
• TxPower (Transmit Power): A calibrated constant that represents the expected
RSSI at a 1-meter distance from the beacon (in dBm). It is specific to each beacon
and is typically provided by the manufacturer or determined during the calibration
process.
• Path Loss Exponent (n): A factor that represents the rate at which the signal
attenuates as it propagates through the environment. The value of n varies with
the physical characteristics of the environment. For example, in free space, n is
typically 2, whereas in more obstructed environments, n can be 3 or 4.
For example, if an iBeacon has a TxPower of −59 dBm and the measured RSSI at the
receiving device is −70 dBm in an environment where the path loss exponent n is 2, the
estimated distance would be calculated as follows:
\[ \text{Distance} = 10^{\frac{-59 - (-70)}{10 \times 2}} = 10^{\frac{11}{20}} \approx 3.55\ \text{m} \tag{6.7} \]
The logarithmic path loss model offers a reasonably accurate estimate of distance in
most indoor settings. The collected distance data can be further processed to improve
result accuracy and, if needed, to support techniques such as triangulation for deriving
directional information.
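The logarithmic path loss model translates directly into code. The following is a minimal sketch of the model itself, not of Core Location's internal `accuracy` computation; the function name `estimatedDistance` and the default exponent of 2 are illustrative assumptions.

```swift
import Foundation

// Estimate distance (in meters) from RSSI using the logarithmic path loss model.
// txPower: calibrated RSSI at 1 m (dBm); rssi: measured RSSI (dBm); n: path loss exponent.
func estimatedDistance(txPower: Double, rssi: Double, pathLossExponent n: Double = 2.0) -> Double {
    return pow(10.0, (txPower - rssi) / (10.0 * n))
}

// Values from the worked example: TxPower of -59 dBm, RSSI of -70 dBm, n = 2.
let d = estimatedDistance(txPower: -59, rssi: -70, pathLossExponent: 2)
print(String(format: "Estimated distance: %.2f m", d))
```

When the measured RSSI equals TxPower, the exponent is zero and the function returns exactly 1 m, which matches the definition of TxPower as the expected RSSI at a 1-meter distance.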
if CLLocationManager.isMonitoringAvailable(for: CLBeaconRegion.self) {
    // Match all beacons with the specified UUID.
    // Create the region and begin monitoring it.
    let region = CLBeaconRegion(proximityUUID: uuid, identifier: localBeaconId)
    self.locManager.startMonitoring(for: region)
    // Start ranging only if the feature is available.
    if CLLocationManager.isRangingAvailable() {
        locManager.startRangingBeacons(in: region)
        // Store the beacon so that ranging can be stopped on demand.
        beaconsToRange.append(region)
    }
    locationManager(locManager, didRangeBeacons: beacons, in: region)
}
Figure 6.5. Swift code snippet that checks the device’s ability to transmit as a beacon [5].
The next phase involves the device initiating a peer-to-peer communication session
using the Multipeer Connectivity (MC) framework. After the device successfully begins
operating as a virtual beacon, the software compiles three key components required to
establish an MC session: (i) the device’s Major/Minor identifier pair used for beaconing, (ii)
the user’s name credentials, and (iii) a user-selected image. In addition to these elements,
certain other properties are needed to enable the MC session. A critical requirement is a
unique identifier for each device, referred to as ‘myPeerId’. This identifier ensures each
device is distinct within the session, helping prevent data transfer issues or conflicts. Two
more properties are also initialized: one to broadcast the device’s presence (‘serviceAdvertiser’,
see Figure 6.6) and one to search for other devices (‘serviceBrowser’, see Figure 6.7).
Using the serviceAdvertiser, the device continuously sends out its availability to nearby
devices (peers) and accepts connection invitations. Meanwhile, the serviceBrowser allows
the device to discover new peers, initiate connection requests, and receive notifications
when peers disconnect from the session. Together, these components handle the joining
and leaving of devices in the session. Ultimately, the Multipeer Connectivity framework
enables smooth and direct data sharing between all connected devices, functioning as a
peer-to-peer, decentralized communication channel.
// Invitation from another peer
func advertiser(_ advertiser: MCNearbyServiceAdvertiser,
                didReceiveInvitationFromPeer peerID: MCPeerID,
                withContext context: Data?,
                invitationHandler: @escaping (Bool, MCSession?) -> Void) {
    print("Received an invitation from Peer with ID: \(peerID)")
    // Accept the invitation and provide the session
    invitationHandler(true, self.session)
}
Figure 6.6. ServiceAdvertiser receives an invitation to join another peer’s session [5].
// Finding a new MC peer
func browser(_ browser: MCNearbyServiceBrowser, foundPeer peerID: MCPeerID,
             withDiscoveryInfo info: [String: String]?) {
    print("Peer Found: \(peerID)")
    var peerExists = false
    // Check if the peer is in the peer list
    for pp in peers {
        if pp.id == peerID.displayName {
            peerExists = true
            break
        }
    }
    // If not, invite the peer to the session
    if !peerExists {
        if peerID.displayName != ID {
            print("Peer does not exist in other sessions. Peer invited: \(peerID)")
            self.delegate?.savePeer(peerID: peerID)
            browser.invitePeer(peerID, to: self.session, withContext: nil, timeout: 10)
        }
    }
}
Figure 6.7. ServiceBrowser locates a peer and checks if they already exist in the device’s
list with known peers [5].
During a Multipeer Connectivity session, each device can send data to all connected
peers. In this proof-of-concept system, the exchanged data includes the user’s name and
image. Every time a device sends data, it also includes its unique peer ID. As a result, all
shared data within the session is labeled with the sender’s peer ID. To store and manage
data received from other devices, a list called ‘peers’ is used. This list holds objects
of a custom type named ‘aPeer’, which was created specifically for this project. Each
‘aPeer’ object contains four pieces of information: (i) the peer ID, (ii) the user’s name, (iii)
their selected image, and (iv) the distance from the current device (based on the ‘accuracy’
value). In this way, the peers list keeps a separate object for each nearby device, recording
their ID, name, picture, and relative distance.
When the system receives data from other devices, it first checks if the device is already
listed. If it is not, a new ‘aPeer’ object is created using the received peer ID, name, and
picture, and added to the list. If the peer is already known, the existing object is updated
with any new data. It is also possible to create a peer object with missing information,
allowing the rest to be added later during the session. For example, if a device starts
sharing its data, the name might arrive before the picture, due to network delays. In such
a case, the system creates a peer object with the peer ID and name. When the image
arrives later, the system finds the existing object and fills in the missing picture.
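The create-or-update behaviour described above can be sketched as follows. This is an illustrative reconstruction, not the thesis code: `PeerRecord` is a hypothetical stand-in for the ‘aPeer’ type, and `upsert` is an assumed helper name. Optional fields model information that has not arrived yet.

```swift
import Foundation

// Hypothetical stand-in for the 'aPeer' type; fields may arrive in any order.
struct PeerRecord {
    let id: String
    var name: String?
    var imageData: Data?
    var distance: String?
}

// Insert a new peer, or merge newly received fields into the existing entry.
func upsert(_ peers: inout [PeerRecord], id: String, name: String? = nil, imageData: Data? = nil) {
    if let index = peers.firstIndex(where: { $0.id == id }) {
        // Known peer: only overwrite the fields that actually arrived.
        if let name = name { peers[index].name = name }
        if let imageData = imageData { peers[index].imageData = imageData }
    } else {
        // Unknown peer: create it with whatever information is available so far.
        peers.append(PeerRecord(id: id, name: name, imageData: imageData, distance: nil))
    }
}

var peers: [PeerRecord] = []
upsert(&peers, id: "172/8376", name: "Peer 0")                 // the name arrives first
upsert(&peers, id: "172/8376", imageData: Data([0x89, 0x50]))  // the image arrives later
print(peers.count, peers[0].name ?? "-", peers[0].imageData?.count ?? 0)
```

Because a later message never clears a field it does not carry, the name received first is preserved when the image is merged into the same record.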
However, a key challenge in this system is linking each device’s beacon signal, which
provides relative distance to other peers, with the peer-to-peer session data, specifically
the user’s name and image. Since the Bluetooth Low Energy and Multipeer Connectivity
signals operate on different protocols, the system must find a way to associate them
correctly. To achieve this, the software must identify which BLE and MC signals come
from the same device. The pairing process works as follows (also depicted in Figure 6.8):
Each BLE signal provides two unique identifiers, Major and Minor values. These values
are combined into a single string, using a slash (‘/’) as a separator. For instance, if a
device has a Major value of ‘172’ and a Minor value of ‘8376’, the combined identifier
becomes ‘172/8376’. This string is then stored in the device’s ‘myPeerId’ attribute.
As a result, the device will advertise itself as a beacon using the Major/Minor values
‘172/8376’, and also participate in the peer-to-peer MC session using the same identifier.
When the software detects a BLE signal from a virtual beacon device and simultaneously
receives data from an MC peer, it combines the Major/Minor values of the beacon signal
into a string variable, separated by a slash punctuation mark (‘/’). Thus, the BLE Ma-
jor/Minor values ‘172’ and ‘8376’ will again form ‘172/8376’ on the receiving end. The
software then compares this new string variable from the BLE signal with the received
ID of the MC peer. If they match, it confirms that both signals originate from the same
device. If not, the software continues to compare the signals until all are correctly paired.
This straightforward and efficient solution effectively combines beacon signals with
peer-to-peer session data. By using this approach, each device can correctly identify
which signals belong to the same device, allowing the software to calculate their relative
distance based on the BLE signal. As a result, the distance attribute for each peer object
in the device’s list is updated accordingly. The integration of BLE signals with peer-
to-peer session data ensures that each device can accurately associate signals with the
correct user. This seamless integration successfully demonstrates the functionality of the
Transactional Area Network, highlighting its decentralized nature.
// Loop through the existing beacon signals, with each iteration item as 'bb'
for bb in beacons {
    let amajor = bb.major.intValue
    let aminor = bb.minor.intValue
    // Create the string value of the Major/Minor pair, as an ID
    let theid = String(amajor) + "/" + String(aminor)
    print("Found a new Major/Minor pair as an ID:", theid)
    var idExists = false
    // Check if the ID created matches any known IDs, then proceed to save the distance
    // of a known Peer, if the ID matches with one from the list, or create a new Peer
    for pp in peers {
        if pp.id == theid {
            pp.distance = String(format: "%.2f", ceil(bb.accuracy * 100) / 100)
            print("Found ID matches existing Peer in the session. Distance from peer: \(bb.accuracy)")
            idExists = true
            table.reloadData()
            break
        }
    }
    if !idExists {
        let checkDistance = String(format: "%.2f", ceil(bb.accuracy * 100) / 100)
        // A negative accuracy means the distance could not be determined; skip such beacons
        if !checkDistance.hasPrefix("-") {
            print("Found ID does not match any known Peer. Creating new one...")
            let peer = Peer()
            peer.id = theid
            peer.name = "New peer"
            peer.image = #imageLiteral(resourceName: "profilepic.png")
            peer.distance = checkDistance
            peers.append(peer)
            table.reloadData()
        }
    }
}
Figure 6.8. The Major/Minor pairing process, and examining for matches in the known
Peers list [5].
Figure 6.9. TAN prototype software’s internal components’ architecture [5].
As illustrated in Figure 6.9, the proof-of-concept software for TAN is not merely a
fusion of two technologies. The proposed framework effectively combines iBeacon and
Multipeer Connectivity, utilizing their respective signals to create a unique software so-
lution. In an Android context, these technologies would correspond to Eddystone and
Nearby Connections, respectively. TAN’s software modifies the data transmitted through
each signal (BLE RSSI and MC), utilizing the CLBeaconRegion and MCServiceAdvertiser
modules for transmission, while simultaneously waiting for incoming signals through
the CLBeaconRegion (again) and MCServiceBrowser modules. The received data is then
processed in the BLE and MC Signal Pairing module to identify common senders, and
generate corresponding iBeacon/MC signal pairs. Once a match is found, the software
stores the data in a peer list, establishing a successful connection between the sender
and receiver. This approach not only enhances the accuracy of signal pairing, but also
strengthens the overall connection. Ultimately, this integration supports seamless com-
munication and data exchange within the decentralized structure of the Transactional
Area Network.
In terms of data transmission, it is essential to ensure that the data collected by
each device is properly formatted before being transmitted to an external database. The
resulting dataset, based on the peer list, is organized into JSON objects for efficient storage
and analysis. Each JSON object contains key information, including a timestamp, the
BLE UUID, the device’s Major and Minor values, and the device’s ‘peerID’, which is derived
by combining its Major and Minor values (as previously mentioned). Since all relevant
details about each peer are stored in the device’s peer list, this list is converted into a
JSON array, where each entry represents an individual peer. Thus, each object includes
the peer’s ID (formed by the combination of its BLE Major/Minor values), name, a base64-
encoded image, and the calculated distance based on the BLE signal (as shown in Figure
6.10). Given the frequency of distance updates (every second), it is crucial to define an
optimal transmission interval to the database. This interval should balance the need for
real-time updates with efficient data management. Typically, the transmission interval is
set between a few seconds and a minute, depending on the scale of each TAN.
// JSON Object
{
  "timestamp": "2024-05-22T12:34:56Z",
  "uuid": "123e4567-e89b-12d3-a456-426614174000",
  "major": 172,
  "minor": 8376,
  "id": "172/8376",
  "data": {
    "name": "Peer0",
    "image": "AAANSUhETgAAA...",
    "extra_data_1": "example_value_1",
    "extra_data_2": "example_value_2"
  },
  "peers": [
    {
      "id": "453/3221",
      "data": {
        "name": "Peer1",
        "image": "TG...",
        "distance": "5.23",
        "extra_data_1": "example_value_1",
        "extra_data_2": "example_value_2"
      }
    },
    {
      "id": "7753/934",
      "data": {
        "name": "Peer2",
        "image": "...",
        "distance": "3.45",
        "extra_data_1": "example_value_3",
        "extra_data_2": "example_value_4"
      }
    }
  ]
}
Figure 6.10. A data sample from the TAN’s proof-of-concept software [5].
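One possible way to produce such an object in Swift is through `Codable` types and `JSONEncoder`. The sketch below is an assumption about how the serialization could be organized, not the thesis implementation: the type names `TANReport` and `PeerEntry` are hypothetical, and the device's own "data" object and the "extra_data" fields are omitted for brevity.

```swift
import Foundation

// Hypothetical Codable mirror of one entry in the "peers" array of Figure 6.10.
struct PeerEntry: Codable {
    struct Payload: Codable {
        let name: String
        let image: String      // base64-encoded image
        let distance: String
    }
    let id: String
    let data: Payload
}

// Hypothetical mirror of the top-level object (own "data" and extra fields omitted).
struct TANReport: Codable {
    let timestamp: String
    let uuid: String
    let major: Int
    let minor: Int
    let id: String
    let peers: [PeerEntry]
}

// Placeholder bytes standing in for a real image.
let image = Data([0x01, 0x02, 0x03]).base64EncodedString()
let report = TANReport(
    timestamp: ISO8601DateFormatter().string(from: Date()),
    uuid: "123e4567-e89b-12d3-a456-426614174000",
    major: 172, minor: 8376, id: "172/8376",
    peers: [PeerEntry(id: "453/3221",
                      data: .init(name: "Peer1", image: image, distance: "5.23"))]
)

let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let json = try! encoder.encode(report)
print(String(data: json, encoding: .utf8)!)
```

Keeping the wire format in dedicated `Codable` types separates it from the in-memory peer list, so fields can be added to the transmitted object without touching the pairing logic.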
6.4.2 Testing and Results
Testing Scenario 1: Living Room
TAN’s proof-of-concept software has been tested on four iOS devices: an iPhone X, an
iPhone 8, an iPhone SE (2016) and an iPad Pro 10.5 (2017). The first testing scenario
aimed to determine the distance between devices forming a Transactional Area Network
within an apartment’s living room. When each user enters their information into their
device, the software initiates the previously outlined procedure. It starts by broadcasting
a Bluetooth Low Energy (BLE) signal, functioning as a virtual beacon, and establishes a
new peer-to-peer session using Multipeer Connectivity (MC). On the BLE side, the device
scans the area for beacons with the same UUID. Simultaneously, on the MC side,
the device creates a serviceAdvertiser, a serviceBrowser, and a session that continuously
searches for new peers and any data they may send.
Figure 6.11. Establishment of TAN between two peers, as seen from Daniel Adams’ device
(iPhone X) [5].
In Figure 6.11, the moment when two users enter TAN for the first time is depicted. The
distance between user ‘Daniel Adams’ (whose device’s screen is captured in Figure 6.11)
and user ‘David Jones’ is less than one meter, indicating close proximity. The BLE
signal calculation in this instance is quite accurate. Measurements taken with a physical
measuring instrument indicated a distance of approximately 95 cm between the two
devices, resulting in a deviation of about five centimeters. Subsequently, David Jones
moved closer to Daniel Adams. The application running on David Jones’s device recorded
this change, as shown in Figure 6.12.
Figure 6.12. David Jones’s device (iPhone 8) showing the distance between him and Daniel
Adams [5].
In Figure 6.12, the user David Jones perceives the user Daniel Adams at a distance of
61 cm. According to the measurements taken, the deviation was three centimeters, with
the actual distance being 64 cm. At this point, another user enters the system. Sitting
close to Daniel Adams and David Jones is Alex Lopez, who is using an iPad tablet. He
wishes to join the decentralized TAN to detect other nearby users. The result is shown in
Figure 6.13.
Figure 6.13. Alex Lopez with his device (iPad Pro 10.5 2017) entering TAN [5].
In this figure, the information displayed on Alex Lopez’s device is shown. Alex appears
to be approximately one meter away from both Daniel Adams and David Jones. It should
be noted that Daniel and David are not seated next to each other. Enabling direction
information for each user in a Transactional Area Network is important for enhancing the
user experience and improving data generation. According to actual measurements, the
distance of Alex Lopez from Daniel Adams and David Jones had a deviation of 8 to 10 cm
from the system’s calculated value (the software calculated a shorter distance than the
actual one), indicating that Daniel and David were actually seated slightly further away
from Alex.
At this stage, David Jones decides to move further away from the other two users
within the system. The outcome is displayed on his device’s screen, in Figure 6.14:
Figure 6.14. David Jones’s new distance from the users, as shown in his device’s screen
[5].
From David’s device, Daniel appears to be 2.1 m away, while Alex is measured at 2.66
m. However, upon actual measurement, the BLE signal’s distance calculation shows a
deviation of approximately 15 cm. Consequently, the real distance between David and
Daniel is 2.25 m, while David and Alex are approximately 2.8 m apart. This initial testing
scenario, conducted in an apartment’s living room as previously mentioned, revealed that
the proof-of-concept software calculates the beacon signal’s distance quite accurately. As
expected, as the distance between devices increases, so does the deviation. Nonetheless,
these deviations remain relatively low, indicating a good level of precision in BLE signal
measurements, especially at close distances. Furthermore, data (users’ names and im-
ages) are seamlessly transferred via the peer-to-peer communication session with minimal
to no delays.
Testing Scenario 2: Cafeteria
The second phase of testing the proof-of-concept software for TAN took place in a
public cafeteria. Similar to the initial test case, user Daniel Adams initiates the process
on his iPhone X to detect nearby users. Upon entering the system, Daniel identifies
two nearby devices/users already transmitting signals as virtual iBeacons and MC peers,
therefore having a TAN established. Specifically, one user named John Doe is using an
iPad, while another user named Catherine Smith is in the network with her iPhone SE
(2016). Both users’ information (names and images), along with the distance between
them and Daniel, is displayed on Daniel’s screen, as depicted in Figure 6.15.
Figure 6.15. Daniel Adams enters TAN through his iPhone X, viewing two peers nearby,
John Doe and Catherine Smith [5].
Daniel observes that Catherine is only 11 cm away from him, while John Doe is
approximately 41 cm away. Actual measurements showed a deviation of 2 cm to Catherine
(actual distance of 13 cm) and 6 cm to John Doe (actual distance approximately 47 cm)
from Daniel’s device. Shortly after, Daniel decides to make a phone call and exits the
cafeteria. Simultaneously, Catherine moves towards the bar, located further inside the
café, to order a beverage. Daniel’s device now displays updated distance data, as shown
in Figure 6.16.
Figure 6.16. Daniel Adams’s new distance from users John Doe and Catherine Smith [5].
In this instance (Figure 6.16), Daniel registers a distance of 4.48 m from John Doe
and 10.23 m from Catherine. Actual measurements, however, reveal larger deviations from
the values calculated by the software: Daniel is approximately 3.9 m away from John Doe
(a deviation of approximately 55 cm) and 9.1 m away from Catherine (a deviation of 1.1 m).
Despite these relatively large deviations, the software’s calculation of BLE signal
distance could still be considered reliable for longer distances. Although it cannot be
considered an objective scientific acceptance criterion, it should be noted that the
human eye struggles to distinguish a one-meter difference at distances of 9–10 m.
Nevertheless, minimizing the actual distance error warrants further analysis. Moreover,
as a user approaches another user within the session whose distance was initially large,
the deviation from the actual distance decreases. In summary, the observed deviations
are not significant enough to deem the beacon’s distance calculation unreliable.
As the second testing scenario nears completion, Daniel and Catherine return to
their seats, taking the same positions as before. John checks his device’s
screen to verify whether the distances remain at the same levels as before. Daniel should be
approximately 40 cm away, with Catherine seated between them at a distance ranging
from 10 to 20 cm. The measurements displayed on John’s screen are depicted in Figure
6.17.
Figure 6.17. John Doe takes one final look at his iPad’s screen, observing the distance
between himself and users Daniel Adams and Catherine Smith [5].
The distance between John Doe and Daniel Adams appears to be 0.46 m, while
John’s distance from Catherine Smith measures 0.17 m. Actual measurements
show a similar deviation as before: John is actually 0.51 m away from Daniel (a
deviation of about 5 cm) and 0.19 m away from Catherine (a deviation of two centimeters).
As previously mentioned, BLE signal distance calculation proves relatively accurate at
short distances. However, its accuracy diminishes with increasing distance, affecting
the software’s calculations accordingly. Nonetheless, the Multipeer Connectivity session
operates smoothly, with each device taking approximately 6 to 7 s (at worst) to receive
both names and images of the other peers/users. It is worth noting that both testing
scenarios have been repeated three times, yielding consistent results each time, with a
deviation of ±3 cm for smaller distances and ±10 to ±20 cm for greater ones.
6.5 Findings
The concept of a ‘Transactional Area Network’ (TAN) presents an innovative model in
computer science, particularly within the realm of area networks. This network connects
personal devices in a decentralized fashion, aiming to enhance interaction, mutual in-
fluence, and data exchange between entities. Such a network has the potential to serve
as a foundation for future indoor localization technologies. As outlined in its concept
analysis, a TAN-based indoor positioning system can be developed without requiring ex-
ternal hardware (such as beacon trackers), digital mapping, or specialized studies of the
indoor environment. The proof-of-concept software successfully demonstrated the net-
work’s ability to locate smart devices in an indoor setting, calculate distances between
them based on their BLE beacon signals, and facilitate data exchange through peer-to-
peer communication sessions. The key features and principles of TAN could promote
the widespread adoption of indoor localization applications, offering practical benefits to
users by enabling seamless location tracking and data sharing.
However, strengthening the robustness of TAN requires addressing several important
factors. While there have been various proposals to enhance the accuracy of indoor
positioning systems, their integration into a TAN must be carefully considered. Solutions
such as employing machine learning algorithms, utilizing more internal device sensors,
or applying filtering techniques could improve the accuracy of distance measurements
across the network. Nonetheless, these strategies may raise concerns regarding energy
consumption, which could deter end-users from adopting the technology. Therefore,
software and frameworks built on the TAN model must be designed to be energy-efficient,
avoiding excessive power demands that might discourage use and ensuring that
users can engage with the network for extended periods without limitations.
Yet, the accuracy of distance calculations within TAN requires further investigation
and refinement, particularly over longer distances, where deviations from actual mea-
surements tend to become more pronounced. The existing research discussed in this
chapter provides valuable insights that could help address this challenge. In light of
the impact on energy consumption, an Extended Kalman Filtering model could be ap-
plied to the BLE RSSI signals, supplemented by an Outlier Detection layer. Additionally,
providing users with directional information about each peer within the network is cru-
cial. Knowing both the direction and distance to nearby peers significantly improves a
user’s spatial awareness and understanding of their relative position within the network.
These advancements could be integrated into future iterations of TAN’s proof-of-concept
software, as depicted in Figure 6.18.
Figure 6.18. Proposed improvement in TAN software’s internal components Architecture,
highlighting the addition of Kalman filtering and outlier detection methods, along with a
signal direction calculation process [5].
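To illustrate the kind of filtering layer proposed here, the sketch below applies a scalar Kalman filter with an innovation-based outlier gate to a stream of RSSI readings. It is a deliberate simplification of the Extended Kalman Filtering model mentioned above (the state is a single, slowly varying RSSI value), and the type name `RSSIFilter` as well as the noise and gate parameters are illustrative assumptions that would need calibration on real data.

```swift
import Foundation

// Scalar Kalman filter for a slowly varying RSSI value, with a simple outlier gate.
struct RSSIFilter {
    var estimate: Double            // current filtered RSSI (dBm)
    var errorCovariance: Double     // uncertainty of the estimate
    let processNoise: Double        // Q: how quickly the true RSSI may drift
    let measurementNoise: Double    // R: variance of a single RSSI reading
    let outlierGate: Double         // reject readings beyond this many standard deviations

    init(initial: Double, processNoise: Double = 0.05, measurementNoise: Double = 4.0,
         outlierGate: Double = 3.0) {
        self.estimate = initial
        self.errorCovariance = measurementNoise
        self.processNoise = processNoise
        self.measurementNoise = measurementNoise
        self.outlierGate = outlierGate
    }

    // Feed one reading and return the updated estimate. Outliers leave the estimate unchanged.
    mutating func update(with rssi: Double) -> Double {
        // Predict: the state model is constant, so only the uncertainty grows.
        errorCovariance += processNoise
        let innovation = rssi - estimate
        let innovationVariance = errorCovariance + measurementNoise
        // Outlier detection: skip readings implausibly far from the prediction.
        if abs(innovation) > outlierGate * innovationVariance.squareRoot() {
            return estimate
        }
        // Correct: blend prediction and measurement using the Kalman gain.
        let gain = errorCovariance / innovationVariance
        estimate += gain * innovation
        errorCovariance *= (1.0 - gain)
        return estimate
    }
}

var filter = RSSIFilter(initial: -70)
for reading in [-71.0, -69.0, -70.5, -30.0, -70.0] {   // -30 dBm is an injected outlier
    let smoothed = filter.update(with: reading)
    print(String(format: "reading %.1f dBm -> estimate %.2f dBm", reading, smoothed))
}
```

Each update costs only a handful of arithmetic operations, which is relevant to the energy concerns raised earlier; the filtered RSSI could then be passed to the path loss model of Equation 6.6.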
Future research on the concept of a Transactional Area Network should primarily
focus on a thorough examination of algorithms, accuracy, and efficiency. It should inves-
tigate the potential limitations of TAN, such as the effects of physical barriers like walls
on signal propagation, and how these might influence localization accuracy, introducing
geometric distortions. Additionally, the study of distance measurement errors, standard
deviation, and root mean square error should be considered, drawing on the work of
Dalkılıç et al. [177] on BLE signal accuracy. The proposed enhancements in Figure 6.18
should integrate algorithmic analysis based on the work of Dalkılıç et al. [177] and Dinh et al. [151].
These efforts would help validate the theoretical developments discussed in this chapter,
ensuring that TANs can be applied effectively in practical scenarios.
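As a starting point for such an error study, the measurements reported in Section 6.4.2 can be summarized with standard error metrics. The sketch below is illustrative only: it uses the estimated and measured distances quoted in the two testing scenarios (several of which are approximate in the text), and seven samples are far too few for statistically meaningful conclusions.

```swift
import Foundation

// (software estimate, physical measurement) in meters, as quoted in Section 6.4.2.
let samples: [(estimated: Double, actual: Double)] = [
    (0.61, 0.64), (2.10, 2.25), (2.66, 2.80),   // Testing Scenario 1: living room
    (0.11, 0.13), (0.41, 0.47),                 // Testing Scenario 2: cafeteria, seated
    (4.48, 3.90), (10.23, 9.10)                 // Testing Scenario 2: cafeteria, longer range
]

// Signed error of each sample: positive when the software overestimates.
let errors = samples.map { $0.estimated - $0.actual }
let n = Double(errors.count)

let meanError = errors.reduce(0, +) / n
let meanAbsoluteError = errors.map { abs($0) }.reduce(0, +) / n
let rmse = (errors.map { $0 * $0 }.reduce(0, +) / n).squareRoot()
// Population standard deviation of the signed error.
let stdDev = (errors.map { ($0 - meanError) * ($0 - meanError) }.reduce(0, +) / n).squareRoot()

print(String(format: "ME %.3f m, MAE %.3f m, RMSE %.3f m, SD %.3f m",
             meanError, meanAbsoluteError, rmse, stdDev))
```

Because the root mean square error weights large deviations more heavily than the mean absolute error, the two long-range cafeteria samples dominate it, which is consistent with the observation that accuracy degrades with distance.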
Regarding the implementation of the software in Swift, one minor issue arises from the
limit on the number of peers that each Multipeer Connectivity session can accommodate
(set at a maximum of 8 peers, as specified by Apple in the pre-defined value ‘maximumNumberOfPeers’)
[178]. However, this limitation can be addressed by creating multiple MC
sessions, each supporting up to 5 peers, thus keeping energy consumption and resource
requirements manageable. In this setup, certain peers would function as intermediaries
or bridges between different sessions. These intermediary peers would receive data from
one session and relay it to others, effectively linking the sessions, allowing for extended
network coverage.
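The bridging idea can be sketched as a simple partitioning step. This is a hypothetical illustration rather than a tested design: the function `planSessions` is an assumed name, peers are represented by their Major/Minor ID strings, and the last member of each session is reused as the first member of the next so that it can act as the bridge. Relaying logic, failure handling, and bridge selection criteria are out of scope.

```swift
// Split a list of peer IDs into sessions of at most `maxPerSession` members,
// where consecutive sessions share exactly one peer that acts as the bridge.
func planSessions(peerIDs: [String], maxPerSession: Int = 5) -> [[String]] {
    precondition(maxPerSession >= 2, "a session needs room for a bridge and one more peer")
    guard !peerIDs.isEmpty else { return [] }
    var sessions: [[String]] = []
    var start = 0
    while start < peerIDs.count {
        let end = min(start + maxPerSession, peerIDs.count)
        sessions.append(Array(peerIDs[start..<end]))
        if end == peerIDs.count { break }
        // The last peer of this session is reused as the bridge into the next one.
        start = end - 1
    }
    return sessions
}

// Twelve illustrative peer IDs in the Major/Minor string format.
let ids = (1...12).map { "100/\($0)" }
let plan = planSessions(peerIDs: ids)
for (index, session) in plan.enumerated() {
    print("Session \(index + 1): \(session)")
}
```

A chain of sessions like this keeps every session within the chosen size, at the cost of extra relaying work on the bridge peers, so bridge devices would likely be the ones most affected by the energy considerations discussed above.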
Lastly, it is essential to propose an efficient data management system for effectively
retrieving and processing the indoor positioning data transmitted by each device within
a TAN. To enhance the dataset, each device should also transmit its GPS coordinates,
offering valuable context for data engineers analyzing the information. The scalable data
management layer, a main part of this PhD thesis, is designed to manage large data streams
in near real-time, making it well-suited for integration with a Transactional Area Network.
For properly analyzing the data generated within TANs, this thesis’s data analysis
asset, which utilizes offline LLMs, can prove to be a vital assistant for end-users.
Thus, the PhD thesis’s complete framework is a robust solution for managing, storing,
and deeply analyzing TAN data.
In conclusion, the implementation of a Transactional Area Network holds significant
promise for advancing indoor positioning systems, offering new services to end-users
and simplifying the technology to facilitate widespread adoption. The peer-to-peer design
promotes direct communication between devices, aiming for a decentralized network that
is both efficient and scalable. By minimizing the need for external devices, TAN also
enhances accessibility and affordability, while ensuring that consistent data collection
and clarity allow for the effective reuse of the data. As TANs continue to develop and
gain adoption, they have the potential to improve indoor positioning for consumers, and
expand their applications across a variety of industries, transforming how people interact
with indoor spaces. In emergency situations, such as disasters, TANs could provide life-
saving capabilities by helping rescuers locate survivors. Ultimately, the Transactional
Area Network could redefine the foundation of future indoor positioning applications,
harnessing the power of indoor localization data.
Part III
Epilogue
Chapter 7
Conclusions: Towards Scalable Data Management
Empowered By Offline Large Language Models
7.1 Managing Scalable Volumes of Data
The specific demands of critical infrastructures and business organizations underline
the importance of a middleware system that is able to structure data in a way that
supports reliable and effective analysis. To enable smooth interaction between connected
components, services must also be adapted in advance, so as to ensure interoperability.
This thesis’s scalable data management layer follows a data-centric design, providing a
platform where small and medium-sized enterprises (SMEs), telecom companies, data
providers, content creators, transportation authorities, and other stakeholders across
various domains could work together within a well-organized and collaborative data-
sharing environment. Creating such a space is a key goal for any asset aiming to build and
maintain an ecosystem that attracts businesses, startups, and independent developers.
As part of this PhD study, the scalable data management layer is specifically
designed to tackle key challenges faced by organizations today, including issues of data
quality and operational efficiency. To improve data reliability, the software layer applies
advanced cleansing methods that ensure data from various sources remains accurate,
complete, and consistent. The performance of the asset has been evaluated as a standalone
component, using real-world datasets: customs data from the port of Valencia, as
well as anonymized mobility data from the Thessaloniki metropolitan area. These results
confirm that the scalable data management layer can be effectively applied in other real-
life scenarios. The layer plays an integral role in the final phase of this study, where it is
interconnected with an offline LLM, leading to the establishment of an AI-powered data
management and analysis framework, aiming to ease data analytics tasks for end-users.
7.2 Leveraging Offline LLMs for Data Analysis
The final, and main, part of this thesis investigates how offline large language models
(LLMs) can be used to generate code for data analysis tasks, in combination with the
scalable data management layer described above. While well-known online models (such
as GPT or Gemini) offer advanced capabilities, their use may pose security and privacy
risks, especially when dealing with sensitive data. Additionally, current LLMs may face
difficulties in processing large-scale datasets, or fully understanding the specific context
of a query. These difficulties can lead to inaccurate results. Offline LLMs offer a promising
alternative, by allowing organizations to use these models locally, maintaining full control
over both the model and their data. In this configuration, users submit natural language
queries, which the model interprets to produce suitable data analysis code. This code is
then passed through a communication pipeline to the data management layer, where it
is executed to extract the final results.
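The query-to-result flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate_code` is a hypothetical stand-in for the call to the locally hosted model (it returns a fixed script here), `run_at_data_layer` is a hypothetical name for the execution step, and a plain list of dictionaries takes the place of the Spark-based data management layer.

```python
from typing import Any


def generate_code(question: str) -> str:
    """Hypothetical stand-in for the offline LLM: returns analysis code as text.

    A real deployment would send the question (plus dataset context) to a
    locally hosted model; a fixed script keeps this sketch self-contained.
    """
    return "result = sum(row['total_cars'] for row in data) / len(data)"


def run_at_data_layer(code: str, data: list) -> Any:
    """Hypothetical execution step: run the generated code next to the data.

    The script sees the dataset as `data` and must assign its answer to
    `result`. exec() is used only for illustration; generated code should be
    sandboxed and validated before it is executed in practice.
    """
    namespace = {"data": data}
    exec(code, namespace)
    return namespace["result"]


# Toy rows standing in for a dataset held by the data management layer.
rows = [{"total_cars": 0}, {"total_cars": 1}, {"total_cars": 3}]
answer = run_at_data_layer(generate_code("What is the average number of cars?"), rows)
print(answer)
```

In this arrangement only the question and the generated script cross the boundary between model and data layer; the rows themselves are never handed to the model.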
For this study, Mistral AI’s Codestral and Alibaba’s Qwen 2.5 Coder were selected as
the offline LLMs, chosen for their strong performance and efficiency compared to other
code-focused models. The communication pipeline and the data management layer were
both built using Apache Spark, ensuring robust and scalable data handling. Five different
datasets were used in the testing phase. Each dataset was paired with five queries of
different complexity levels. To ensure thorough evaluation, each query was tested ten
times. This process resulted in 250 individual tests per model, allowing for a detailed
assessment of how accurately Codestral and Qwen could generate appropriate code in
response to natural language inputs.
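The size of the test campaign follows directly from these counts. The short sketch below enumerates the combinations; the dataset and query labels are placeholders introduced for illustration, not the names used in the experiments.

```python
from itertools import product

# Placeholder labels; the actual dataset names and query texts are not reproduced here.
datasets = [f"dataset_{i}" for i in range(1, 6)]   # five datasets
queries = [f"query_{i}" for i in range(1, 6)]      # five queries per dataset
repetitions = range(1, 11)                         # ten runs per query
models = ["Codestral", "Qwen 2.5 Coder"]

# One entry per (dataset, query, repetition) combination for a single model.
runs_per_model = list(product(datasets, queries, repetitions))

print(len(runs_per_model))                # 250 tests per model
print(len(runs_per_model) * len(models))  # 500 tests across both models
```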
The findings of this PhD research emphasize the strong potential of offline large lan-
guage models (LLMs) in generating accurate and understandable code for data analysis
operations, all while maintaining strict data privacy standards. Both Codestral and Qwen
performed well, producing reliable and readable scripts with a high degree of automation.
Nonetheless, a major limitation identified is computational performance, particularly the
need for more advanced GPU resources to deliver faster, near real-time responses. To
overcome this challenge, future work could focus on fine-tuning the models, designing a
user-friendly interface, and exploring hybrid system architectures that balance scalability
with privacy. These improvements would help address current technical limitations and
support the transition from experimental setups to practical, enterprise-level solutions.
This PhD study may serve as a foundation for further research into the use of offline
LLMs for data analysis through natural language-driven code generation. Rather than
transferring data to an external model, this approach emphasizes moving the code gen-
eration process directly to where the data resides. Continued research in this area is
expected, as the potential of LLMs within the field of Data Science is far from fully real-
ized. Over time, this method could become a widely adopted approach for performing data
analysis. One clear trend is the growing influence of LLMs across scientific disciplines,
with Data Science likely to be one of the fields most deeply shaped by their integration.
The extent to which these models will redefine standard data analysis practices remains
an open and evolving question.
Appendices
A
Appendix
A.1 The Python code readability calculation function
Listing A.1: A custom function to provide a readability score, from 1 to 3 (with 3 meaning
higher readability), to the LLM’s generated code
import ast
import re


def evaluate_readability(code):
    """
    The function analyzes the code to determine the readability score by:
    Checking the length of each command.
    Counting the number of method call chains.
    Analyzing the depth of nested structures.
    """
    try:
        # Split the code into individual commands based on newlines and semicolons
        commands = [cmd.strip() for cmd in code.replace('\n', ';').split(';') if cmd.strip()]
        readability_score = 0  # Initialize the readability score
        total_commands = len(commands)  # Get the total number of commands
        if total_commands == 0:
            return readability_score
        # Initialize counters for long lines and maximum chain length
        long_lines = 0
        max_chain_length = 0
        for cmd in commands:
            # Split the command into lines to check for long lines
            lines = re.split(r';\s|\n', cmd)
            long_lines += sum(1 for line in lines if len(line) > 80)  # Count lines longer than 80 characters
            # Parse the command into an abstract syntax tree (AST)
            try:
                tree = ast.parse(cmd)
            except SyntaxError:
                continue  # Skip this command if there's a syntax error
            # Count method calls in chained operations
            for node in ast.walk(tree):
                if isinstance(node, ast.Call):
                    chain_length = 0
                    current_node = node.func
                    # Traverse the chain of method calls to count its length
                    while isinstance(current_node, ast.Attribute):
                        chain_length += 1
                        current_node = current_node.value
                    max_chain_length = max(max_chain_length, chain_length)
        # Adjust readability score for long lines
        if long_lines / total_commands < 0.1:  # Allow some tolerance for long lines
            readability_score += 1
        # Adjust readability score based on the length of method call chains
        if max_chain_length <= 3:  # Full score for chains up to 3 method calls
            readability_score += 1
        # Parse the entire code to evaluate nested structures
        try:
            tree = ast.parse(code)
        except SyntaxError:
            return readability_score  # Return the score if there's a syntax error in the entire code

        def count_nested_structures(node, depth=0):
            # Increase depth for specific nodes (if, for, while, function definitions)
            if isinstance(node, (ast.If, ast.For, ast.While, ast.FunctionDef)):
                depth += 1
            # Recursively check the depth of nested structures
            return max([count_nested_structures(child, depth) for child in ast.iter_child_nodes(node)], default=depth)

        nested_structure_depth = count_nested_structures(tree)
        # Full score if the maximum depth of nested structures is 2 or less
        if nested_structure_depth <= 2:
            readability_score += 1
        return readability_score  # Return the final readability score
    except Exception as e:
        print(f"Error evaluating readability: {e}")
        return 0  # Return a score of 0 if there's an error
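The chain-length rule can also be examined in isolation. In Python's AST, a chained call such as `df.filter(...).groupBy(...).count()` places an `ast.Call` node between consecutive `ast.Attribute` nodes, so a traversal that follows only attributes stops at the first intermediate call. The sketch below is a possible variant and not the function used in this thesis: the hypothetical helper `call_chain_length` also steps through intermediate calls.

```python
import ast


def call_chain_length(expression: str) -> int:
    """Return the longest run of attribute accesses in a chained expression.

    Hypothetical helper for illustration: it follows ast.Attribute nodes and
    also steps through intermediate ast.Call nodes, so a.b().c() counts as a
    chain of length 2.
    """
    longest = 0
    for node in ast.walk(ast.parse(expression)):
        if not isinstance(node, ast.Call):
            continue
        length = 0
        current = node.func
        while isinstance(current, (ast.Attribute, ast.Call)):
            if isinstance(current, ast.Attribute):
                length += 1
                current = current.value  # move to the object the attribute belongs to
            else:
                current = current.func   # step through an intermediate call
        longest = max(longest, length)
    return longest


print(call_chain_length("df.filter(cond).groupBy(key).count()"))
print(call_chain_length("len(values)"))
```

Whether intermediate calls should count towards the chain is a design choice; counting them penalizes long fluent-style pipelines, which are common in DataFrame code.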
A.2 Summary of the ‘Shared Cars Locations’ Dataset
Listing A.2: A dataset summary of the "Shared Cars Locations" dataset, carefully crafted
by a research team member.
{
  "intro": "This dataset contains information about shared car locations, including various attributes that describe each entry.",
  "columns": [
    {
      "name": "latitude",
      "type": "float64",
      "simplified_type": "float",
      "description": "Latitude coordinate of the car's location.",
      "sample_values": [
        32.083,
        32.064615,
        32.113535
      ]
    },
    {
      "name": "longitude",
      "type": "float64",
      "simplified_type": "float",
      "description": "Longitude coordinate of the car's location.",
      "sample_values": [
        34.8043,
        34.77577,
        34.7911
      ]
    },
    {
      "name": "total_cars",
      "type": "int64",
      "simplified_type": "int",
      "description": "Total number of cars available at the location.",
      "sample_values": [
        0,
        1,
        3
      ]
    },
    {
      "name": "cars_list",
      "type": "object",
      "simplified_type": "string",
      "description": "List of car identifiers available at the location.",
      "sample_values": [
        "[]",
        "[213]",
        "[137,180,193]"
      ]
    },
    {
      "name": "timestamp",
      "type": "object",
      "simplified_type": "datetime",
      "description": "Timestamp of the data record in UTC.",
      "sample_values": [
        "2019120621:51:02UTC",
        "2019112914:00:02UTC",
        "2019092013:21:03UTC"
      ],
      "datetime_format": "%Y%m%d%H:%M:%S%Z"
    }
  ],
  "note": "The sample data provided here is synthetic and intended to illustrate the data structure."
}
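How the summary is consumed is not shown in this appendix. As a minimal sketch, assuming it is flattened into plain text (for example, to give a model or a reader a compact view of the schema), the hypothetical helper `describe_columns` below turns each column entry into one descriptive line. The embedded JSON is a trimmed copy of the listing above, limited to two columns.

```python
import json

# Trimmed copy of the summary structure, limited to two of the listed columns.
summary_json = """
{
  "intro": "This dataset contains information about shared car locations, including various attributes that describe each entry.",
  "columns": [
    {"name": "latitude", "simplified_type": "float",
     "description": "Latitude coordinate of the car's location.",
     "sample_values": [32.083, 32.064615, 32.113535]},
    {"name": "total_cars", "simplified_type": "int",
     "description": "Total number of cars available at the location.",
     "sample_values": [0, 1, 3]}
  ]
}
"""


def describe_columns(summary: dict) -> list:
    """Hypothetical helper: one text line per column (name, type, description, samples)."""
    lines = []
    for column in summary["columns"]:
        samples = ", ".join(str(value) for value in column["sample_values"])
        lines.append(
            f"{column['name']} ({column['simplified_type']}): "
            f"{column['description']} Examples: {samples}"
        )
    return lines


schema_lines = describe_columns(json.loads(summary_json))
print("\n".join(schema_lines))
```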
166
Bibliography
[1] Anastasios Nikolakopoulos, Matilde Julian Segui, Andreu Belsa Pellicer, Michalis
Kefalogiannis, Christos Antonios Gizelis, Achilleas Marinakis, Konstantinos
Nestorakis και Theodora Varvarigou. BigDaM: Efficient Big Data Management and
Interoperability Middleware for Seaports as Critical Infrastructures.Computers,
12(11):218, 2023.
[2] Anastasios Nikolakopoulos, Efthymios Chondrogiannis, Efstathios Karanastasis,
María José López Osa, Jordi Arjona Aroca, Michalis Kefalogiannis, Vasiliki Apos-
tolopoulou, Efstathia Deligeorgi, Vasileios Siopidis και Theodora Varvarigou. Scal-
able Data Profiling for Quality Analytics Extraction.IFIP International Conference on
Artificial Intelligence Applications and Innovations,σελίδες 177–189. Springer, 2024.
[3] Achilleas Marinakis, Matilde Julian Segui, Andreu Belsa Pellicer, Carlos E Palau,
Christos Antonios Gizelis, Anastasios Nikolakopoulos, Antonios Misargopoulos, Fil-
ippos Nikolopoulos-Gkamatsis, Michalis Kefalogiannis, Theodora Varvarigou και
others. Efficient Data Management and Interoperability Middleware in Business-
Oriented Smart Port Use Cases.IFIP International Conference on Artificial Intelligence
Applications and Innovations,σελίδες 108–119. Springer, 2022.
[4] Anastasios Nikolakopoulos, Antonios Litke, Alexandros Psychas, Eleni Veroni και
Theodora Varvarigou. Exploring the Potential of Offline LLMs in Data Science: A
Study on Code Generation for Data Analysis.IEEE Access, 13:64087–64114, 2025.
[5] Anastasios Nikolakopoulos, Alexandros Psychas, Antonios Litke και Theodora Var-
varigou. Leveraging Indoor Localization Data: The Transactional Area Network (TAN).
Electronics, 13(13):2454, 2024.
[6] Economist. The world’s most valuable resource is no longer oil, but data - Web Article,
2017.
[7] Yash Agrawal. The Accelerating Pace of Technological Trends - Adapting to Market
Dynamics as an IT Professionals - Web Article, 2023.
[8] Fabio Duarte. Amount of Data Created Daily - Web Article, 2024.
[9] Steven M Rinaldi. Modeling and simulating critical infrastructures and their inter-
dependencies.37th Annual Hawaii International Conference on System Sciences,
2004. Proceedings of the,σελίδες 8–pp. IEEE, 2004.
167
BIBLIOGRAPHY
[10] John D Moteff, Claudia Copeland, John W Fischer, Science Resources και Industry
Division. Critical infrastructures: What makes an infrastructure critical? Congres-
sional Research Service, Library of Congress Washington, DC, 2003.
[11] OpenAI. GPT-4o Large Language Model, 2024. accessed on 25 April 2025.
[12] Google. Gemini Large Language Model, 2024. accessed on 25 April 2025.
[13] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar,
Muhammad Usman, Nick Barnes και Ajmal Mian. A comprehensive overview of
large language models.arXiv preprint arXiv:2307.06435, 2023.
[14] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou,
Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong και others. A survey of
large language models.arXiv preprint arXiv:2303.18223, 2023.
[15] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna
Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke
Hüllermeier και others. ChatGPT for good? On opportunities and challenges of large
language models for education.Learning and individual differences, 103:102274,
2023.
[16] Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci και Claudia
Nerdel. Assessing student errors in experimentation using artificial intelligence and
large language models: A comparative study with human raters.Computers and
Education: Artificial Intelligence, 5:100177, 2023.
[17] Christof Ebert και Panos Louridas. Generative AI for software practitioners.IEEE
Software, 40(4):30–38, 2023.
[18] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer και Mike Lewis.
Generalization through memorization: Nearest neighbor language models.arXiv
preprint arXiv:1911.00172, 2019.
[19] Joonsang Baek, Quang Hieu Vu, Joseph K Liu, Xinyi Huang και Yang Xiang. A
secure cloud computing based framework for big data information management of
smart grid.IEEE transactions on cloud computing, 3(2):233–244, 2014.
[20] Andre Luckow, Ken Kennedy, Fabian Manhardt, Emil Djerekarov, Bennie Vorster
και Amy Apon. Automotive big data: Applications, workloads and infrastructures.
2015 IEEE International Conference on Big Data (Big Data),σελίδες 1201–1210.
IEEE, 2015.
[21] Apache Hadoop - Framework.https://hadoop.apache.org.
[22] Ivo D Dinov. Methodological challenges and analytic opportunities for modeling and
interpreting Big Healthcare Data.Gigascience, 5(1):s13742–016, 2016.
168
BIBLIOGRAPHY
[23] Kuljeet Kaur, Sahil Garg, Georges Kaddoum, Elias Bou-Harb και Kim Kwang Ray-
mond Choo. A big data-enabled consolidated framework for energy efficient software
defined data centers in IoT setups.IEEE Transactions on Industrial Informatics,
16(4):2687–2697, 2019.
[24] Praveen Kumar Donta, Boris Sedlak, Victor Casamayor Pujol και Schahram Dust-
dar. Governance and sustainability of distributed continuum systems: a big data
approach.Journal of Big Data, 10(1):1–31, 2023.
[25] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang
Qiu και Haihua Chen. Investigating code generation performance of ChatGPT with
crowdsourcing social data.2023 IEEE 47th Annual Computers, Software, and Ap-
plications Conference (COMPSAC),σελίδες 876–885. IEEE, 2023.
[26] Qiuhan Gu. Llm-based code generation method for golang compiler testing.Proceed-
ings of the 31st ACM Joint European Software Engineering Conference and Sympo-
sium on the Foundations of Software Engineering,σελίδες 2201–2203. ACM, 2023.
[27] Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller και Justin D
Weisz. The programmer’s assistant: Conversational interaction with a large language
model for software development.Proceedings of the 28th International Conference
on Intelligent User Interfaces,σελίδες 491–514. ACM, 2023.
[28] Ahmed Soliman, Samir Shaheen και Mayada Hadhoud. Leveraging pre-trained
language models for code generation.Complex & Intelligent Systems,σελίδες 1–26,
2024.
[29] CoNaLa. CoNaLa: The Code / Natural Language Challenge, 2024. accessed on 25
April 2025.
[30] Yusuke Oda. Django Dataset for Code Translation Tasks, 2024. accessed on 25
April 2025.
[31] Giovanni Pinna, Damiano Ravalico, Luigi Rovito, Luca Manzoni και Andrea
De Lorenzo. Enhancing Large Language Models-Based Code Generation by Lever-
aging Genetic Improvement.European Conference on Genetic Programming (Part of
EvoStar),σελίδες 108–124. Springer, 2024.
[32] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang,
Ying Li, Qianxiang Wang και Tao Xie. Codereval: A benchmark of pragmatic code
generation with generative pre-trained models.Proceedings of the 46th IEEE/ACM
International Conference on Software Engineering,σελίδες 1–12. ACM, 2024.
[33] Safwan Omari, Kshitiz Basnet και Mohammad Wardat. Investigating large language
models capabilities for automatic code repair in Python.Cluster Computing, 2024.
[34] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta,
Shin Yoo και Jie M Zhang. Large language models for software engineering: Survey
169
BIBLIOGRAPHY
and open problems.2023 IEEE/ACM International Conference on Software Engineer-
ing: Future of Software Engineering (ICSE-FoSE),σελίδες 31–53. IEEE, 2023.
[35] Man Fai Wong, Shangxin Guo, Ching Nam Hang, Siu Wai Ho και Chee Wei Tan.
Natural language generation and understanding of big code for AI-assisted program-
ming: A review.Entropy, 25(6):888, 2023.
[36] GitHub. GitHub Copilot AI Developer Tool, 2024. accessed on 25 April 2025.
[37] Google. DeepMind AlphaCode AI Developer Tool, 2024. accessed on 25 April 2025.
[38] Jianxun Wang και Yixiang Chen. A Review on Code Generation with LLMs: Ap-
plication and Evaluation.2023 IEEE International Conference on Medical Artificial
Intelligence (MedAI),σελίδες 284–289. IEEE, 2023.
[39] Hsiao Chuan Liu, Chia Tung Tsai και Min Yuh Day. A Pilot Study on AI-Assisted
Code Generation with Large Language Models for Software Engineering.International
Conference on Technologies and Applications of Artificial Intelligence,σελίδες 162–
175. Springer, 2024.
[40] IBM. What is Prompt Engineering?, 2024. accessed on 25 April 2025.
[41] Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo και Joyce Nakatumba-Nabende.
Prompt engineering in large language models.International conference on data intel-
ligence and cognitive informatics,σελίδες 387–402. Springer, 2023.
[42] Juan David Velásquez-Henao, Carlos Jaime Franco-Cardona και Lorena Cadavid-
Higuita. Prompt Engineering: a methodology for optimizing interactions with AI-
Language Models in the field of engineering.Dyna, 90(230):9–17, 2023.
[43] William Cain. Prompting change: exploring prompt engineering in large language
model AI and its potential to transform education.TechTrends, 68(1):47–57, 2024.
[44] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal
και Aman Chadha. A systematic survey of prompt engineering in large language
models: Techniques and applications.arXiv preprint arXiv:2402.07927, 2024.
[45] Luca Beurer-Kellner, Marc Fischer και Martin Vechev. Prompting is programming: A
query language for large language models.Proceedings of the ACM on Programming
Languages, 7(PLDI):1946–1969, 2023.
[46] Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg και Elena L
Glassman. ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis
Testing.Proceedings of the CHI Conference on Human Factors in Computing Systems,
σελίδες 1–18. ACM, 2024.
[47] Apache. Apache Spark - Framework.
170
BIBLIOGRAPHY
[48] Diego García-Gil, Sergio Ramírez-Gallego, Salvador García και Francisco Herrera. A
comparison on scalability for batch big data processing on Apache Spark and Apache
Flink.Big Data Analytics, 2(1):1–11, 2017.
[49] MongoDB - Framework.https://www.mongodb.com.
[50] Kubernetes - Framework.https://kubernetes.io.
[51] Odd Erik Gundersen και Sigbjørn Kjensmo. State of the art: Reproducibility in
artificial intelligence.Proceedings of the AAAI conference on artificial intelligence,
τόµος 32, 2018.
[52] MistralAI. Codestral Large Language Model, 2024. accessed on 25 April 2025.
[53] Alibaba Qwen. Qwen 2.5 Coder Large Language Model, 2024. accessed on 25 April
2025.
[54] IBM. What is data profiling? - Web Article.
[55] IEEE Spectrum Bill Sweet. The Smart Meter Avalanche - Web Article.https://
spectrum.ieee.org/the-smart-meter-avalanche#toggle-gdpr.
[56] IT News Australia Nate Cochrane. US smart grid to generate 1000 petabytes of data
a year - Web Article.https://www.itnews.com.au/news/us-smart-grid-to-generate-
1000-petabytes-of-data-a-year-170290.
[57] Data Dynamics. Analyzing Energy Consumption: Unleashing the Power of Data in
the Energy Industry - Web Article.https://www.datadynamicsinc.com/blog-analyzing-
energy-consumption-unleashing-the-power-of-data-in-the-energy-industry/.
[58] Big Data and Transport Understanding and assessing options - International Trans-
port Forum.https://www.itf-oecd.org/sites/default/files/docs/15cpb_bigdata_0.
pdf.
[59] Weiwei Jiang και Jiayun Luo. Big data for traffic estimation and prediction: a survey
of data and tools.Applied System Innovation, 5(1):23, 2022.
[60] Hira Zahid, Tariq Mahmood, Ahsan Morshed και Timos Sellis. Big data analytics in
telecommunications: literature review and architecture recommendations.IEEE/CAA
Journal of Automatica Sinica, 7(1):18–38, 2019.
[61] DataPorts Horizon 2020 EU Research Project.https://dataports-project.eu/.
[62] Microsoft Research AI4Science και Microsoft Azure Quantum. The impact of large
language models on scientific discovery: a preliminary study using gpt-4.arXiv
preprint arXiv:2311.07361, 2023.
[63] Anastasios Nikolakopoulos, Spyridon Evangelatos, Eleni Veroni, Konstantinos
Chasapas, Nikolaos Gousetis, Apostolos Apostolaras, Christos D. Nikolopoulos και
171
BIBLIOGRAPHY
Thanasis Korakis. Large Language Models in Modern Forensic Investigations: Har-
nessing the Power of Generative Artificial Intelligence in Crime Resolution and Suspect
Identification.2024 5th International Conference in Electronic Engineering, Informa-
tion Technology and Education (EEITE),σελίδες 1–5. IEEE, 2024.
[64] Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu
και Robert McHardy. Challenges and applications of large language models.arXiv
preprint arXiv:2307.10169, 2023.
[65] Reihaneh H Hariri, Erik M Fredericks και Kate M Bowers. Uncertainty in big data
analytics: survey, opportunities, and challenges.Journal of Big data, 6(1):1–16,
2019.
[66] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li και Peter J Liu. Exploring the limits of transfer learn-
ing with a unified text-to-text transformer.Journal of machine learning research,
21(140):1–67, 2020.
[67] Fabian Suchanek και Anh Tuan Luu. Knowledge bases and language models: Com-
plementing forces.International Joint Conference on Rules and Reasoning,σελίδες
3–15. Springer, 2023.
[68] Kai Kang, Yuqi Yang, Yijun Wu και Ren Luo. Integrating Large Language Models
in Bioinformatics Education for Medical Students: Opportunities and Challenges.
Annals of Biomedical Engineering,σελίδες 1–5, 2024.
[69] European Union Law. General Data Protection Regulation, 2016. accessed on 25
April 2025.
[70] Qian Wang, Minxin Du, Xiuying Chen, Yanjiao Chen, Pan Zhou, Xiaofeng Chen και
Xinyi Huang. Privacy-preserving collaborative model learning: The case of word vec-
tor training.IEEE Transactions on Knowledge and Data Engineering, 30(12):2381–
2393, 2018.
[71] Wei Dai, Isaac Wardlaw, Yu Cui, Kashif Mehdi, Yanyan Li και Jun Long. Data
profiling technology of data governance regarding big data: review and rethinking.
Information Technology: New Generations: 13th International Conference on Infor-
mation Technology,σελίδες 439–450. Springer, 2016.
[72] Ziawasch Abedjan, Lukasz Golab και Felix Naumann. Data profiling: A tutorial.Pro-
ceedings of the 2017 ACM International Conference on Management of Data,σελίδες
1747–1751, 2017.
[73] Ikbal Taleb, Mohamed Adel Serhani και Rachida Dssouli. Big data quality: a data
quality profiling model.World Congress on Services,σελίδες 61–77. Springer, 2019.
[74] Zhicheng Liu και Aoqian Zhang. Sampling for big data profiling: A survey.IEEE
Access, 8:72713–72726, 2020.
172
BIBLIOGRAPHY
[75] Zhicheng Liu και Aoqian Zhang. A survey on sampling and profiling over big data
(technical report).arXiv preprint arXiv:2005.05079, 2020.
[76] Bahaa Eddine Elbaghazaoui, Mohamed Amnai και Abdellatif Semmouri. Data pro-
filing over big data area: a survey of big data profiling: state-of-the-art, use cases and
challenges.Intelligent Systems in Big Data, Semantic Web and Machine Learning,
σελίδες 111–123. 2021.
[77] Júlia Colleoni Couto, Juliana Damasio, Rafael Bordini και Duncan Ruiz. New
trends in big data profiling.Science and Information Conference,σελίδες 808–825.
Springer, 2022.
[78] Anastasija Nikiforova. Definition and Evaluation of Data Quality: User-Oriented
Data Object-Driven Approach to Data Quality Assessment. Baltic Journal of Modern
Computing, 8(3), 2020.
[79] Marcel Altendeitering, ISST Fraunhofer και Tobias Moritz Guggenberger. Data Qual-
ity Tools: Towards a Software Reference Architecture. 2024.
[80] European Sea Ports Organisation - Conference.https://www.espo.be/.
[81] International Association of Ports and Harbors.https://www.iaphworldports.org.
[82] The Worldwide Network of Port Cities.http://www.aivp.org/en/.
[83] Athanasios Drougkas, Anna Sarri, Pinelopi Kyranoudi και Antigone Zisi. Port Cyber-
security. Good practices for cybersecurity in the maritime sector.ENSISA, 10:328515,
2019.
[84] JW Kim, JY Son και KK Yoon. An implementation of integrated interfaces for telecom
systems and TMS in vessels.International Journal of Engineering and Technology,
10(2):195–199, 2018.
[85] The Marketplace of the European Innovation Partnership on Smart Cities and Com-
munities.https://eu-smartcities.eu/.
[86] BigDataStack H2020 Project.https://www.bigdatastack.eu.
[87] SmartShip H2020 Project.https://www.smartship2020.eu/.
[88] Showkat Ahmad Bhat, Nen Fu Huang, Ishfaq Bashir Sofi και Muhammad Sultan.
Agriculture-food supply chain management based on blockchain and IoT: a narrative
on enterprise blockchain interoperability.Agriculture, 12(1):40, 2021.
[89] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-
Gavitt, Ramesh Karri και Siddharth Garg. Verigen: A large language model for ver-
ilog code generation.ACM Transactions on Design Automation of Electronic Systems,
29(3):1–31, 2024.
173
BIBLIOGRAPHY
[90] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang,
Shichao Liu και Qing Wang. ClarifyGPT: A Framework for Enhancing LLM-Based
Code Generation via Requirements Clarification.Proceedings of the ACM on Software
Engineering, 1(FSE):2332–2354, 2024.
[91] David Rau και Jaap Kamps. Query Generation Using Large Language Models.Ad-
vances in Information Retrieval,σελίδες 226–239. Springer, 2024.
[92] Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen tau Yih,
Joelle Pineau και Luke Zettlemoyer. Improving Passage Retrieval with Zero-Shot
Question Generation.Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing,σελίδες 3781–3797, Abu Dhabi, United Arab Emi-
rates, 2022. Association for Computational Linguistics.
[93] Beir Cellar. BEIR: A Heterogeneous Benchmark, 2024. accessed on 25 April 2025.
[94] Microsoft. TREC 2020 Deep Learning Track, 2020. accessed on 25 April 2025.
[95] Xuanhe Zhou, Zhaoyan Sun και Guoliang Li. Db-gpt: Large language model meets
database.Data Science and Engineering, 9(1):102–111, 2024.
[96] OpenAI. OpenAI Codex, 2024. accessed on 25 April 2025.
[97] DataRobot. DataRobot AI Solutions, 2024. accessed on 25 April 2025.
[98] ThoughtSpot. ThoughtSpot AI-Powered Analytics, 2024. accessed on 25 April 2025.
[99] Tableau. Tableau Data Visualization, 2024. accessed on 25 April 2025.
[100] Microsoft. Microsoft Power BI, 2024. accessed on 25 April 2025.
[101] Apache. Apache Flink - Framework.
[102] Apache. Apache Storm - Framework.
[103] Hriday Kumar Gupta και Rafat Parveen. Comparative study of big data frameworks.
2019 International Conference on Issues and Challenges in Intelligent Computing
Techniques (ICICT),τόµος 1, σελίδες 1–4. IEEE, 2019.
[104] Ovidiu Cristian Marcu, Alexandru Costan, Gabriel Antoniu και María S Pérez-
Hernández. Spark versus flink: Understanding performance in big data analytics
frameworks.2016 IEEE International Conference on Cluster Computing (CLUSTER),
σελίδες 433–442. IEEE, 2016.
[105] Jorge Veiga, Roberto R Expósito, Xoán C Pardo, Guillermo L Taboada και Juan
Tourifio. Performance evaluation of big data frameworks for large-scale data ana-
lytics.2016 IEEE International Conference on Big Data (Big Data),σελίδες 424–431.
IEEE, 2016.
[106] Guidovan Rossum. Python - Programming Language.
174
BIBLIOGRAPHY
[107] Abhinav Nagpal και Goldie Gabrani. Python for data analytics, scientific and techni-
cal applications.2019 Amity international conference on artificial intelligence (AICAI),
σελίδες 140–145. IEEE, 2019.
[108] Apache. PySpark Overview - Introduction.
[109] Efstratios Karypiadis, Anastasios Nikolakopoulos, Achilleas Marinakis, Vrettos
Moulos και Theodora Varvarigou. SCAL-E: An Auto Scaling Agent for Optimum Big
Data Load Balancing in Kubernetes Environments.2022 International Conference on
Computer, Information and Telecommunication Systems (CITS),σελίδες 1–5. IEEE,
2022.
[110] Apache NiFi - Framework.https://nifi.apache.org.
[111] Apache Flume.https://flume.apache.org/.
[112] Apache Kafka - Framework.https://kafka.apache.org.
[113] Kaggle. Netflix: Movies and TV Shows Dataset, 2024. accessed on 25 April 2025.
[114] Kaggle. Covid-19 Twitter Dataset, 2024. accessed on 25 April 2025.
[115] Kaggle. Shard Cars Locations Dataset: Location History of Shared Cars, 2024.
accessed on 25 April 2025.
[116] AutoTel. AutoTel Project in Tel Aviv, 2024. accessed on 25 April 2025.
[117] MavenAnalytics. Madrid Daily Weather Dataset: Daily weather conditions in Madrid
from 1997-2015, 2024. accessed on 25 April 2025.
[118] Kaggle. Supermarket Sales Dataset: Historical Record of Sales Data in 3 Different
Supermarkets, 2024. accessed on 25 April 2025.
[119] Varun Dogra, Sahil Verma, Marcin Woźniak, Jana Shafi, Muhammad Fazal Ijaz και
others. Shortcut Learning Explanations for Deep Natural Language Processing: A
Survey on Dataset Biases.IEEE Access, 2024.
[120] OpenAI. HumanEval Dataset, 2024. accessed on 25 April 2025.
[121] Tianyang Liu, Canwen Xu και Julian McAuley. Repobench: Benchmarking
repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091,
2023.
[122] Meta. CRUXEval: Code Reasoning, Understanding, and Execution Evaluation, 2024.
accessed on 25 April 2025.
[123] Meta. CodeLlama: A State-of-the-Art Large Language Model for Coding, 2023. ac-
cessed on 25 April 2025.
[124] DeepSeekAI. DeepSeek Coder 33B Large Language Model, 2024. accessed on 25
April 2025.
175
BIBLIOGRAPHY
[125] Meta. Llama 3: Openly Available Large Language Model, 2024. accessed on 25 April
2025.
[126] DataCamp. What is Mistral’s Codestral? Key Features, Use Cases, and Limitations,
2024. accessed on 25 April 2025.
[127] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu
Liu, Jiajun Zhang, Bowen Yu, Kai Dang και others. Qwen2. 5-Coder Technical
Report.arXiv preprint arXiv:2409.12186, 2024.
[128] Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan και Fang Zhang. Qwen-2.5 Out-
performs Other Large Language Models in the Chinese National Nursing Licensing
Examination: Retrospective Cross-Sectional Comparative Study.JMIR Medical Infor-
matics, 13:e63731, 2025.
[129] Qiang Yin, Jianhua Wang, Sheng Du, Jianquan Leng, Jintao Li, Yinhao Hong, Feng
Zhang, Yunpeng Chai, Xiao Zhang, Xiaonan Zhao και others. An adaptive elastic
multi-model big data analysis and information extraction system.Data Science and
Engineering, 7(4):328–338, 2022.
[130] Elad Yom-Tov, Shai Fine, David Carmel και Adam Darlow. Learning to estimate
query difficulty: including applications to missing content detection and distributed
information retrieval.Proceedings of the 28th annual international ACM SIGIR confer-
ence on Research and development in information retrieval,σελίδες 512–519, 2005.
[131] Jiafeng Guo και Yanyan Lan. Query classification.Query Understanding for Search
Engines,σελίδες 15–41, 2020.
[132] Yonatan Belinkov και James Glass. Analysis methods in neural language processing:
A survey.Transactions of the Association for Computational Linguistics, 7:49–72,
2019.
[133] Emily M Bender και Alexander Koller. Climbing towards NLU: On meaning, form,
and understanding in the age of data.Proceedings of the 58th annual meeting of the
association for computational linguistics,σελίδες 5185–5198, 2020.
[134] OTE. OTE Group of Companies.
[135] HuggingFace. Hugging Face: Codestral v01 Large Language Model, 2024. accessed
on 25 April 2025.
[136] LMStudio. LM Studio Server, 2025. accessed on 25 April 2025.
[137] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes and Yejin Choi. The curious case of
neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
[138] All You Need to Know about Location Data. https://www.quadrant.io/resources/location-data.
accessed on 25 April 2025.
[139] J.H. Huh and K. Seo. An Indoor Location-Based Control System Using Bluetooth
Beacons for IoT Systems. Sensors, 17:2917, 2017.
[140] Y Zhuang, J Yang, Y Li, L Qi and N El-Sheimy. Smartphone-based indoor localization
with Bluetooth low energy beacons. Sensors, 16:596, 2016.
[141] R Faragher and R Harle. Location Fingerprinting With Bluetooth Low Energy Beacons.
IEEE Journal on Selected Areas in Communications, 33:2418–2428, 2015.
[142] P. H. Legay and G. Roullet. LION and MAX, the experiences of two ESPRIT Projects
on High-Speed MANs. High-Capacity Local and Metropolitan Area Networks: Architecture
and Performance Issues. Berlin/Heidelberg, Germany, 1991.
[143] Robert Jordan and Chaouki T. Abdallah. Wireless communications and networking:
An overview. IEEE Antennas and Propagation Magazine, 44:185–193, 2002.
[144] J. Crisp and B. Elliott. LANs and Topology. Oxford, UK, 2005.
[145] R. M. Heiberger and E. Neuwirth. Polynomial regression. R Through Excel: A Spreadsheet
Interface for Statistics, Data Analysis, and Graphics, pages 269–284. New
York, NY, USA, 2009.
[146] F Lemic, V Handziski, M Aernouts, T Janssen, R Berkvens, A Wolisz and J Famaey.
Regression-based estimation of individual errors in fingerprinting localization. IEEE
Access, 7:33652–33664, 2019.
[147] M.I. Ribeiro. Kalman and Extended Kalman Filters: Concept, Derivation and Properties.
Institute for Systems and Robotics, 43:3736–3741, 2004.
[148] Z Chen, Q Zhu and Y. C. Soh. Smartphone inertial sensor-based indoor localization
and tracking with iBeacon corrections. IEEE Transactions on Industrial Informatics,
12:1540–1549, 2016.
[149] H Zou, Z Chen, H Jiang, L Xie and C Spanos. Accurate indoor localization and
tracking using mobile phone inertial sensors, WiFi and iBeacon. Proceedings of the
IEEE International Symposium on Inertial Sensors and Systems (INERTIAL), pages
1–4, Kauai, HI, USA, 2017.
[150] R.K. Yadav, B. Bhattarai, H.S. Gang and J.Y. Pyun. Trusted K Nearest Bayesian
Estimation for Indoor Positioning System. IEEE Access, 7:51484–51498, 2019.
[151] T.M.T. Dinh, N.S. Duong and K. Sandrasegaran. Smartphone-Based Indoor Positioning
Using BLE iBeacon and Reliable Lightweight Fingerprint Map. IEEE Sensors
Journal, 20:10283–10294, 2020.
[152] A.R. Pratama and R. Hidayat. Smartphone-based pedestrian dead reckoning as
an indoor positioning system. Proceedings of the 2012 International Conference on
System Engineering and Technology (ICSET), Bandung, West Java, Indonesia, 2012.
[153] Apple Inc. Software Development Kit (SDK) for iOS. Available online:
https://developer.apple.com/ios/ (accessed on 25 April 2025).
[154] Google Inc. Software Development Kit (SDK) for Android. Available online:
https://developer.android.com/tools (accessed on 25 April 2025).
[155] T.D. Vy, T. L. Nguyen and Y Shin. A precise tracking algorithm using PDR and
Wi-Fi/iBeacon corrections for smartphones. IEEE Access, 9:49522–49536, 2021.
[156] K Elgui, P Bianchi, F Portier and O Isson. Learning methods for RSSI-based geolocation:
A comparative study. Pervasive and Mobile Computing, 67:101199, 2020.
[157] N Duong and T Dinh. On the accuracy of iBeacon-based Indoor Positioning System
in the iOS platform. Proceedings of the 2021 18th International Multi-Conference on
Systems, Signals & Devices (SSD), pages 58–62, Monastir, Tunisia, 2021.
[158] M Abbas, M Elhamshary, H Rizk, M Torki and M Youssef. WiDeep: WiFi-based Accurate
and Robust Indoor Localization System using Deep Learning. Proceedings of the
2019 IEEE International Conference on Pervasive Computing and Communications
(PerCom), pages 1–10, Kyoto, Japan, 2019.
[159] W Njima, A Bazzi and M Chafii. DNN-Based Indoor Localization Under Limited Dataset
Using GANs and Semi-Supervised Learning. IEEE Access, 10:69896–69909, 2022.
[160] T Yang, A Cabani and H Chafouk. A Survey of Recent Indoor Localization Scenarios
and Methodologies. Sensors, 21:8086, 2021.
[161] F Zafari, A Gkelias and K. K. Leung. A Survey of Indoor Localization Systems and
Technologies. IEEE Communications Surveys & Tutorials, 21:2568–2599, 2019.
[162] B Jang and H Kim. Indoor Positioning Technologies Without Offline Fingerprinting
Map: A Survey. IEEE Communications Surveys & Tutorials, 21:508–525, 2019.
[163] S Subedi and J Pyun. A Survey of Smartphone-Based Indoor Positioning System
Using RF-Based Wireless Technologies. Sensors, 20:7230, 2020.
[164] M Mallik, A Panja and C Chowdhury. Paving the way with machine learning for
seamless indoor–outdoor positioning: A survey. Information Fusion, 94:126–151,
2023.
[165] Tech Insights. Apple U1 Ultra Wideband (UWB) Chip Analysis. Available online:
https://www.techinsights.com/blog/apple-u1-tmka75-ultra-wideband-uwb-chip-analysis
(accessed on 25 April 2025).
[166] A. Prorok and A. Martinoli. Accurate indoor localization with ultra-wideband using
spatial models and collaboration. International Journal of Robotics Research, 33:547–568,
2014.
[167] A. Alarifi, A. Al-Salman, M. Alsaleh, A. Alnafessah, S. Al-Hadhrami, M.A. Al-Ammar
and H.S. Al-Khalifa. Ultra Wideband Indoor Positioning Technologies: Analysis and
Recent Advances. Sensors, 16:707, 2016.
[168] P. Mayer, M. Magno and L. Benini. Self-Sustaining Ultrawideband Positioning System
for Event-Driven Indoor Localization. IEEE Internet of Things Journal, 11:1272–1284,
2024.
[169] G.I. Hapsari, R. Munadi, B. Erfianto and I.D. Irawati. Future Research and Trends
in Ultra-Wideband Indoor Tag Localization. IEEE Access, 2024.
[170] Apple Inc. Apple iBeacon. Available online: https://developer.apple.com/ibeacon/
(accessed on 25 April 2025).
[171] Google. Eddystone Beacon API. Available online:
https://github.com/google/eddystone (accessed on 25 April 2025).
[172] Apple Inc. Turning an iOS Device into an iBeacon Device. Available online:
https://developer.apple.com/documentation/corelocation/turning_an_ios_device_into_an_ibeacon_device
(accessed on 25 April 2025).
[173] Apple Inc. Multipeer Connectivity: Support Peer-to-Peer Connectivity and the Discovery
of Nearby Devices. Available online:
https://developer.apple.com/documentation/multipeerconnectivity (accessed on 25 April 2025).
[174] Google. Nearby Connections API. Available online:
https://developers.google.com/nearby/connections/overview (accessed on 25 April 2025).
[175] Google. Nearby Messages API. Available online:
https://developers.google.com/nearby/messages/overview (accessed on 25 April 2025).
[176] Apple Inc. Swift Programming Language. Available online:
https://developer.apple.com/swift/ (accessed on 25 April 2025).
[177] F. Dalkılıç, U.C. Çabuk, E. Arıkan and A. Gürkan. An analysis of the positioning
accuracy of iBeacon technology in indoor environments. Proceedings of the 2017
International Conference on Computer Science and Engineering (UBMK), pages 549–553,
Antalya, Turkey, 2017.
[178] Apple Inc. Multipeer Connectivity: Maximum Number of Peers. Available online:
https://developer.apple.com/documentation/multipeerconnectivity/mcbrowserviewcontroller/1406954-maximumnumberofpeers
(accessed on 25 April 2025).
List of Abbreviations
AI Artificial Intelligence
API Application Programming Interface
BLE Bluetooth Low Energy
CI Critical Infrastructures
DB Database
GAN Generative Adversarial Network(s)
GB/GiB Gigabyte(s)
EKF Extended Kalman Filtering
EU European Union
FP Fingerprinting
IMU Inertial Measurement Unit(s)
IPS Indoor Positioning System
JSON JavaScript Object Notation
kNN k-Nearest Neighbors (Algorithm)
LLM Large Language Model(s)
MB Megabyte(s)
ML Machine Learning
MC Multipeer Connectivity
PDR Pedestrian Dead Reckoning
PRM Polynomial Regression Model
P2P Peer To Peer
RSSI Received Signal Strength Indicator
TAN Transactional Area Network
TB Terabyte(s)
UUID Universally Unique Identifier
UWB Ultra-Wideband
VDC Virtual Data Container
VDR Virtual Data Repository
SQL Structured Query Language