National Technical University of Athens

School of Electrical and Computer Engineering

Division of Communications, Electronics & Information Systems

Optimizing Data Analysis Through Offline Large Language Models and Scalable Data Management Techniques

Research, Study and Implementation

PhD Thesis

ANASTASIOS NIKOLAKOPOULOS

Supervisor: Theodora Varvarigou

Professor, NTUA

Athens, May 2025

Approved by the examination committee on 8th May 2025.

Theodora Varvarigou, Professor, NTUA
Emmanouil Varvarigos, Professor, NTUA
Symeon Papavasileiou, Professor, NTUA
Konstantinos Tserpes, Assistant Professor, NTUA
Dimitrios Askounis, Professor, NTUA
Antonios Litke, Lecturer, HAA
Anastasios Doulamis, Professor, NTUA

Athens, May 2025

Anastasios Nikolakopoulos, 2025.

The copying, storage and distribution of this PhD diploma thesis, in whole or in part, is prohibited for commercial purposes. Reprinting, storage and distribution for non-profit, educational or research purposes is allowed, provided that the source is indicated and that this message is retained.

The content of this thesis does not necessarily reflect the views of the Department, the Supervisor, or the committee that approved it.

DISCLAIMER ON ACADEMIC ETHICS AND INTELLECTUAL PROPERTY RIGHTS

Being fully aware of the implications of copyright laws, I expressly state that this PhD diploma thesis, as well as the electronic files and source code developed or modified in the course of this thesis, are solely the product of my personal work and do not infringe any rights of intellectual property, personality and personal data of third parties, do not contain work or contributions of third parties for which the permission of the authors or beneficiaries is required, and are not a product of partial or complete plagiarism, while the sources used are limited to the bibliographic references only and meet the rules of scientific citation. The points where I have used ideas, text, files and/or sources of other authors are clearly mentioned in the text with the appropriate citation, and the relevant complete reference is included in the bibliographic references section. I fully, individually and personally undertake all legal and administrative consequences that may arise in the event that it is proven, in the course of time, that this thesis or part of it does not belong to me because it is a product of plagiarism.

Anastasios Nikolakopoulos

PhD, National Technical University of Athens

Athens, May 2025

...To my loving parents, Konstantinos & Georgia, and my beloved wife, Chrysa

Abstract

The importance of data across various sectors demands innovative approaches to data management and analytics. This PhD thesis investigates the integration of offline large language models (LLMs) for automated code generation, aiming to streamline data analysis processes and thus enhance the scalability and efficiency of data management systems. By leveraging offline LLMs, the proposed approach empowers users to perform data analyses without extensive programming skills, thereby democratizing data analytics. The research delves into the architecture and implementation of scalable data management systems that can efficiently handle datasets of varying volumes. Based on an efficient data management platform, the capabilities of offline LLMs to generate analytical code are examined, showcasing how these models can transform user queries into executable scripts that facilitate data manipulation and interpretation. Through experiments and case studies, the practical applications and benefits of the proposed study are showcased. The results highlight the potential of offline large language models in data science and analysis. This thesis contributes to the field by presenting a study that integrates AI-driven code generation with robust data management practices, ultimately paving the way for more efficient and user-friendly data analytics solutions.

Keywords

Data Science, Data Analysis, Machine Learning, Artificial Intelligence, Large Language Models, Indoor Localization, Code Generation, Big Data

Abstract in Greek

The importance that data has acquired across diverse sectors makes it necessary to develop innovative approaches to its management and analysis. This doctoral thesis investigates the integration of local Large Language Models (offline LLMs) for automatic code generation, with the aim of simplifying data analysis processes and, by extension, improving the scalability and efficiency of data management systems. By leveraging locally executed LLMs, the proposed approach enables users without extensive programming knowledge to carry out data analyses, thereby contributing to the democratization of data science. The research focuses on the architecture and implementation of scalable data management systems capable of efficiently handling large-volume datasets. Building on an efficient data management platform, the work examines the capabilities of offline LLMs for generating analytical code, showing how these models can transform users’ natural-language questions into executable scripts that facilitate the handling and interpretation of data. Through experimental procedures and case studies, the practical applications and benefits of the proposed approach are presented. The results highlight the efficiency of locally executed Large Language Models in data science and analysis. The thesis contributes to the field by presenting a study that combines AI-based automatic code generation techniques with robust data management practices, paving the way for more efficient and user-friendly data analysis solutions.

Keywords

Data Science, Data Analysis, Machine Learning, Artificial Intelligence, Large Language Models, Indoor Localization, Code Generation, Big Data

Acknowledgements

I would like to express my gratitude to my supervisor, Dr. Theodora Varvarigou, for her unwavering trust, insightful guidance, and continuous support throughout the course of my doctoral journey. I am also thankful to the members of my examination committee for their valuable feedback, constructive suggestions, and contribution to the evaluation and improvement of this dissertation. Moreover, I would like to extend my appreciation to my colleagues at the Distributed Knowledge and Media Systems (DKMS) laboratory for their collaboration, insightful discussions, and the positive and inspiring research environment they helped foster. Last but not least, my heartfelt thanks go to my parents and my wife for their endless patience, understanding, and unconditional support, which gave me the strength and motivation to persevere during challenging times.

Athens, May 2025

Anastasios Nikolakopoulos

Table of Contents

Abstract
Abstract in Greek
Acknowledgements
Prelude
Extended PhD Thesis Summary in Greek

I Introduction
1 Introduction: Towards Efficient Data Management and Analysis
1.1 Scalable Data Profiling for Quality Analytics Extraction
1.2 A Scalable Data Management and Interoperability Solution
1.2.1 Data Management Challenges
1.2.2 Efficient Data Management Layer Proposal
1.3 Large Language Models and their Impact in Modern Applications
1.3.1 Applications and Impact of LLMs
1.3.2 Exploring Offline LLMs in Data Analysis

II Main Analysis
2 Literature Review: Assessing Similar and Existing Approaches to Data Management and AI-Powered Analysis
2.1 Generic Approaches to Scalable Data Management and Analytics
2.1.1 Data Profiling and Quality Analysis
2.1.2 User-Defined Quality Rules
2.2 Related Scalable Data Management Proposals and Solutions
2.2.1 EU Research Projects and Initiatives
2.2.2 Research Work on Scientific Literature
2.3 Applications of LLMs for Code Generation and Data Analysis Operations
2.3.1 Code Generation Techniques
2.3.2 Literature Surveys
2.3.3 Other Applications and Market Tools
2.3.4 Prompt Engineering

3 Study Design, Methodology, and Implementation: A Complete Analysis of the Framework
3.1 Initial Generic Overview
3.1.1 Outline
3.1.2 Framework Operational Flows
3.1.3 Technologies To Use
3.2 The Scalable Data Management Software Infrastructure
3.2.1 Data Processing and Virtualization
3.2.2 Pre-Processing and Filtering Tool
3.2.3 Virtual Data Repository
3.2.4 Virtual Data Container
3.3 The Final System with the Utilization of Offline LLMs
3.3.1 System Architecture
3.3.2 Dataset Selection
3.4 Technical Components
3.4.1 LLMs and Data Processing Platform
3.4.2 Prompt Formulation
3.4.3 Dataset Queries
4 Testing and Results: Assessing the Study’s Performance
4.1 Validating the Scalable Capabilities of the Underlying Infrastructure
4.2 Evaluating the Final System
4.2.1 Physical Machine Specifications
4.2.2 Testing Results Collection
4.2.3 Evaluation
5 Thesis Findings: Critical Assessment of the Proposed System
5.1 General Note
5.2 Validating Scalable Data Management Frameworks
5.3 Entrusting Offline LLMs for Future Data Analytics
5.3.1 Strengths and Limitations
5.3.2 Future Work
6 Future Use Case: Real-Time Indoor Localization Frameworks
6.1 Indoor Localization in Today’s Applications
6.1.1 Location Information
6.1.2 Indoor Localization as an Area Network
6.2 Literature Review
6.2.1 RSS, PDR and Filtering Techniques
6.2.2 Machine Learning and Ultra Wideband
6.3 TAN Conceptualization
6.3.1 Scope and Characteristics
6.3.2 Technical Details
6.3.3 Use Cases

6.4 Proof-of-Concept Implementation
6.4.1 Analysis
6.4.2 Testing and Results
6.5 Findings

III Epilogue
7 Conclusions: Towards Scalable Data Management Empowered By Offline Large Language Models
7.1 Managing Scalable Volumes of Data
7.2 Leveraging Offline LLMs for Data Analysis

Appendices
A Appendix
A.1 The Python code readability calculation function
A.2 Summary of the ‘Shared Cars Locations’ Dataset

Bibliography
List of Abbreviations

List of Figures

1.1 The 16 critical infrastructure sectors of global industry and economy [1].
3.1 The initial architecture of the proposed scalable data management and analysis framework, along with its two operational flows, ‘Data Profiler’ and ‘Data Evaluator’ [2].
3.2 Architectural view of the scalable data management layer, with its three subcomponents [1].
3.3 The information flow of the Virtual Data Container [3].
3.4 Accepted operators by VDC’s Rules System [1].
3.5 The proposed system’s final architecture overview [4].
4.1 A JSON object from the “urn-ngsi-ld-ITI-Customs” dataset, containing information for a given port’s customs item [1].
4.2 A JSON object that contains information for a random-anonymized-OTE cellular network user in the area of Thessaloniki, Greece [1].
4.3 Screenshot from scalable data management layer’s Apache Spark Cluster WebUI [1].
4.4 Screenshot from Spark cluster driver’s logs [1].
4.5 A screenshot from the Apache Spark Cluster’s WebUI, for the November 2022 subset [1].
4.6 Screenshot from the logs of the Spark driver “driver-20230901183325-0000”, for the November 2022 subset [1].
4.7 A screenshot from the Spark Cluster’s WebUI, for the Customs subset containing 6 million objects [1].
4.8 A screenshot from “driver-20230901184317-0001” Spark cluster driver’s logs, for the 6 million subset [1].
4.9 A screenshot from Spark Cluster’s WebUI, for the Customs subset comprising 11 million objects [1].
4.10 A screenshot from the logs of the Spark driver “driver-20230901185350-0002”, for the 11 million subset [1].
4.11 A screenshot from the Spark Cluster’s WebUI, for the complete Customs dataset [1].
4.12 A screenshot from the logs of the Spark driver “driver-20230901191216-0003”, for the complete Customs subset [1].

4.13 A screenshot from the Spark Cluster’s WebUI, for the Mobility subset of 2019 with 6.9 million objects [1].
4.14 A screenshot from the logs of the Spark driver “driver-20231004114021-0000”, regarding the Mobility 2019 subset [1].
4.15 The Spark Cluster’s WebUI, depicting the driver “driver-20231004114951-0001” in running state, for the 2020 Mobility subset [1].
4.16 A screenshot from the “driver-20231004114951-0001” Spark driver’s logs, for the Mobility 2020 subset [1].
4.17 A screenshot from the Spark Cluster’s WebUI, where the driver “driver-20231004115659-0002” is depicted in active state, for the Mobility 2021 subset [1].
4.18 A screenshot from the “driver-20231004115659-0002” Spark driver’s logs, which handled the Mobility subset of 2021 [1].
4.19 A screenshot from the Spark Cluster’s WebUI, where the driver “driver-20231004120403-0003” is depicted in active state, for the complete Mobility dataset [1].
4.20 A screenshot from the Spark driver’s “driver-20231004120403-0003” logs, as it ceased its operation on the complete Mobility dataset [1].
4.21 Scatter plot with the tests of ITI Customs and OTE Mobility subsets, in different plot lines [1].
4.22 Scatter plot including all tests, in the same plot line [1].
4.23 Scatter plot including the final ITI Customs dataset [1].
4.24 Functional correctness of the LLM’s generated code plot [4].
4.25 Plot depicting the functional correctness, by dataset [4].
4.26 Plot for the functional correctness scores by the queries’ complexity levels [4].
4.27 Plot for the distribution of readability scores across the tests conducted [4].
4.28 Plot depicting the average readability scores by dataset [4].
4.29 Plot displaying the average readability by query complexity level [4].
4.30 Distribution of the LLM server’s response time during code generation [4].
4.31 Distribution of the LLM server’s CPU usage during code generation [4].
4.32 Distribution of the LLM server’s memory usage during code generation [4].
4.33 Distribution of the LLM server’s GPU usage during code generation [4].
4.34 Distribution of the LLM server’s GPU memory usage during code generation [4].
4.35 The LLM server’s average response time, by code readability [4].
4.36 The LLM server’s average response time, by query complexity level [4].
4.37 Average CPU usage of the LLM server, by readability and query complexity [4].
4.38 The LLM server’s average GPU usage, by readability and query complexity [4].
4.39 Average response times of the LLM server, by readability and query complexity [4].
4.40 Average GPU memory usage of the LLM server, by readability and query complexity [4].

4.41 The number of automated and semi-automated tests, based on human intervention in their code [4].
4.42 Comparing Functional Correctness results with Automation occurrences [4].
4.43 Presenting the automation occurrences by the query complexity levels [4].
4.44 Comparing Functional Correctness results with error counts [4].
4.45 The number of errors found in the tests, per dataset [4].
4.46 The total count of errors, grouped by query complexity levels [4].
6.1 High-level overview of the Transactional Area Network’s implementation within an indoor environment [5].
6.2 A focused overview of the user communication in a TAN, showcasing the data-exchange pipelines between two users, as well as the existence of external hardware for optional assistance [5].
6.3 A Transactional Area Network operates across multiple devices in indoor environments. Each device transmits and receives BLE beacon signals, initiating peer-to-peer sessions, by also broadcasting and receiving Wi-Fi signals [5].
6.4 Swift code snippet that verifies and ensures the uniqueness of Major/Minor values [5].
6.5 Swift code snippet that checks the device’s ability to transmit as a beacon [5].
6.6 ServiceAdvertiser receives an invitation to join another peer’s session [5].
6.7 ServiceBrowser locates a peer and checks if they already exist in the device’s list with known peers [5].
6.8 The Major/Minor pairing process, and examining for matches in the known Peers list [5].
6.9 TAN prototype software’s internal components’ architecture [5].
6.10 A data sample from the TAN’s proof-of-concept software [5].
6.11 Establishment of TAN between two peers, as seen from Daniel Adams’ device (iPhone X) [5].
6.12 David Jones’s device (iPhone 8) showing the distance between him and Daniel Adams [5].
6.13 Alex Lopez with his device (iPad Pro 10.5 2017) entering TAN [5].
6.14 David Jones’s new distance from the users, as shown in his device’s screen [5].
6.15 Daniel Adams enters TAN through his iPhone X, viewing two peers nearby, John Doe and Catherine Smith [5].
6.16 Daniel Adams’s new distance from users John Doe and Catherine Smith [5].
6.17 John Doe takes one final look at his iPad’s screen, observing the distance between himself and users Daniel Adams and Catherine Smith [5].
6.18 Proposed improvement in TAN software’s internal components Architecture, highlighting the addition of Kalman filtering and outlier detection methods, along with a signal direction calculation process [5].

Prelude

In the era of digital transformation, data has emerged as one of the most valuable assets across industries, driving innovation, strategic decision-making, and operational efficiency. The rapid expansion of digital services, IoT deployments, and interconnected systems has led to an exponential increase in data generation. Organizations today face an unprecedented volume, velocity, and variety of data, making efficient data management and analytics critical requirements for maintaining competitiveness and achieving business and research objectives.

Traditional approaches to data analytics often require specialized knowledge of programming and query languages, data processing frameworks, and the data itself. This creates a barrier for non-technical users who wish to extract meaningful insights from datasets. Moreover, as datasets grow in complexity, ensuring data quality, managing storage efficiently, and executing large-scale analytical tasks in a timely manner present formidable challenges. It should also be noted that organizations tend to adhere to strict data privacy regulations, which highlights the need for solutions that ensure data integrity while preventing exposure to external cloud-based services.

Furthermore, current big data environments rely heavily on distributed computing frameworks (such as Apache Spark) to process large datasets efficiently. Such frameworks have proven to be robust approaches to big data management. However, defining queries, writing scripts, and optimizing execution pipelines remains a bottleneck, especially for end-users, analysts and decision-makers who are unfamiliar with programming or data technologies. Introducing offline LLMs into this landscape could transform how users interact with data, by automating the translation of human language into structured queries and execution plans.

The aforementioned challenges call for innovative approaches to data management and analytics. Recently, large language models (LLMs), such as GPT or Gemini, have demonstrated remarkable proficiency in understanding and generating human-like text. A future data management and analytics framework could benefit from an offline version of such LLMs, since the model’s advanced natural language processing capabilities would provide several advantages to data analysis, while also ensuring data privacy and security. The LLM’s role should be to interpret users’ natural language queries and translate them into executable code, which could then be run automatically within the framework itself. This translation process could bridge the gap between non-technical users and complex data analytics tasks, democratizing access to powerful data insights.

This thesis proposes a scalable data management framework augmented by the capabilities of an offline large language model (LLM). The framework aims to simplify data interaction by allowing users to issue natural language queries, which the LLM translates into executable code for the framework itself. Its architecture supports various data volumes, sources and formats, enabling seamless integration with existing data infrastructures. The core innovation of the framework lies in this use of offline LLMs to translate natural language queries into executable code, which enables users, even those with minimal technical expertise, to interact with complex datasets in an intuitive manner. The final framework also ensures data security and compliance, allowing organizations to retain full control over their datasets while maintaining computational efficiency.
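
As a rough illustration of the flow just described (a natural language query goes in, generated code comes out and is executed inside the framework), the following Python sketch may help. It is not the implementation developed in this thesis. The function names (`build_prompt`, `extract_code`, `answer`), the prompt wording, and the canned `fake_llm` stand-in for an offline model are all hypothetical, and a plain Python list of dictionaries stands in for the data management layer.

```python
import re
from typing import Callable, Dict, List

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def build_prompt(question: str, columns: List[str]) -> str:
    """Combine the user's question with the dataset's column names."""
    return (
        "You write Python code for data analysis.\n"
        f"The data is a list of dicts named rows with keys: {', '.join(columns)}.\n"
        "Store the answer in a variable named result.\n"
        f"Question: {question}\n"
    )


def extract_code(reply: str) -> str:
    """Return the first fenced code block in the reply, or the whole reply."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, reply, re.DOTALL)
    return match.group(1) if match else reply


def answer(question: str, rows: List[Dict], generate: Callable[[str], str]):
    """Ask the model for code, run it against the rows, return `result`."""
    prompt = build_prompt(question, sorted(rows[0].keys()))
    code = extract_code(generate(prompt))
    scope = {"rows": rows}
    # Illustration only: generated code should be sandboxed before execution.
    exec(code, scope)
    return scope["result"]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an offline model; returns a canned reply."""
    body = "result = sum(r['speed'] for r in rows) / len(rows)\n"
    return FENCE + "python\n" + body + FENCE


rows = [{"car": "A", "speed": 40.0}, {"car": "B", "speed": 60.0}]
print(answer("What is the average speed?", rows, fake_llm))  # prints 50.0
```

In such a design, `generate` is the only piece that touches the model, so a locally hosted LLM could replace `fake_llm` without changing the rest of the flow, and the data never leaves the machine.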

In summary, the proposed framework not only bridges the gap between end-users and data management and analytics technologies, but also enhances analytical workflows by (i) enabling natural language querying for complex data operations, (ii) automating the generation of optimized executable code for data management systems, (iii) providing a privacy-preserving alternative by using an offline LLM instead of cloud-based AI services, and (iv) ensuring scalability and adaptability across different industries and data domains. However, there are challenges to consider. The accuracy of the LLM in translating natural language queries into effective code is critical, since misinterpretations could lead to incorrect or suboptimal data analyses. Therefore, continuous refinement of the LLM’s prompt engineering is necessary to improve its performance. Furthermore, maintaining the system’s efficiency as data volumes grow requires careful evaluation of the underlying infrastructure.

The proposed scalable data management framework represents a potential advancement in the accessibility and efficiency of data analytics. By enabling non-technical users to perform complex data analyses through simple queries, the framework democratizes access to powerful data insights. The indoor localization data management use case acts as an example of the practical benefits and potential of this approach, highlighting its applicability across various domains. This integration of offline LLMs and data management solutions not only enhances accessibility, but also optimizes data workflows, making analytics more efficient and secure. This thesis presents a systematic approach to designing, implementing, and evaluating the framework, showcasing its potential to assist the field of scalable data management and analytics.

Extended PhD Thesis Summary in Greek

Context of the Thesis

In today’s rapidly evolving technological landscape, data stands out as one of the most valuable assets [6]. As technological developments take place with increasing frequency [7], the creation and dissemination of data have become integral elements of emerging products, services, and systems. Consequently, the volume of data generated daily worldwide is striking. According to recent estimates, approximately 328.77 million terabytes of data are produced every day [8], including newly created, captured, copied, or consumed data. This enormous flow of data has already proven its significant value across various sectors.

Maintaining high data quality is essential for all data consumers. There is a strong need to explore a scalable (or volume-independent) data analysis solution that allows users to easily define their own queries, tailored to their specific analytical needs. This doctoral thesis proposes a software system designed to handle data in a scalable manner, performing extensive quality assessments on entire datasets with the help of artificial intelligence and large language models (LLMs). The proposed system aligns with established data analysis practices, extracting quality information from datasets while giving users the ability to define their own quality criteria. It can apply data profiling mechanisms to entire datasets, regardless of their volume, and operate efficiently even in computing environments with limited resources, ensuring both the scalability and the performance of the system. The ease with which end users can carry out data analysis is also a primary objective of this thesis.

Specifically, there is a need for scalable data management solutions in various sectors. One of them is the sector of Critical Infrastructures (CIs). This term refers to the systems, networks, and infrastructures that are vital to the functioning of modern society and of its core sectors. These include, among others, Energy, Transportation, Water Supply, Telecommunications, Healthcare, Finance, Agriculture, and State/Government Facilities [9]. Given their fundamental role, any disruption of these infrastructures can lead to social and economic consequences [10]. It is therefore appropriate to regard CIs as the pillars on which the global economy rests.

From a data perspective, CIs are producers of large volumes of information, generating vast amounts of data every day in areas such as (infrastructure) operations, (process) monitoring, (system) maintenance, (network) security, and others. As the digital transition and interconnectivity continue to evolve, the rate of CI-related data generation is increasing rapidly. Managing and exploiting this data effectively is critical for optimizing performance, strengthening security, and supporting informed decision-making. Solutions that successfully meet the complex requirements of CIs are likely to perform well in less demanding business environments too. This makes CIs an ideal testbed for advancing data management technologies, such as the one that forms part of this thesis.

Addressing the challenges of scalable data analysis requires a multidimensional effort that brings together expertise in computer science, data engineering, machine learning, and domain knowledge, while also fostering collaboration among different stakeholders. As technology advances, new challenges are expected, making it necessary for researchers and practitioners to remain informed and adaptable in the evolving field of big data. The gap between optimal scalable data management and the exploitation of data by the organizations that produce it, as well as by third parties / beneficiaries, can be bridged by the study proposed in this thesis, through efficient and scalable data management.

Conceptually, the thesis system is a middleware framework: a software infrastructure that enables data management regardless of volume. Initially tailored to relieve CIs of the time-consuming task of correctly and fully managing the data produced in their sectors, it evolved into a software tool with applications in various domains. Its early conceptual form was presented in 2022 [3]. That initial approach has since been extended, restructured, improved, and implemented, leading to the present scalable data management framework.

With regard to optimal data analysis, in which end users are the primary analysts, artificial intelligence is the main development context. In particular, the rapid development of Large Language Models (LLMs) has created new opportunities for research and practical applications across diverse domains. The advanced capabilities of these models have attracted significant interest from the global research community, leading to increased investment aimed at advancing LLM-based applications. Well-known LLMs, such as OpenAI's GPT-4 [11] and Google's Gemini [12], have reached a level of sophistication at which they can understand, generate, and manipulate human language to a remarkable degree. As a result, these models are now applied in a wide range of fields, such as healthcare and engineering. These developments have set new standards in both research and industrial applications of language models, with the main emphasis on improving the accuracy of model outputs [13] [14].

In addition, large language models have shown considerable potential for improving various aspects of Data Science. For example, GPT-4 can perform tasks such as data cleaning, feature extraction, and even training other models with minimal human intervention [15] [16]. In data analysis, LLMs offer the advantage of automatically generating insights from datasets. In practice, these models can carry out traditional data analysis tasks, detect anomalies, and even provide predictive analytics for future use by data engineers and other professionals. Consequently, LLMs can be applied to process data from various sources, such as customer feedback, financial reports, and social media data, with the ultimate goal of extracting actionable insights. Their ability to understand and interpret different types of datasets makes them valuable tools for data scientists [17] [18].

The importance of local language models (local LLMs) for preserving data integrity and privacy should be noted. Organizations that handle sensitive data should also be able to exploit the capabilities of large language models. For example, users in such organizations could benefit significantly from interacting with language models simply by posing data analysis queries in natural language and receiving visualized results (or direct data samples) in return. Instead of manually selecting specific columns and applying predefined rules through a specialized environment (such as Microsoft Excel), end users and data scientists could simply write queries in natural language and see the results on their screen, thanks to the capabilities of language models.

In summary, offline LLMs can significantly assist data scientists in analyzing data and extracting insights, simplifying the submission of analysis queries and reducing user effort. This leads to the conclusion that businesses, organizations, and companies should consider adopting local models. The main part of this doctoral thesis examines how offline LLMs can enhance data analysis by generating code for data analysis tasks based on natural language queries. Keeping the data local removes the data integrity and security risks that arise when data is transmitted to online LLMs. This approach also addresses the limitation of LLMs in processing large datasets, which stems from their architectural and computational constraints. In the proposed method, the LLM does not load entire datasets; instead, it works with a concise summary. For each data profiling / analysis query, the model generates specialized code, which a software job then executes on the full dataset (running on the scalable data management software that forms part of this thesis) to produce the final results. In this way, both data integrity and processing limitations are handled effectively.

The final proposed solution consists of a local language model integrated into the described data management software infrastructure. The data remains under the organization's control, which ensures its security. For each dataset, the LLM is provided with concise metadata that lets it understand the context of the data under analysis. Each time the organization's data scientist submits a query in natural language, the model generates specialized code for analyzing the data. This code is applied to the data through dedicated software, and the final results are returned to the user. As noted earlier, this approach not only ensures data security but also makes the analysis more effective by addressing the main limitations of LLMs. It also simplifies the process for end users, since they can express their data analysis preferences in natural language.

Related Scientific Literature

Regarding efficient data management, many proposals and implementations have appeared to date. The main ones that fit the context of this thesis are presented below. First, Baek (2014) [19] proposed a cloud-based framework for big data management in smart electricity grids, emphasizing the role of hierarchical cloud centers in processing and analyzing smart meter data. Although the benefits of cloud computing, such as scalability, energy efficiency, and cost reduction, were highlighted, the work did not include a final efficiency evaluation. Similarly, Luckow (2015) [20] elaborated on the emergence of large data volumes in the automotive industry and the potential benefits of adopting Apache Hadoop [21], examining its scalability, its various processing engines, and its integration into existing software systems used in the automotive industry. Dinov (2016) [22] studied the complexities of managing large datasets in the healthcare sector, underlining the challenges of integrating heterogeneous data types and of attempting advanced analysis techniques on them. Ultimately, the team advocated hybrid solutions that primarily involve cloud computing.

Kaur (2019) [23] proposed an energy-efficient system with big data management capabilities for Software-Defined Data Centers (SDDCs) in IoT environments. The research focused on optimizing the deployment of Virtual Machines that could run the proposed data management system, aiming to reduce energy costs while maintaining its performance. Although it does not focus on scalable data integration, the work is relevant to the design of efficient software infrastructures. More recently, Donta (2023) [24] investigated Distributed Computing Continuum Systems (DCCS) for IoT data processing. The proposed data governance framework was presented in detail, but without fully addressing the challenges of interoperability and data integration.

Regarding optimal data analysis using artificial intelligence, and specifically language models that generate executable data analysis code, recent studies have explored the code generation capabilities of Large Language Models (LLMs) in various contexts. Feng (2023) [25] proposed a scalable system that leverages social media data to evaluate the performance of ChatGPT on tasks such as solving academic exercises and debugging code, without however focusing on data analysis or on the use of local LLMs. Gu (2023) [26] presented a compiler testing approach using encoder-decoder models, together with a code cleaning strategy for improving dataset quality, in line with this thesis's orientation toward code generation through LLMs. Ross (2023) [27] presented a prototype system that examines how practitioners interact conversationally with code-aware LLMs, suggesting their usefulness as productivity tools for software engineers. Likewise, Soliman (2024) [28] studied hybrid LLMs for higher accuracy in code generation, evaluating the models on standard datasets [29] [30].

Pinna (2024) [31] studies LLM-based code generation from problem descriptions, addressing the challenge of incorrect model outputs and reporting improved code quality through hybrid approaches. This thesis is consistent with that idea, integrating LLMs with additional software methodologies. Complementary efforts include the publication by Yu (2024) [32] on evaluating code generation models, and the study by Omari (2024) [33] on using ChatGPT to detect and fix bugs in simple Python programs. Both studies highlight the usefulness of LLMs in software development. Although these works do not focus on local language models or specifically on data analysis (both of which are key elements of this thesis), collectively they enrich it; the thesis extends their findings by applying code generation with local LLMs to scalable data analysis scenarios.

Fan (2023) [34] investigated the use of large language models in software engineering, identifying challenges and pointing out the need for hybrid approaches that combine LLMs with traditional methodologies for tasks such as writing code, design, and bug fixing. Wong (2023) [35] analyzed NLP techniques, which include LLMs trained on large code datasets, and their roles in AI-assisted programming, covering applications such as code generation and bug detection. That work also refers to AI tools such as GitHub Copilot [36] and DeepMind AlphaCode [37]. Wang (2023) [38] and Liu (2024) [39] further examined LLM-based code generation, stressing the need for improved techniques for evaluating the outputs that the models produce.

Prompt engineering is a critical component of the system proposed in this thesis. It involves designing and refining the inputs given to AI models so as to ensure accurate, relevant, and high-quality outputs [40] [41]. By shaping prompts effectively, developers can guide language models so that they better understand the developers' language, context, and intent, improving response quality in various applications, such as software development [42] [43]. This method can reduce the need for extensive post-processing of model responses, leading to more efficient workflows. Various prompting techniques have been developed, such as zero-shot, few-shot, and chain-of-thought prompting, with the aim of achieving the best possible interactions with a model while always taking the needs of each use case into account. Ongoing research on prompting is expected to propose new strategies for improving its effectiveness [44] [45] [46]. As large language models evolve, prompt engineering will play an increasingly important role in making the most of their capabilities.

Thesis System Description and Implementation Details

As noted above, the system proposed in this thesis consists of two parts. The first is a software base for optimal data management. The second (and primary) part is the intelligent data analysis system, which uses language models. The second part relies on the first in order to operate.

Starting with the data management software, it has been designed to support the data interoperability goals of the final platform. Its main responsibility is to satisfy the data quality requirements that are specific to each source, whether that source is a CI or a business. The software contributes to the proper preparation of data from various sources, maintains the relevant metadata for all incoming data streams, and makes the cleaned and processed datasets available. The complete architecture and data flow of the optimal data management software are illustrated in Figure 3.2.

Apache Spark [47] was selected as the technology on which the data management software is based, as it matches the needs of the platform. Spark supports a variety of programming languages, such as Java, Scala, Python, and R, and is backed by extensive documentation and a large user community. These characteristics contribute to its popularity and ease of adoption. In addition, Spark can deliver advanced analytics, which makes it suitable for complex data processing tasks. While other frameworks have their own advantages and limitations, Spark stands out for its ease of use, scalability, and efficient execution times [48]. The data management software is structured around three main subsystems that work together to fulfill its goals: the Pre-Processing and Filtering Tool, the Virtual Data Repository, and the Virtual Data Container.

The Pre-Processing and Filtering Tool is the first subsystem of the optimal data management software, responsible for preparing the datasets received from external sources. Designed with a generic and adaptable structure, this tool can handle a wide range of data types effectively. After receiving a complete dataset, it constructs a dataframe, which serves as a tabular representation of all the collected data. To maintain data consistency, the tool examines the metadata of each dataset in order to determine the appropriate data type for each column. When inconsistencies are detected, it performs corrections specific to each data type, an important step for ensuring the reliability of subsequent processing tasks, which depend on the structural integrity of the dataset. Once the data types have been aligned, the tool begins the cleaning and filtering process, applying a set of standard pre-processing techniques to prepare the dataset for further use:

• Removal of whitespace from all cells that contain string values, ensuring consistent formatting.

• Conversion of empty cells and occurrences of the string 'NULL' to NaN (Not a Number) values in all columns.

• Deletion of records (rows) that either contain no datetime values or contain invalid datetime entries.

• Conversion of valid datetime values to UTC, to ensure consistency and support standardization.
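The four cleaning steps listed above can be illustrated with a minimal sketch. It uses only the Python standard library on a list of dictionaries rather than a Spark dataframe, so it is not the thesis implementation. The function name `clean_rows`, the ISO 8601 input format, the reading of whitespace removal as trimming, and the choice to treat timezone-naive timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def clean_rows(rows, datetime_column):
    """Apply the four cleaning steps to a list of dict records (illustrative only)."""
    cleaned = []
    for row in rows:
        record = {}
        for key, value in row.items():
            # Step 1: trim surrounding whitespace from string cells.
            if isinstance(value, str):
                value = value.strip()
            # Step 2: map empty cells and the literal 'NULL' to NaN.
            if value is None or value == "" or value == "NULL":
                value = float("nan")
            record[key] = value

        # Step 3: drop rows whose datetime cell is missing or invalid.
        raw = record.get(datetime_column)
        if not isinstance(raw, str):
            continue
        try:
            # Assumption: timestamps arrive as ISO 8601 strings.
            parsed = datetime.fromisoformat(raw)
        except ValueError:
            continue

        # Step 4: convert valid datetimes to UTC.
        if parsed.tzinfo is None:
            # Assumption: timezone-naive values are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        record[datetime_column] = parsed.astimezone(timezone.utc)
        cleaned.append(record)
    return cleaned
```

Calling `clean_rows(rows, "ts")` on records whose `ts` field holds a timestamp returns only the rows with a parseable timestamp, each normalized to UTC. In a Spark setting the same steps would be expressed as column operations on the dataframe rather than as a Python loop.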

The second subsystem of the scalable data management layer is the Virtual Data Repository (VDR), which serves as a temporary storage unit for all datasets that have been pre-processed, cleaned, and filtered by the Pre-Processing and Filtering Tool. Once a dataset has completed these procedures, it is stored in the VDR together with its corresponding correlation matrix, which records the relationships between the columns.

To meet the performance and scalability requirements of the data management software, the VDR has been implemented using MongoDB [49], a widely used database. MongoDB was chosen for its suitability for auto-scaling scenarios, its sharding functionality, and its flexible configuration. In addition, MongoDB's compatibility with Kubernetes [50], a container orchestration platform, strengthens its ability to operate in dynamic, distributed environments. As a result, the VDR was deployed on a Kubernetes cluster, as well as on a Docker cluster, taking advantage of the platform's advanced load balancing, replication, and scaling capabilities and thereby ensuring consistent and efficient data management.

The third and final subsystem is the Virtual Data Container (VDC), which plays a decisive role in giving users access to the data stored in the Virtual Data Repository. The VDC is a flexible, general-purpose subsystem responsible for further processing and filtering the data according to the specific queries defined by the data consumers. The filtering rules serve two main functions. First, they allow a user to extract specific information, thereby creating a customized "data pool" that matches the user's specific requirements. Second, the user's analysis queries help to additionally filter out erroneous data that users recognize from experience, such as extreme outliers (e.g., outdoor temperatures of -100 degrees Celsius), which could indicate sensor malfunction.
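As an illustration of these two functions, the sketch below expresses user-defined filtering rules as plain Python predicates and applies them to a small set of readings. The names `apply_rules`, `plausible_temperature`, and `from_station`, the record fields, and the temperature bounds are hypothetical; in the described system such rules would be executed by the VDC on data held in the VDR.

```python
def apply_rules(rows, rules):
    """Keep only the rows that satisfy every user-defined rule."""
    return [row for row in rows if all(rule(row) for rule in rules)]


def plausible_temperature(row, low=-60.0, high=60.0):
    """Hypothetical sanity rule: reject physically implausible outdoor readings."""
    # The bounds are illustrative assumptions, not values from the thesis.
    return low <= row["temperature_c"] <= high


def from_station(station):
    """Hypothetical extraction rule: build a predicate that selects one station."""
    return lambda row: row["station"] == station
```

Passing `[plausible_temperature]` alone removes a reading of -100 degrees Celsius while keeping the rest; adding `from_station("A")` narrows the result to a custom data pool for one station. Modeling each rule as an independent predicate keeps the extraction rules and the sanity rules composable.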

Moving on to the second and main part of the proposed system, the ultimate goal, as already mentioned, is to design a scalable data analysis software tool that can efficiently handle datasets of various sizes. The final implementation of the system in this doctoral work aims to let users submit data analysis queries in natural language. These queries are interpreted and converted into executable code by a local large language model (offline LLM). The generated code is sent to the aforementioned data management software for execution. This process lets users submit their own analysis rules simply by describing them in natural language, improving the accessibility and usability of the system.

In summary, the final part of the study begins by establishing its main objective: evaluating the ability of a local LLM to generate valid code for optimal data analysis, based on analysis queries expressed in natural language. The next step of the study involves selecting suitable datasets for testing and evaluation. This is followed by the definition of five unique queries per dataset. An implementation plan is then developed, which specifies the number of trials and defines the communication procedure between the LLM and the data management software. The evaluation metrics are also defined, along with the rationale for choosing them. Next, the results are collected and analyzed to assess the performance of the LLM. Finally, conclusions are drawn from the findings, and suggestions for future improvements are presented.

In detail, the methodology followed in this study is as follows:

1. Objective Definition: The final and main part of this study aims to evaluate the effectiveness and efficiency of a local large language model (offline LLM) in generating code for data analysis tasks. The LLMs used in this evaluation are Codestral by Mistral AI and Qwen 2.5 Coder by Alibaba. These models are tested for their ability to produce Python code using the PySpark framework, with a focus on data analysis tasks. The reasons for choosing Codestral, Qwen, and PySpark are discussed in subsection 3.4.1.

2. Data Selection: Five publicly available datasets were selected to test and validate the proposed approach. These datasets cover a range of subject areas, such as social media analysis, weather data, and supermarket sales records. All selected datasets are presented in subsection 3.3.2. Using multiple datasets rather than a single one helps to assess how well the LLMs generalize across different types of data, ensuring that their performance is not limited to a single domain.

3. Query Design: For each of the five datasets, a set of five queries was created, resulting in 25 queries in total that are used for testing. These queries are grouped into three categories according to their complexity: "Basic", "Intermediate", and "Advanced". This categorization reflects the difficulty of the code generation tasks required of the LLM. The categories are defined as follows:

• Basic: These queries focus on simple tasks, such as filtering data, counting records, or extracting distinct values. They require fundamental operations that help reveal the basic structure and contents of the dataset.

• Intermediate: These queries involve moderately complex tasks, such as grouping data, computing sums, or applying basic arithmetic functions. Although more complex than the basic queries, they still rely on standard data processing methods.

• Advanced: These queries involve more sophisticated operations, such as multi-level grouping, column transformations (e.g., merging or exploding), and statistical analyses. They were designed to assess the LLM's ability to produce code for advanced data manipulation and, ultimately, to draw meaningful conclusions from the data.

The complete set of the 25 authored queries is presented in subsection 3.4.3.

4. Execution Plan: Each of the 25 queries will be submitted to the local LLM ten times. In every iteration the entire pipeline is executed and the results are recorded. This repetition is intended to assess the consistency of the model by observing variations in the generated code, as well as the stability and reproducibility of the outputs. The approach is in line with established experimental practice, where repeated trials strengthen the reliability of the findings [51]. In total, 250 trials will be executed (ten for each of the 25 queries) on the datasets selected for this study.

5. Evaluation Metrics: The outcome of each trial will be evaluated using a set of defined evaluation metrics. Once all 250 trials have been completed, the results will be gathered into a single evaluation dataset. Because the study is carried out independently for two local LLMs, the total number of trials will reach 500. The performance of each LLM will be analyzed according to the following evaluation criteria:

• Functional Correctness: This metric assesses whether the generated code produces the expected result. In other words, it examines whether the code successfully delivers the outcome the user had in mind, based on the original natural language query.

• Readability: This metric refers to how easily a human can read and understand the generated code. The score is computed using a custom function implemented within the data management platform, which is presented in Appendix A.1. The function examines three key aspects: the length of each line of code, the depth of function chains, and the level of nested code blocks. Penalties are applied to code with lines longer than 80 characters, method chains with more than three calls, and nesting beyond the second level. The function returns a score from '1' to '3', where '3' indicates high readability.

• Efficiency: This refers to how efficiently the code generation process uses computational resources. The main performance indicators include response time, GPU and CPU utilization, and memory usage for both processing units. They are tracked using a resource monitoring tool developed in Python. This metric helps to assess the computational cost associated with each trial.

• Performance per Query Difficulty: This metric assesses how well the LLM handles queries of different complexity levels (Basic, Intermediate, and Advanced), as defined earlier. The results provide insight into the model's ability to adapt to different analysis query requirements and allow performance comparisons across these categories.

• Automation: This metric assesses whether the generated code can be executed automatically or with minimal manual adjustments. In some cases, the model may include natural language explanations in addition to Python comments, despite explicit instructions to avoid such output. The LLM may also use variable names that differ from those specified in the prompt. When such deviations occur, manual intervention is required to modify the generated code before the process can continue.

• Error Handling: This metric checks whether the generated code runs without errors. It identifies any problems that arise during execution, from minor warnings to critical errors that may interrupt or terminate the process.

6. Data Collection and Analysis: The final step of the methodology involves collecting the results of all 250 trials (per model), merging them into a single dataset, and conducting exploratory analysis. Each LLM will therefore have a dataset containing 250 rows. This analysis will provide valuable insight into the performance of the LLMs across the various evaluation metrics. Ultimately, the central question of this part of the thesis will be assessed: whether local language models can effectively support data science by generating code for data analysis tasks.
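The readability metric described under the evaluation criteria can be sketched as follows. The actual function is the one given in Appendix A.1, which is not reproduced here; this version is only an approximation built on Python's `ast` module. The scoring rule (start at 3, subtract one point per violated criterion, never go below 1) is an assumption, as are all function names.

```python
import ast

MAX_LINE_LENGTH = 80   # lines longer than this are penalized
MAX_CHAIN_CALLS = 3    # method chains with more calls are penalized
MAX_NESTING = 2        # nesting beyond this level is penalized

# Statement types that open a nested block; an elif chain counts as nested ifs here.
BLOCKS = (ast.If, ast.For, ast.While, ast.With, ast.Try, ast.FunctionDef)


def chain_depth(node):
    """Count consecutive method calls ending at this node, e.g. a.b().c() -> 2."""
    depth = 0
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        depth += 1
        node = node.func.value
    return depth


def nesting_depth(node, depth=0):
    """Return the deepest level of nested block statements below this node."""
    if isinstance(node, BLOCKS):
        depth += 1
    children = [nesting_depth(child, depth) for child in ast.iter_child_nodes(node)]
    return max([depth] + children)


def readability_score(code):
    """Assumed scoring rule: 3 minus one point per violated criterion, minimum 1."""
    tree = ast.parse(code)  # raises SyntaxError if the generated code is invalid
    penalties = 0
    if any(len(line) > MAX_LINE_LENGTH for line in code.splitlines()):
        penalties += 1
    if max((chain_depth(n) for n in ast.walk(tree)), default=0) > MAX_CHAIN_CALLS:
        penalties += 1
    if nesting_depth(tree) > MAX_NESTING:
        penalties += 1
    return max(1, 3 - penalties)
```

Under these assumptions a short statement such as `result = df.filter(df.price > 10).count()` scores 3, while a single long chain of five calls loses one point. Working on the syntax tree rather than on raw text means chains that are split across several lines are still counted as one chain.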

Figure 3.5 presents a high-level overview of the final architecture developed in this thesis. The process begins when the end user submits a query in natural language. A relevant dataset has been pre-selected, which makes it possible to identify both the dataset summary and the main data file of the dataset that is required for executing the code within the data management platform. The user's query is combined with the dataset summary into a single message, which is sent to the LLM during the prompt engineering phase, as described in subsection 3.4.2. This prompt does not merely join the query with the summary; it also provides critical instructions to the Codestral and Qwen models.
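A minimal sketch of how such a prompt might be assembled is given below. The real prompt is the one described in subsection 3.4.2 and is not reproduced here; the instruction wording, the contents of the summary (row count, column names and types, a few sample rows), the variable names `df` and `result`, and the helper names `summarize` and `build_prompt` are all assumptions for illustration.

```python
def summarize(columns, sample_rows, row_count):
    """Build a short textual dataset summary (assumed format, not the thesis format)."""
    lines = [f"Rows: {row_count}", "Columns:"]
    lines += [f"- {name} ({dtype})" for name, dtype in columns]
    lines.append("Sample rows:")
    lines += [", ".join(str(value) for value in row) for row in sample_rows]
    return "\n".join(lines)


# Hypothetical instructions; they mirror the constraints mentioned in the text
# (code only, no free-text explanations, fixed variable names).
INSTRUCTIONS = (
    "Return only Python code that uses PySpark.\n"
    "Do not add explanations outside Python comments.\n"
    "Read from the DataFrame named `df` and store the output in `result`."
)


def build_prompt(query, summary):
    """Combine the instructions, the dataset summary, and the user query."""
    return f"{INSTRUCTIONS}\n\nDataset summary:\n{summary}\n\nUser query:\n{query}"
```

Only the summary reaches the model, never the full dataset, which is what allows the approach to stay independent of data volume. Fixing the variable names in the instructions is what later lets the platform run the returned code without manual edits.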

Once the LLM receives the prompt, it begins generating its response. During this phase, a simple Python script monitors the computational resources of the LLM's server. After the model produces its response, the response is sent back through a software communication channel, together with the resource monitoring data. The user then examines the response to determine whether the code can proceed directly to the next phase (indicating full automation), or whether it requires minor adjustments (such as removing extraneous text that the LLM has not flagged), in which case the process is classified as semi-automated and full automation is ruled out. After this step, the generated code and the resource monitoring results are combined into a single JSON object (see Example 3.1). The communication interface with the LLM concludes by transmitting the newly formed JSON object to the data management platform, which triggers the execution of a new PySpark job. The predefined dataset is then retrieved and processed accordingly, while the received JSON object is parsed to extract the generated code and the resource monitoring results.
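The hand-off described above (bundling the generated code with the monitoring data, then parsing it on the platform side) can be sketched in a few lines of Python. The key names `generated_code` and `resource_metrics`, the metric fields, and the sample code string are assumptions made for illustration; the actual layout is the one given in Example 3.1.

```python
import json


def build_payload(generated_code, resource_metrics):
    """Bundle the LLM's code and the monitoring data into one JSON string."""
    return json.dumps(
        {"generated_code": generated_code, "resource_metrics": resource_metrics}
    )


def parse_payload(payload):
    """Recover the code and the metrics on the data management side."""
    data = json.loads(payload)
    return data["generated_code"], data["resource_metrics"]


# Hypothetical values; the code string is only carried, never executed here.
code = "result = df.groupBy('category').count()"
metrics = {"cpu_percent": 41.5, "ram_mb": 5120, "generation_seconds": 12.3}

payload = build_payload(code, metrics)
restored_code, restored_metrics = parse_payload(payload)
```

Keeping the code as an opaque string inside the JSON object means the platform can decide when and how to execute it, independently of the LLM interface.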

The next step involves applying the generated data analysis code, which executes the specified commands on the loaded dataset in order to produce the desired analysis results. Once this stage is complete, the analysis results are exported. A final JSON object is created, which serves as a template for storing the results of each trial and includes the resource monitoring metrics of the LLM's server, the generated code, and other relevant attributes of the process. This object, together with the analysis results from the code execution stage, is stored locally. Both files are examined by the user, who assesses the analysis results to determine whether they are accurate and consistent with the intended outcome, based on the content of the original query. In either case, an indication of functional correctness is added to the trial's JSON result object, with the value 'True' for correct results and 'False' for incorrect ones. This step completes the trial flow, as presented in Figure 3.5.

As mentioned previously, five queries will be tested for each of the five datasets, with ten repetitions per query. This process will be repeated twice, once for each LLM, resulting in 500 trials in total: 250 for each model.
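The trial count follows directly from the experimental grid: 2 models × 5 datasets × 5 queries × 10 repetitions = 500. A short Python sketch makes the enumeration explicit; the dataset and query names are placeholders, and in practice each dataset has its own five queries.

```python
from itertools import product

models = ["Codestral", "Qwen 2.5 Coder"]
datasets = [f"dataset_{i}" for i in range(1, 6)]  # placeholder names
queries = [f"query_{i}" for i in range(1, 6)]     # placeholder: five per dataset
repetitions = range(1, 11)                        # ten repetitions per query

# Every (model, dataset, query, repetition) combination is one trial.
trial_plan = list(product(models, datasets, queries, repetitions))

# Count how many trials fall to each model.
per_model = {m: sum(1 for t in trial_plan if t[0] == m) for m in models}
```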

As already noted at several points, five different datasets were selected for the evaluation process in this study. Each dataset is briefly described in the prompt message sent to the LLM, as discussed in subsection 3.4.2. The code produced by the LLM is executed through a PySpark job, which loads the corresponding dataset and runs the generated code. Although Apache Spark, the underlying technology of the proposed platform, is designed for handling large-scale data, dataset size was not a selection criterion for this part of the thesis's evaluation. This is because the focus of this part is the LLM's ability to produce code for data analysis tasks, not the full processing of entire datasets. The data is not loaded into the LLM itself, which makes its size (essentially) irrelevant to the evaluation. Data volume did matter for the evaluation of the first part of this thesis, the efficient data management software. The present datasets were obtained from Kaggle and the Data Playground of Maven Analytics, two platforms that provide free and publicly accessible data.

In addition, as already noted, the offline LLMs selected for this study are Codestral by Mistral AI [52] and Qwen 2.5 Coder by Alibaba [53]. In this context, each local language model is set up to return responses in the form of code, which are then executed within the data management platform. During this study, the preference for Apache Spark [47] as the most suitable tool for managing and executing the data tasks was reinforced. As already documented, Apache Spark is a powerful platform optimized for high-performance computation. It supports in-memory computation, which significantly speeds up execution, especially in tasks with iterative computations. Spark is highly scalable and can efficiently handle large datasets through distributed computing systems. Moreover, its versatility is a significant advantage, as it supports a wide range of functions, such as batch processing, real-time streaming, and graph analysis, making it a comprehensive solution for scalable data workflows.

To improve the code generation capabilities of the Codestral and Qwen language models, the aforementioned specially designed prompt message was developed; it is used to initiate the interaction with each model, as shown in Listing 3.2. Each time the user submits a natural language query, it is combined with this message in order to guide the model towards producing accurate and relevant code. This approach is based on the 'few-shot' prompt engineering strategy, in which the model is given key characteristics of the dataset to be analyzed. The message includes a brief presentation of the structure, format, and columns of the dataset, as well as descriptions of each column. This ensures that the generated code is appropriately tailored to the specific characteristics of each dataset.
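As a rough illustration of how a dataset summary and a user query might be merged into one prompt, consider the Python sketch below. It is not the prompt of Listing 3.2: the function name `build_prompt`, the wording of the instruction line, the variable name `result`, and the sample dataset are all hypothetical.

```python
def build_prompt(user_query, dataset_name, file_format, columns):
    """Combine a dataset summary and a user query into one prompt string.

    `columns` maps each column name to a short description.
    """
    column_lines = "\n".join(
        f"- {name}: {description}" for name, description in columns.items()
    )
    return (
        f"Dataset: {dataset_name} ({file_format})\n"
        f"Columns:\n{column_lines}\n"
        # Hypothetical instruction text; the real wording is in Listing 3.2.
        "Instructions: return only PySpark code and store the final result "
        "in a variable named `result`.\n"
        f"Question: {user_query}"
    )


prompt = build_prompt(
    user_query="What is the average price per region?",
    dataset_name="sales",
    file_format="CSV",
    columns={"region": "Sales region name", "price": "Sale price in USD"},
)
```

Fixing the output variable name in the instructions is one way to let the downstream job locate the result without inspecting the generated code.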

Results and Conclusions

The proposal of this doctoral thesis, which concerns scalable data management and AI-enhanced analytical processing, can offer a promising alternative approach in the field of data analysis, standardization, and quality assessment. Its ability to analyze entire datasets, to support user-defined quality and analysis rules, and to operate independently of data volume makes it a useful framework for both data analysts and engineers. Furthermore, the conclusions drawn from its use are expected to highlight the benefits of AI-assisted dataset analysis, through the formulation of natural language queries and their automatic conversion into executable analytical code by a local LLM. The development of this framework consists of an initial implementation presented in a 2023 publication [1], as well as an extensive study published in 2025 [4].

Regarding the optimal data management software, a clear linear relationship was found between completion time and the size of the incoming data. Specifically, for every minute that passes, approximately 1GB of data is processed, filtered, cleaned, and stored. As the size of the dataset (in GB) increases, the completion time (in minutes) increases proportionally. This linear growth was also defined as the 'acceptance threshold' for a successful implementation. The framework was expected to operate at a rate of '1GB per minute' (given the constraints on computational resources), a goal that was achieved, meeting expectations. In addition, it is worth noting that this data management software showed no degradation in performance, consistently completing all tasks and tests to which it was subjected. Regardless of input size, the system remained fully functional. Furthermore, each Spark driver used only 5GB of RAM and 6 CPU cores from the system, as had been set from the outset as a 'constraint', highlighting the efficiency of the proposed architecture.
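The '1GB per minute' acceptance threshold amounts to a simple linear model, t ≈ size / rate. The Python sketch below shows one way such a check could be expressed; the 10% tolerance and the function names are assumptions introduced here. The dictionary uses the standard Spark property names `spark.driver.memory` and `spark.driver.cores` to express the stated limits, although how the limits were actually applied in the thesis is not specified in this summary.

```python
EXPECTED_MINUTES_PER_GB = 1.0  # the "1GB per minute" acceptance threshold


def expected_minutes(size_gb):
    """Completion time predicted by the linear model, in minutes."""
    return size_gb * EXPECTED_MINUTES_PER_GB


def meets_threshold(size_gb, measured_minutes, tolerance=0.10):
    """True if the measured time is within the (assumed) tolerance of the model."""
    return measured_minutes <= expected_minutes(size_gb) * (1 + tolerance)


# Resource limits quoted in the text, written as standard Spark properties.
# Whether the thesis set them exactly this way is an assumption.
driver_limits = {"spark.driver.memory": "5g", "spark.driver.cores": "6"}

ok_run = meets_threshold(size_gb=10, measured_minutes=10.5)
slow_run = meets_threshold(size_gb=10, measured_minutes=12.0)
```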

As for the data analysis system using local large language models, the main and final component of this thesis's framework, it showed clear signs of operational efficiency. The results highlighted the high performance of the two tested models, mainly in producing functional code scripts that align with the specified objectives. In the case of Codestral, 87% of the 250 test cases were fully successful, indicating both functional correctness and error-free execution. Correspondingly, Qwen 2.5 Coder achieved a full success rate of 80% on the same set of tests. Moreover, the code generated by the two models generally received readability scores between '2' and '3', which suggests that it was easy for humans to understand. Only 2 of Qwen's 250 trials received the lowest readability score of '1'. Regarding automation, Codestral achieved automation in 80% of its outputs, while Qwen exceeded that figure with 96.5%, reducing the need for human intervention. Overall, considering functional correctness regardless of minor errors or manual modifications, Codestral reached a success rate of 91%, while Qwen achieved 90%. These findings demonstrate the reliable code generation capability of both models, even in the context of generating PySpark code, which is usually more complex than plain Python.

Future study and extension of this thesis will focus on improving the code generation process, which lies at the core of the study, as the primary means for the end user to submit queries to the model seamlessly. The main priority will be the optimization of the Codestral and Qwen 2.5 Coder models, with the aim of achieving even higher performance. Should new local LLMs emerge with comparable or superior effectiveness, they may also be considered for evaluation. The main goal of the adaptation process will be to further increase the level of automation, by minimizing the generation of extraneous text and by ensuring that the final results are consistently stored in a predefined variable that is recognized by the data management software, as implemented in the present study.

Although both models showed high accuracy in producing error-free outputs and achieving correct final results, it is estimated that fine-tuning can further improve this performance. A fully adapted model running locally on a more powerful graphics card may perform better on almost all evaluation metrics, in a future study that continues the present thesis. However, appropriate safeguards must be taken when adapting the model, in order to avoid introducing biases, thereby preserving the model's ability to produce reliable and generalizable results across different data contexts.

In parallel, extending the proposed system to include a dedicated user interface (UI) could significantly increase its practical value. At present, the research framework (as depicted in Fig. 3.5) is designed primarily for evaluating the capabilities of the models in a controlled environment. Moving towards a market-ready solution will require developing a user-friendly interface that serves both technical and non-technical users. Such an interface would improve real-world applicability and usability. Beyond UI development, future work should consider integrating additional functions and evaluation components, which are necessary for turning the present study into a fully integrated solution (a digital product) for businesses and organizations. The system's independence from specific models and its scalable design provide a clear path for evolving the existing prototype into a fully functional platform suitable for enterprise applications.

In conclusion, this doctoral thesis could serve as the basis for future research efforts that seek to establish local large language models as a viable alternative for data analysis processes based on code generation from natural language queries. Instead of moving the data to the LLM, the present approach proposes moving the LLM's code to the data. The exploration of the applicability of LLMs in the field of Data Science is expected to continue. Future applications are likely to fully exploit the power of language models for optimal data processing and analysis. Over time, the present thesis could become one of the established methodologies for data analysis and management. One thing is certain: large language models will continue to expand their influence across all scientific fields, with Data Science being, quite possibly, a key area of integration. How deeply LLMs will be integrated into modern trends in data analysis remains to be seen.

Part I

Introduction

Chapter 1

Introduction: Towards Efficient Data Management and Analysis

1.1 Scalable Data Profiling for Quality Analytics Extraction

Portions of the following content, originally published in [2], have been adapted for inclusion in this PhD thesis to ensure coherence with its overall structure and narrative.

In today’s rapidly evolving technological landscape, data stands out as one of the most

valuable assets [6]. As advancements in technology occur with increasing frequency [7],

the creation and dissemination of data have become integral components of emerging

products, services, systems, and frameworks. Consequently, the volume of data gen-

erated on a daily basis worldwide is staggering. Recent estimates suggest that around

328.77 million terabytes of data are produced each day [8], including newly created,

captured, duplicated, or consumed information. This immense ﬂow of data has already

demonstrated substantial value across various domains.

Within the industrial sector, organizations — regardless of size or specialization —

are progressively acknowledging the critical role data plays in supporting evidence-based

decisions and maintaining a competitive advantage. In ﬁelds such as ﬁnance, healthcare,

and manufacturing, data drives innovation by powering advanced analytics, predictive

modeling, and automation technologies that streamline operations. Similarly, the market-

ing and retail industries utilize data to analyze consumer behavior, deliver personalized

experiences, and enhance targeting strategies. Beyond the private sector, government

agencies also capitalize on data to improve public services, design data-driven policies,

and promote economic growth. As the digital realm continues to expand, data emerges

as a vital resource that fuels development, increases eﬃciency, and supports progress

across a wide array of sectors.

Maintaining high data quality is essential for all data consumers. In the context of

data-driven decision-making, characteristics such as accuracy, reliability, and complete-

ness play a pivotal role in shaping the outcomes and insights produced by analytical

processes. Regardless of their speciﬁc role or domain (whether in business, research,

policymaking, analysis, science, or engineering), data consumers must depend on high-

quality information to make sound decisions and establish eﬀective strategies. When

data contains errors or inconsistencies, it can lead to misleading analyses and incorrect

conclusions, which may have serious implications. As such, gaining insight into the

quality of a dataset is not merely a best practice, but a fundamental necessity in today’s

data-centric environment. Considering these factors, there is a clear need for solutions

capable of eﬃciently handling datasets, while simultaneously providing comprehensive

assessments of data quality.

To address this need, various methodologies have been developed for extracting quality-

related insights from data. Among the most widely recognized is data proﬁling. While

deﬁnitions may vary slightly among sources, a common understanding exists. For exam-

ple, IBM deﬁnes data proﬁling as the process of reviewing and / or cleaning data, in order

to enhance the understanding of its structure, and also uphold data quality standards

within an organization [54]. The core aim of data proﬁling is to develop a detailed un-

derstanding of the dataset’s quality by applying techniques that summarize, assess, and

evaluate the data. This process typically involves measuring key quality dimensions such as accuracy, consistency, and timeliness, in order to detect issues such as inconsistencies, errors, or missing values. However, users may want further insights into a given dataset, deepening their understanding of its quality.
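To make the idea of profiling concrete, the following minimal Python sketch computes two simple quality indicators per column: the number of missing values (with the resulting completeness ratio) and the number of distinct values. It is a generic illustration under assumed inputs, not the profiling mechanism of the system proposed in this thesis; the sample rows and the treatment of empty strings as missing are choices made here for the example.

```python
def profile(rows, columns):
    """Return per-column missing counts, completeness, and distinct counts."""
    report = {}
    total = len(rows)
    for column in columns:
        values = [row.get(column) for row in rows]
        # Treat None and empty strings as missing (an assumption of this sketch).
        present = [v for v in values if v not in (None, "")]
        report[column] = {
            "missing": total - len(present),
            "completeness": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample data.
rows = [
    {"id": 1, "city": "Athens", "temp_c": 21.5},
    {"id": 2, "city": "", "temp_c": None},
    {"id": 3, "city": "Athens", "temp_c": 19.0},
]
report = profile(rows, ["id", "city", "temp_c"])
```

User-defined quality rules, as discussed in the text, would extend such a report with checks specific to the dataset at hand.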

Considering all the aforementioned factors, there is a strong need to explore a scal-

able, or volume-independent, data analytics solution that empowers users to deﬁne cus-

tom quality queries with ease, tailored to their speciﬁc analytical needs. This PhD thesis

introduces a software system designed to handle datasets in a scalable manner, and to

perform comprehensive quality assessments across entire datasets with the assistance

of AI language models. The proposed system aligns with established deﬁnitions of data

proﬁling by extracting quality-related insights from datasets, while oﬀering the ﬂexibil-

ity for users to specify their own quality criteria. It is capable of applying its proﬁling

mechanisms across full datasets, regardless of volume, and operates eﬃciently even in

low-resource computing environments, ensuring both scalability and performance. The intuitive manner in which the data analysis is carried out is also the main highlight of this thesis.

The following two sections present an introduction to the system’s scalable data man-

agement infrastructure, as well as its AI-powered data analysis approach.

1.2 A Scalable Data Management and Interoperability Solution

The content presented here is based on work published in [1], with a series of adjust-

ments made to maintain consistency throughout the dissertation.

A need for scalable big data management solutions is evident in the field of critical infrastructures (CIs). This term refers to the systems, networks, and assets that are vital to the functioning of modern society and its key sectors. These include, but are not limited to, energy, transportation, water supply, telecommunications, emergency response, finance, healthcare, agriculture and food supply, government facilities, and information technology [9]. Given their foundational role, any disruption to these infrastructures could lead to societal and economic consequences [10]. It is therefore appropriate to regard CIs as the pillars upon which the global economy is built. From a data perspective, CIs are prolific producers of information, generating large volumes of data daily across domains such as operations, monitoring, maintenance, security, and more. As digital transformation and interconnectivity continue to advance, the rate at which CI-related data is generated is increasing rapidly. Effectively managing and utilizing this data is essential for optimizing performance, strengthening security, and supporting informed decision-making.

To properly handle - but also comprehend - the sheer volume and complexity of data

produced within the CI sector, the integration of data-driven intelligence seems imper-

ative. This approach requires aggregating diverse data streams, originating both from

the infrastructure systems themselves and from the stakeholders involved in their oper-

ation and oversight. Many CI domains are characterized by a multitude of data sources,

as illustrated in Figure 1.1. A prominent example can be found in port infrastructures,

which belong to the transportation sector. Furthermore, the adoption of Internet of Things

(IoT) technologies has accelerated the implementation of smart logistics and sensor-based

monitoring systems within CI environments. These advances contribute to the continu-

ous production of high-volume heterogeneous data that includes communications, sensor

readings, and control signals.

Figure 1.1. The 16 critical infrastructure sectors of global industry and economy [1].

Such data streams support real-time decision-making, and also enable eﬃcient infor-

mation exchange among various entities in the CI supply chain. Notably, data in CIs is

collected through a variety of mechanisms and is stored in multiple formats, including

structured, semi-structured, and unstructured forms. Considering that there are sixteen

designated CI sectors, it becomes evident that the scale of data generated is immense.

Consequently, ensuring the secure and effective management and analysis of this data is both a critical challenge and a necessity. Exploring the need for efficient and scalable data management solutions through the prism of critical infrastructure environments is a sound approach, since these environments serve as a demanding proving ground for such solutions. Solutions that successfully meet the complex requirements of CI sectors inherently demonstrate their capability to excel in less demanding operational contexts, making CI an ideal test case for advancing data management technologies.

1.2.1 Data Management Challenges

Before delving into research focused on scalable data solutions, applied to CIs, it is

essential to ﬁrst acknowledge and address the broader challenges associated with scalable

data management and analytics. Extracting valuable insights from large-scale datasets

is a complex task, presenting a range of obstacles that both researchers and industry

professionals should navigate, in order to support data-informed decision-making. These

challenges span the entire data lifecycle, including data acquisition, storage, processing,

analysis, and interpretation. What follows is an overview and classiﬁcation of the key

challenges encountered in the tasks of data management and analysis:

•Volume, Velocity, Variety, Veracity, and Value: Also known as the famous ’5 Vs’ of

big data, since they highlight its key characteristics and challenges. Volume reﬂects

the size of data sets, which may span terabytes to exabytes, often requiring signiﬁ-

cant storage capacity and specialized processing systems. Velocity refers to the high

speed at which data are generated, particularly in domains like IoT or social media,

making real-time or near-real-time analysis a demanding task. Variety points to

the diverse formats of data, including structured, semi-structured, and unstruc-

tured types. Examples of such types are text, images, video, and sensor data.

This diversity sometimes complicates integration and analysis. Veracity relates to

the reliability and quality of data, as inconsistencies, noise, or missing information

can lead to inaccurate results, requiring thorough validation and cleaning. Finally,

Value represents the potential to extract meaningful insights from data, though

doing so is not guaranteed and often depends on advanced analytical techniques,

such as machine learning, data mining, and other novel approaches (such as the

current thesis’s study).

•Data Management and Infrastructure Challenges: These challenges involve ad-

dressing the rapid increase in data volume, while ensuring that systems can scale

eﬀectively. Managing resources like CPU, GPU, memory, and storage requires care-

ful planning and optimization to maintain performance. Storing, indexing, and

retrieving large amounts of data across distributed storage systems adds further

complexity, especially at scale. A key ongoing challenge is balancing the need for

high-performance data processing with the costs and limitations of the supporting

infrastructure.

•Data Integration and Processing Challenges: These arise from the need to apply

advanced algorithms and models for analyzing volume-agnostic data, which often

requires specialized knowledge in areas like machine learning, statistics, and data

science. Integrating data from various sources, each with diﬀerent formats and

structures, demands reliable tools and methods to ensure consistency and compat-

ibility. Extracting useful insights from large and complex datasets through explo-

ration and visualization adds another layer of diﬃculty. Moreover, ensuring smooth

interoperability among the diﬀerent tools and systems used in big data workﬂows

remains an ongoing challenge.

•Data Security, Ethics, and Governance Challenges: These involve managing the

risks associated with storing and processing sensitive information, which requires

strong safeguards to protect data conﬁdentiality, integrity, and compliance with

regulations such as the GDPR. Ethical and legal issues, including data privacy,

algorithmic bias, and the responsible use of data, must also be carefully addressed.

Implementing eﬀective data governance practices is essential to ensure data quality,

security, and adherence to legal standards. In addition, the ongoing shortage of

skilled professionals capable of working with scalable data frameworks adds another

layer of diﬃculty to managing these challenges eﬀectively.

•Environmental and Long-term Challenges: These challenges relate to the energy

demands of large-scale data centers and computational resources used in big data

processing, raising concerns about environmental sustainability. Managing the

long-term preservation of big data is also diﬃcult, as evolving technologies and

changing data formats require reliable strategies to ensure that data remains ac-

cessible and usable for future analysis.

Addressing the challenges of big data analysis requires a multidisciplinary eﬀort that

brings together expertise in computer science, data engineering, machine learning, and

domain-speciﬁc knowledge, while also encouraging collaboration among diﬀerent stake-

holders. As technology advances, new issues are likely to arise, making it crucial for both

researchers and professionals to remain informed and adaptable in the evolving ﬁeld of

big data.

To better understand the scale of data produced by domains like critical infrastruc-

tures, it is helpful to look at real-world examples. In the energy sector, for instance, the

power grid in the United States alone is estimated to generate between 100 and 1,000

petabytes of data annually [55, 56]. Globally, the energy sector produces an even greater

volume, with estimates ranging from 100 to 200 exabytes per year [57]. This data in-

cludes information on power generation, transmission, distribution, weather conditions,

equipment performance, and consumer usage patterns. Similarly, transportation systems

produce several petabytes of data each year, capturing traﬃc ﬂow, vehicle emissions, and

travel behavior [58, 59]. The telecommunications sector also generates vast amounts

of information, including call records, messaging activity, and internet usage [60]. As

critical infrastructure sectors increasingly adopt sensors, smart devices, and other IoT

technologies, the volume of data they produce will continue to grow. These developments

oﬀer new opportunities to improve the eﬃciency, reliability, and security of vital systems

through data-driven insights.

Effectively managing scalable volumes of data is an interesting field of research that helps organizations, companies, and critical infrastructures maintain their security and resilience. By collecting, storing, and analyzing large amounts of data, these domains can enhance their operational efficiency, identify and mitigate potential risks, and improve their responses to incidents. Applying scalable data techniques to process such data offers the potential to automate decision-making and manage tasks more efficiently within the domains' workflows. For instance, in any given critical infrastructure sector, integrating not only operational but also global data from various stakeholders (along its value chain) could significantly support its growth and development. The existing literature provides several solutions that address the harmonization, interoperability, processing, filtering, cleaning, and storage of data. Although several scalable data management systems have been proposed, there is still a research gap for a system that combines management with harmonization and AI-enabled analytics operations. This gap has motivated the development of a new proposal, aimed at unlocking the potential of data and enabling end-users to make profiling and analytics queries with ease, while also optimizing resource use and improving infrastructure management.

1.2.2 Efficient Data Management Layer Proposal

The gap between proper scalable data handling and the utilization of that data, by both the organizations that generate it and external beneficiaries, can be bridged by this study's efficient and scalable data management layer proposal. Conceptually, it is a middleware framework, consisting of a software infrastructure that enables volume-agnostic data management. Initially tailored to relieve CIs from the time-consuming task of properly and wholly managing the data generated within their sectors, it has evolved into a software tool with applications across various domains. Its early conceptual form was presented in 2022 [3]. The theoretical model proposed was based on the use case of smart ports (and therefore the CI transportation sector, to which ports belong), studied within the context of a European Research Project [61]. This original approach has been expanded, restructured, improved, and implemented, leading to the current scalable data management framework.

The proposed software tool has the potential to enhance the beneﬁts extracted from

the data generated by organizations, critical infrastructures and other domains. As men-

tioned previously, the framework can hold promise for the business sector as well. The

advantages of improving data interoperability and management extend beyond the eco-

nomic gains of CI authorities, beneﬁting a wide range of stakeholders and leading to

improved services. Telecommunications operators, for example, play a crucial role in

this growing data-driven market, as they can leverage their extensive data resources. By

providing data or services through APIs, they oﬀer valuable insights that aid decision-

making for external stakeholders, including public authorities, municipalities, shipping

companies, transportation agencies, cultural and trade associations, and more. Thus,

this study’s scalable data management layer can assist beneﬁciaries coming from several

kinds of ﬁelds.

1.3 Large Language Models and their Impact in Modern Applications

This research, previously published in [4], has been selectively modiﬁed to align with

the thematic and scientiﬁc requirements of the present PhD thesis.

1.3.1 Applications and Impact of LLMs

The rapid development of Large Language Models (LLMs) has created new opportu-

nities for research and practical applications across a variety of ﬁelds. The advanced

capabilities of these models have sparked signiﬁcant interest in the global research com-

munity, leading to increased investments aimed at advancing LLM-based applications.

Notable LLMs, such as OpenAI’s GPT-4 [11] and Google’s Gemini [12], have reached a

level of sophistication where they can understand, generate, and manipulate human lan-

guage to an extraordinary degree. As a result, these models are now being applied in a

wide range of domains, including healthcare and engineering. These advancements have

set new standards in both research and industrial applications, with a primary focus on

enhancing the accuracy of the models’ outputs [13] [14].

The scope of LLM applications extends well beyond traditional natural language pro-

cessing. In scientiﬁc research, for instance, LLMs are accelerating discoveries in ﬁelds

such as medicine, biology, materials science, and forensics. In drug discovery, LLMs can

help expedite the development of new therapeutic agents by predicting molecular proper-

ties and simulating molecular interactions. In computational chemistry, these models are

being used to improve the description of chemical processes, while in materials science,

LLMs assist in designing new materials with customized properties [62]. Additionally,

integrating LLMs into forensic science could enhance modern investigations and aid law

enforcement agencies in identifying suspects [63]. These are just a few examples of how

LLMs are revolutionizing scientific research and industrial innovation, with the potential

for a significant leap toward more intelligent and efficient problem-solving across all

sectors [64].

Moreover, Large Language Models have shown signiﬁcant potential in enhancing var-

ious aspects of Data Science. For instance, OpenAI’s GPT-4 can perform tasks such as

data cleaning, feature extraction, and even model training with minimal human involve-

ment [15] [16]. In data analysis, LLMs oﬀer the advantage of automating the generation

of insights from datasets. These models can carry out traditional data proﬁling tasks,

detect anomalies, and even provide predictive analytics for future use by data engineers

and other professionals. Consequently, LLMs could be applied to processing customer

feedback, ﬁnancial reports, and social media data to extract actionable insights. Their

ability to understand and interpret various types of datasets makes them valuable tools

for data scientists [17] [18].

Despite their impressive capabilities and vast potential in Data Science (as well as

other domains), LLMs face considerable challenges in providing quality analytics on large

datasets. A major limitation is their inability to process entire datasets in one go, due

to memory and computational constraints. Most LLMs can only handle a limited context

window, which can lead to the loss of context, incomplete insights, and inaccurate results

[65]. Additionally, LLMs may struggle to comprehend complex data relationships within

a dataset, as their training has primarily focused on understanding natural language

rather than complex data structures [66] [67]. These limitations can aﬀect the quality

and precision of the analytics produced by an LLM. Therefore, while models like GPT-

4 exhibit advanced natural language processing capabilities, their application in data

analytics requires further development, particularly for complex tasks such as detailed

data proﬁling, in-depth analysis, and quality insight extraction [68].
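To make the context-window limitation concrete, the following back-of-the-envelope sketch estimates how many tokens a tabular dataset would occupy if it were serialized into a prompt. The ratio of roughly 4 characters per token and the 8,192-token window are illustrative assumptions, not properties of any specific model.

```python
# Rough estimate of whether a tabular dataset fits into an LLM context window.
# ASSUMPTIONS (illustrative only): ~4 characters per token, 8,192-token window.

CHARS_PER_TOKEN = 4      # assumed average; real tokenizers vary
CONTEXT_WINDOW = 8_192   # assumed window size, in tokens


def estimate_tokens(n_rows: int, avg_row_chars: int) -> int:
    """Approximate token count of a dataset serialized as plain text."""
    return (n_rows * avg_row_chars) // CHARS_PER_TOKEN


def fits_in_context(n_rows: int, avg_row_chars: int) -> bool:
    """True if the serialized dataset is estimated to fit in the window."""
    return estimate_tokens(n_rows, avg_row_chars) <= CONTEXT_WINDOW


if __name__ == "__main__":
    for rows in (100, 10_000, 1_000_000):
        tokens = estimate_tokens(rows, avg_row_chars=80)
        print(f"{rows:>9} rows -> ~{tokens} tokens, fits: {fits_in_context(rows, 80)}")
```

Under these assumptions, only the smallest of the three example datasets fits in the window, which is why passing whole datasets to the model does not scale.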

In addition, the sensitive nature of many datasets further complicates the use of on-

line LLMs — such as OpenAI’s GPT — for data analysis and proﬁling. For instance, some

datasets contain personal information, proprietary business data, or sensitive research

ﬁndings, which are subject to strict regulations, including the General Data Protection

Regulation (GDPR) and various data privacy laws [69]. The security risks involved in pro-

cessing such data online raise signiﬁcant concerns, including unauthorized access, data

breaches, and the potential misuse of information [70]. Moreover, ethical issues related to

the use of sensitive data require strict adherence to consent and usage guidelines, which

can be diﬃcult to enforce when using online LLMs. For these reasons, using online LLMs

for proﬁling and analyzing sensitive data should be avoided.

1.3.2 Exploring Oﬄine LLMs in Data Analysis

The concerns highlighted above stress the importance of secure, on-premise solutions

to maintain data integrity and privacy. Organizations that handle sensitive data should

still be able to take advantage of the capabilities of large language models. End-users

could greatly beneﬁt by simply querying data in natural language and receiving visualized

outputs or direct data samples in return. For example, instead of manually selecting

speciﬁc columns and applying predeﬁned rules through a specialized interface, end-users

and data scientists could simply write queries in natural language and view the results

on their screens as they normally would. In short, oﬄine LLMs can signiﬁcantly assist

data scientists in data analysis and insight extraction, streamlining the querying process

and reducing eﬀort.

This leads to the conclusion that businesses, organizations, and enterprises should

consider adopting oﬄine LLM solutions. A main part of this PhD thesis explores how

oﬄine LLMs can enhance data analysis, by generating code for data analytics tasks based

on natural language queries. Keeping the data on-premise avoids the integrity and security

risks of online processing, since no data is transmitted to externally hosted LLMs.

Moreover, this approach addresses the limitation of LLMs in processing large datasets,

due to their architectural and computational constraints. In the proposed method, the

LLM does not load entire datasets. Instead, it works with a comprehensive summary.

For each data proﬁling / analysis query, the LLM generates specialized code, which is

executed by a software job on the full dataset (running at the scalable data management

layer) to produce the ﬁnal results. Thus, both data integrity and processing limitations

are eﬀectively managed.
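The summary idea described above can be sketched as follows: instead of sending rows to the model, a compact per-column description is built and placed in the prompt. The function names (`summarize_csv`, `build_prompt`), the summary fields, and the sample values are hypothetical; the summary format actually used by the framework is not specified in this section.

```python
import csv
import io


def summarize_csv(text: str, max_examples: int = 3) -> dict:
    """Build a compact per-column summary of a CSV document.

    Only column names, a coarse type guess, null counts and a few example
    values are kept; the full set of rows never reaches the prompt.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: {"nulls": 0, "numeric": True, "examples": []}
               for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, value in row.items():
            if name not in columns:      # ignore surplus fields in malformed rows
                continue
            info = columns[name]
            if value is None or value.strip() == "":
                info["nulls"] += 1
                continue
            try:
                float(value)
            except ValueError:
                info["numeric"] = False
            if len(info["examples"]) < max_examples and value not in info["examples"]:
                info["examples"].append(value)
    return {"rows": n_rows, "columns": columns}


def build_prompt(summary: dict, question: str) -> str:
    """Turn the summary and a natural language question into a prompt (hypothetical layout)."""
    lines = [f"Dataset with {summary['rows']} rows. Columns:"]
    for name, info in summary["columns"].items():
        kind = "numeric" if info["numeric"] else "text"
        lines.append(f"- {name} ({kind}, {info['nulls']} nulls), e.g. {info['examples']}")
    lines.append(f"Task: write code that answers: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = "port,teu,delay_h\nPiraeus,5400,1.5\nValencia,,0.0\nPiraeus,5100,\n"
    print(build_prompt(summarize_csv(sample), "average delay per port"))
```

The prompt size depends on the number of columns rather than the number of rows, so it stays small however large the dataset grows.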

The proposed on-premise solution consists of an oﬄine LLM that is integrated with the

outlined data management software infrastructure. The data remains within the organi-

zation’s control, ensuring security. For each dataset, the LLM is provided with concise

metadata, allowing it to understand the context of the data. Whenever the organization’s

data scientist submits a query in natural language, the LLM generates specialized code

for data proﬁling and analysis. This code is then applied to the data through a pipeline,

with the ﬁnal results returned to the user. As previously mentioned, this approach not

only ensures data security but also enhances the eﬃciency of data analysis, addressing

two key limitations of LLMs. Additionally, it simpliﬁes the process for end-users to submit

their data analysis preferences, as they can do so using natural language.
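The flow described above (metadata and query in, generated code out, execution on the full data) can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis's implementation: `stub_llm` is a hypothetical stand-in that returns fixed code instead of calling a real offline model, the `analyze(rows)` convention is invented for the example, and the restricted `exec` namespace is not a real sandbox.

```python
from typing import Callable


def stub_llm(prompt: str) -> str:
    """HYPOTHETICAL stand-in for an offline LLM: always returns the same analysis code."""
    return (
        "def analyze(rows):\n"
        "    values = [r['delay_h'] for r in rows if r['delay_h'] is not None]\n"
        "    return {'mean_delay_h': sum(values) / len(values), 'n': len(values)}\n"
    )


def run_generated_code(code: str, rows: list) -> dict:
    """Execute generated code against the full dataset.

    The generated code is assumed to define `analyze(rows)`. Only a handful of
    builtins are exposed; this limits accidents but is NOT a security boundary.
    """
    namespace = {"__builtins__": {"sum": sum, "len": len, "min": min, "max": max}}
    exec(code, namespace)
    return namespace["analyze"](rows)


def answer_query(question: str, metadata: str, rows: list,
                 llm: Callable[[str], str] = stub_llm) -> dict:
    """Metadata and question go to the model; the data itself goes only to the job."""
    prompt = f"{metadata}\nQuestion: {question}"
    code = llm(prompt)
    return run_generated_code(code, rows)


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    data = [
        {"port": "A", "delay_h": 1.5},
        {"port": "B", "delay_h": None},
        {"port": "A", "delay_h": 0.5},
    ]
    print(answer_query("What is the mean delay?", "columns: port, delay_h", data))
```

Note that the rows are passed to `run_generated_code` only; the model sees the metadata string and the question, which mirrors the separation between the LLM and the data management layer described in the text.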

Part II

Main Analysis

Chapter 2

Literature Review: Assessing Similar and Existing Approaches to Data Management and

AI-Powered Analysis

2.1 Generic Approaches to Scalable Data Management and Analytics

2.1.1 Data Proﬁling and Quality Analysis

The topic of data proﬁling and analysis has been widely explored, with numerous

implementations, studies, and software solutions available. Data sampling has proven

to be an eﬀective and accurate method, supporting exploratory analysis, proﬁling, and

model prototyping on manageable data subsets. This enables the extraction of useful

insights and the validation of predeﬁned methods, before scaling up to full datasets.

However, when it comes to proﬁling large-scale datasets and applying proﬁling techniques

to entire tabular data collections, research efforts are more limited, with fewer comparable

approaches available.

One of the most relevant works in the area of full-scale data proﬁling for big data

is the approach presented by Dai in 2016 [71]. In their study, the authors conducted

an extensive review of the existing literature, exploring how diﬀerent works deﬁne data

proﬁling, and oﬀering updated classiﬁcations for proﬁling tasks. They also evaluated

various proﬁling tools available at the time, including both open-source and commercial

solutions. Moreover, the paper introduced a set of new data quality metrics, as well as a

method for computing data quality scores. Finally, the authors proposed a framework for

data proﬁling tailored to big data environments.

In their 2017 publication, Abedjan [72] underscored the vital role of data proﬁling

and analysis in a wide range of data-centric applications. The authors provided an in-

depth overview of existing proﬁling techniques and systems, while also addressing key

challenges in the ﬁeld, such as developing algorithms for discovering data dependencies

and adapting proﬁling methods to dynamic data streams. They also stressed the grow-

ing relevance of data proﬁling in the big data era, a point that has become even more

signiﬁcant in today’s data-driven landscape. However, their study mainly concentrated

on the conceptual and technical aspects of data proﬁling, without speciﬁcally examining

its application to large-scale big data environments. In fact, they identiﬁed this gap as a

direction for future investigation.

An important contribution in this area is the work by Taleb in 2019 [73], who presented

a comprehensive framework for data quality proﬁling, designed speciﬁcally for big data

contexts. The proposed model included multiple components, such as sampling, proﬁling,

exploratory quality analysis, and a repository for quality proﬁles. A series of experiments

were conducted to evaluate various parts of the framework, aiming to improve the overall

quality of big data.

In their 2020 survey, Liu and Zhang [74] [75] explored the application of sampling

techniques for data proﬁling in big data environments, analyzing their use across various

proﬁling categories. Their experimental results demonstrated that insights derived from

sampled data can closely approximate, or even outperform, those obtained from proﬁling

entire datasets. As a result, the study emphasized the growing importance of sampling

technologies in the big data domain, suggesting that sampling will likely become a core

component of scalable data processing in the future. This perspective highlights the

eﬃciency and practicality of sampling-based proﬁling.
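As a small illustration of sampling-based profiling, the sketch below computes the same profile (null ratio, mean, standard deviation) on a full synthetic column and on a uniform random sample of it. The data, the sample size and the function names are assumptions made for the example; they do not reproduce the experiments of the cited survey.

```python
import random
import statistics


def profile(values: list) -> dict:
    """Basic single-column profile: null ratio, mean and population stdev."""
    present = [v for v in values if v is not None]
    return {
        "null_ratio": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "stdev": statistics.pstdev(present),
    }


def sample_profile(values: list, sample_size: int, seed: int = 0) -> dict:
    """Profile a uniform random sample (without replacement) of the column."""
    rng = random.Random(seed)
    return profile(rng.sample(values, sample_size))


if __name__ == "__main__":
    # Synthetic column: every 20th value is missing, the rest cycle through 0..99.
    column = [None if i % 20 == 0 else float(i % 100) for i in range(100_000)]
    print("full  :", profile(column))
    print("sample:", sample_profile(column, sample_size=5_000))
```

On this synthetic column the sample touches 5% of the values, and its statistics should land close to those of the full column. The trade-off is that rare values and outliers may be missed by the sample.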

Complementing this line of research, Elbaghazaoui conducted an extensive theoretical

analysis in 2021 [76], where the authors highlighted the importance of data proﬁling, and

examined its role speciﬁcally in big data contexts. Their work reviewed a wide range of

use cases, systems, and methodologies, while identifying key challenges to be addressed

in future research. Their ﬁndings oﬀer valuable theoretical foundations for developing

scalable proﬁling solutions. It should be noted that the researchers did not present a

comprehensive method for eﬀectively proﬁling entire large-scale datasets.

Finally, an analysis comparable to that of Elbaghazaoui was conducted by Couto in

2022 [77]. The authors emphasized the critical need for data scientists to develop a

deep understanding of the data they work with, an understanding that must be continuously

updated in response to the dynamic and evolving nature of big data ecosystems.

Motivated by this observation, they carried out a thorough review of the existing litera-

ture, focusing on the application of data proﬁling within big data contexts. Their survey

examined emerging trends in scalable data proﬁling, classifying and evaluating recent

advancements in the ﬁeld. This included an analysis of widely adopted tools, real-world

scenarios, representative datasets, and the types of metadata and information extracted

through proﬁling. The study concluded with a discussion on future research challenges

and directions. As with the work of Elbaghazaoui, Couto’s survey provides a valuable

theoretical foundation that can signiﬁcantly support the initial phases of developing a

robust and scalable data proﬁling framework, aimed at quality analytics extraction.

2.1.2 User-Deﬁned Quality Rules

When the data sources are stable and the use case is well-known, quality measure-

ment can be eﬀectively handled using pre-conﬁgured quality requirements, based on

common dimensions and metrics. However, in cases where diﬀerent users work with

diverse data sources and have varying needs, it is more beneﬁcial to allow the creation of

custom analytics rules and metrics in a ﬂexible, dynamic manner.

Thus, it is important to consider the ability of the end-user (whether a company, data

engineer, or consumer) to define or submit their own quality standards and rules with

ease, thereby gaining deeper insights into speciﬁc quality aspects they deem crucial. In

a 2020 publication, Anastasija Nikiforova [78] introduces a data object-driven approach

to evaluating data quality. She proposes replacing the concept of "quality dimensions"

with "quality requirements," which eﬀectively adds an additional layer to the traditional

approach. These requirements are user-deﬁned, tailored to the speciﬁc purposes of the

data users.

Similarly, in a 2024 study by Altendeitering [79], the researchers present a software

reference architecture for data quality tools, based on a comprehensive review of ten diﬀer-

ent (and pre-existing) data quality tools. This architecture includes a module for creating

"user-deﬁned rules" within an interaction layer. These rules serve as an extension of

standard data quality rules, but incorporate domain-speciﬁc constraints. The "Rule Gen-

eration" module is responsible for executing these user-deﬁned constraints, enhancing

the ﬂexibility and customization of data quality evaluation.

Allowing users to contribute their perspective to the data quality evaluation process is

a clear advantage, as it enables the achievement of the quality levels needed for speciﬁc

scenarios. When developing a comprehensive and dynamic software tool, it is crucial

to consider the user as a key stakeholder in the data quality evaluation process. The

definition of user-defined quality rules can be formalized and standardized, so that they

become an integral part of the data quality analysis pipeline, and so that the analytics

queries users submit can be transparently understood and processed by the software platform.
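One possible way to formalize user-defined quality rules is as small declarative objects that a pipeline can evaluate uniformly. The sketch below is an assumption-laden illustration: the `QualityRule` structure, its field names and the example rules are hypothetical and are not taken from the cited works or from the framework itself.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class QualityRule:
    """A user-defined rule: a name, the column it targets, and a per-value check."""
    name: str
    column: str
    check: Callable[[Any], bool]


def evaluate(rules: list, rows: list) -> dict:
    """Return, for each rule, the fraction of rows whose value satisfies it."""
    report = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row.get(rule.column)))
        report[rule.name] = passed / len(rows) if rows else 0.0
    return report


# Example rules a user might submit (hypothetical column names).
RULES = [
    QualityRule("teu_present", "teu", lambda v: v is not None),
    QualityRule("delay_non_negative", "delay_h", lambda v: v is not None and v >= 0),
]

if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        {"teu": 5400, "delay_h": 1.5},
        {"teu": None, "delay_h": 0.0},
        {"teu": 5100, "delay_h": -2.0},
        {"teu": 4800, "delay_h": None},
    ]
    print(evaluate(RULES, rows))
```

Because every rule shares the same shape, rules submitted by different users can be stored, validated and executed by the same pipeline step without special cases.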

2.2 Related Scalable Data Management Proposals and Solutions

2.2.1 EU Research Projects and Initiatives

The primary testing of the proposed scalable data management framework has been

conducted in ports. Thus, the related projects reviewed here are those linked to the maritime industry.

Numerous initiatives have been undertaken at the European level, with some remaining

actively operational, all aimed at creating a comprehensive ecosystem centered around

ports. European entities and associations, such as ESPO [80], IAPH [81], and AIVP [82],

have led eﬀorts to establish connections and advocate for port authorities, while promot-

ing relationships with the European Union and other nations. Their critical role in global

trade places them at the forefront of smart port development. Additionally, in collabora-

tion with several EU ports, ENISA [83] has published a report providing valuable insights

into the cybersecurity strategies employed by port authorities and terminal operators,

further enhancing their contributions.

Moreover, the European Union (EU), especially through Horizon research programs,

has allocated signiﬁcant funding to a range of projects focused on the future of EU ports.

These projects are designed to create comprehensive management platforms [84], tailored

to maritime and port environments, with the overarching goal of promoting interoper-

ability and advancing ports into cognitive, intelligent entities. Notable examples, such as

the SmartCities project, have culminated in the Marketplace of the European Innovation

Partnership on Smart Cities and Communities [85]. Projects like e-Mar, FLAGSHIP, and

INMARE are dedicated to improving maritime transportation, while the MASS initiative

focuses on enhancing human conduct aboard ships, particularly in emergency scenarios.

MARINE-ABC showcases the potential of mobile ship-to-shore communication.

Meanwhile, the BigDataStack project [86] aims to streamline cluster management

for data-related operations. However, it does not provide a comprehensive solution for

big data interoperability, harmonization, and management. The SmartShip initiative

[87] focuses on developing data analytics-based decision support systems, along with

an optimization platform built on the principles of a circular economy. These collective

eﬀorts, alongside other similar initiatives, highlight the shared goal among the research

community, port authorities, shipping entities, and supply companies: the creation of

an ecosystem enriched with advanced data-centric services, ultimately beneﬁting both

critical domains (acting as data providers) and local communities. In line with this mo-

mentum, the European maritime sector is advancing to deliver seamlessly integrated,

high-quality services as part of the broader European transportation network.

2.2.2 Research Work on Scientiﬁc Literature

Apart from initiatives of the European Union (through research projects), several pro-

posals for scalable data analysis and management in critical infrastructure sectors have

been published in recent years. In 2014, Baek [19] proposed a cloud-based framework

for managing big data in smart grids, aiming to optimize the generation, distribution,

and consumption of electricity. Smart grids, which integrate digital technology to en-

hance eﬃciency and sustainability, generate large data streams from intelligent devices

like power assets and smart meters. The framework introduced a hierarchical network of

cloud centers, in order to provide computing services for data management and analysis.

While cloud computing oﬀers advantages such as scalability, cost-eﬃciency, and energy

conservation, the proposal did not include a performance evaluation, focusing mainly on

data handling and cleaning in smart grids.

Lockow (2015) [20] explored the generation of big data in the automotive industry,

which spans CI sectors like transportation and critical manufacturing, and the need for

eﬀective management of large data volumes. The paper surveys use cases and applica-

tions of Apache Hadoop [21] in the automotive sector. Hadoop’s scalability in computing

and storage has made it a standard for scalable data processing, with a robust ecosystem

supporting parallel, in-memory, and stream processing, SQL/NoSQL engines, and ma-

chine learning. Key topics addressed include suitable applications for Hadoop, managing

diverse frameworks in a multi-tenant cluster, integration with relational systems, security

considerations, and performance benchmarks.

A study conducted by Dinov in 2016 [22] examined the challenges and opportunities in

modeling and interpreting big healthcare data, particularly within the healthcare and

public health CI sectors. Managing and processing extensive healthcare data is costly

and complex. The study outlines the challenges of integrating complex healthcare data

with advanced analytical tools and distributed computing. Using examples like imaging,

genetic data, and healthcare information, the paper demonstrates how heterogeneous

datasets can be processed using cloud services, automated classiﬁcation methods, and

open-science protocols. While advancements have been made, the author emphasizes the need

for continuous development of innovative technologies to optimize data management,

with the study suggesting a multifaceted approach involving proprietary, open-source,

and community-driven technologies.

Kaur (2019) [23] proposed a big data-capable framework for energy-eﬃcient, software-

deﬁned data centers (SDDCs) in IoT environments. SDDCs use virtualization and intel-

ligent management to reduce energy consumption, while maintaining performance and

scalability. With the rapid growth of IoT and the resulting inﬂux of big data, real-time data

analysis and processing are crucial for cloud platforms. However, data centers, essential

to cloud infrastructure, face high energy costs and environmental impact. The paper

introduces an SDDC-based model to optimize virtual machine deployment and network

bandwidth, aiming to reduce energy consumption while ensuring quality of service. The

proposal does not speciﬁcally focus on scalable data management and harmonization,

but the proposed approach is relevant to the current work.

Bhat’s publication in 2021 [88] highlights the challenges of managing agricultural and

food supply chain data using blockchain technologies. It proposes a future architecture

for this purpose. The study addresses issues like storage and scalability optimization,

interoperability, security, and privacy concerns, particularly in single-chain systems. It

also explores security threats associated with IoT infrastructure, as well as blockchain-

based defense mechanisms. The paper concludes by discussing the key features of the

proposed architecture and future considerations. While blockchain is recognized for

enhancing transparency in agri-food supply chains, the paper emphasizes the importance

of assessing the limitations of these technologies, such as IoT, RFID, and NFC, to ensure

their successful adoption and scalability.

Finally, Donta (2023) [24] explores the use of Distributed Computing Continuum Sys-

tems (DCCS) for big data processing, focusing on data from edge devices like IoT and

sensor nodes. The study addresses challenges related to diverse data formats and at-

tributes, leveraging big data analytics tools. It also outlines existing monitoring tools

and their features. The authors propose a governance and sustainable architecture for

DCCS, which includes three stages: system data analysis, knowledge-based monitoring

and prediction, and proactive issue resolution. Although the proposal is valuable, it does

not address data harmonization and interoperability issues, which have been discussed

in other studies.

Given the extensive initiatives undertaken by the European Union, particularly through

Horizon research programs, as well as by other research teams globally, there is a growing

need for the development of a framework focused on effectively managing data across

various domains. The maritime industry serves as one key use case. This need under-

scores the importance of creating advanced management platforms, designed to address

the complex dynamics of sectors like maritime and port environments, as well as institu-

tions and business entities. The introduction of a scalable data management framework

has the potential to enhance the data lifecycle in critical infrastructure sectors and cor-

porate organizations. By leveraging advanced data processing and analytics, this thesis’s

framework could play an assistive role in improving eﬃciency, resilience, and informed

decision-making within CI and business domains.

2.3 Applications of LLMs for Code Generation and Data Analysis Operations

Recently, several proposals have been made to assess the code generation capabilities

of LLMs, as discussed in this section. However, the speciﬁc task of generating code

for data analysis and proﬁling using oﬄine models has not been thoroughly explored

within the research community. As a result, direct comparisons with existing studies are

diﬃcult, as no prior work has speciﬁcally focused on this approach. Alongside a review of

proposed systems and studies on LLMs and code generation, available market solutions

are also presented. Furthermore, works where LLMs are applied to data analysis tasks are

highlighted. Given the ongoing and extensive research into the capabilities of language

models, it is expected that additional studies will emerge in the future.

2.3.1 Code Generation Techniques

This subsection presents existing works relevant to the scope of the current study,

focusing primarily on code generation using LLMs. These studies highlight various pro-

posals and approaches for generating functional code, emphasizing the integration of

LLMs with additional software techniques, and demonstrating the increasing potential of

these models in diverse coding applications.

Feng (2023) [25] published a study on the code generation capabilities of ChatGPT.

Focusing on OpenAI’s LLM, the research presents a scalable framework for crowdsourcing

social data, in order to evaluate the code generation abilities of large language models.

The framework utilizes data from several social media platforms, and supports tasks such

as academic assignment solving, interview preparation, and code debugging. While this

work relates to the current thesis in terms of code generation by language models, it does

not focus on data analytics tasks or the use of oﬄine LLMs. In another study, Gu (2023)

[26] introduces a code generation approach for compiler testing using a language model,

aiming to improve both the quality and quantity of the generated code. The technique

follows a ﬁlter strategy, cleaning the source code to create a high-quality dataset for model

training. This approach explores the capabilities of encoder-decoder models to generate

testing-oriented code, aligning with the current thesis’s focus on leveraging AI tools to

investigate code generation abilities.

An intriguing article by Ross (2023) [27] presents a prototype system designed to ex-

plore the eﬀectiveness of conversational interactions between professionals and LLMs,

and to evaluate how software engineers engage in dialogue with a code-ﬂuent LLM. The

ﬁndings suggest that future frameworks with LLM-powered features could become highly

valuable tools for software engineers, akin to the goals of the current thesis. In a pub-

lication by Soliman (2024) [28], the research team introduces hybrid models for code

generation through the integrated use of other pre-trained language models. The per-

formance of these hybrid models is evaluated using two widely used datasets [29] [30],

and benchmarked against existing state-of-the-art models. The study aims to assess the

feasibility of improving the precision and eﬃciency of LLM code generation, particularly

in complex coding scenarios. While this work does not speciﬁcally focus on oﬄine LLMs,

it is relevant to the current study, as it addresses the broader context of code generation.

This current thesis extends the discussion by examining a use case scenario centered on

data analytics operations.

A paper by Pinna (2024) [31] examines the use of large language models for automatic

code generation, using problem descriptions as input queries. The study focuses on ad-

dressing the issue of LLMs generating incorrect code. The results show an improvement

in code quality, emphasizing the potential of combining LLMs with other techniques to en-

hance the accuracy of the generated code. Building on this idea, the current study adopts

the approach of integrating LLMs with additional software methodologies. Other signiﬁ-

cant contributions include proposals for benchmarks to evaluate the code generation pro-

cess of LLMs. Yu (2024) [32] introduces a benchmark designed to assess code generation

models, particularly focusing on non-standalone functions, which are often overlooked

in existing benchmarks, but constitute the majority of functions in open-source projects.

Additionally, Omari (2024) [33] explores the capabilities of a large language model (mainly

ChatGPT) in detecting and fixing bugs in simple Python programs, further contributing

to the body of research on LLMs in code generation tasks.

2.3.2 Literature Surveys

This subsection presents key literature surveys that, while not directly aligned with

the speciﬁc scope of the current study, provide valuable context. These studies explore the

role of language models in transforming tasks such as broad-spectrum code generation,

debugging, and design, while also highlighting their strengths, limitations, and future

potential.

Fan (2023) [34] published a survey on the potential of large language models in soft-

ware engineering, identifying several research challenges related to using LLMs to assist

software engineers with various technical tasks, including coding, design, bug ﬁxing, and

refactoring. The report emphasizes the creative capabilities LLMs bring to these tasks.

It also stresses the importance of future hybrid approaches that combine LLMs with tra-

ditional software engineering methodologies. Wong (2023) [35] conducted a review of

Natural Language Processing (NLP) techniques — primarily LLMs trained on large code-

bases — in AI-assisted programming operations. The review highlights the signiﬁcant role

of LLMs in a variety of applications, such as code generation, completion, translation, re-

ﬁnement, summarization, defect detection, and clone detection. Additionally, it mentions

examples of AI-enhanced tools, including GitHub Copilot [36] and DeepMind AlphaCode

[37].

Another review on the code generation capabilities of large language models was con-

ducted by Wang (2023) [38]. This study examines recent research focused on code gener-

ation using LLMs, with particular attention to how the generated code is evaluated, and

how LLMs are applied in software engineering tasks. The review highlights the need for

further investigation to address current gaps in the evaluation of LLM-generated code.

Similarly, Liu (2024) [39] carried out a comprehensive literature review on recent devel-

opments in deep learning-based code generation. The study systematically evaluated a

range of recent scientiﬁc publications, and applied a structured methodology to analyze

them. Its goal was to provide insights into code generation using LLMs, identify existing

knowledge gaps, and oﬀer guidance for future research in this evolving ﬁeld.

2.3.3 Other Applications and Market Tools

This subsection reviews recent eﬀorts to ﬁne-tune LLMs, along with practical appli-

cations and commercial tools that build on these models to deliver innovative solutions.

These developments highlight the adaptability of LLMs across various areas, including

hardware design, database optimization, personalized content creation, and AI-driven

data analysis, areas closely related to the scope of this study.

Initial research on adapting LLMs speciﬁcally for code generation has started to

emerge. For instance, Thakur (2024) [89] investigates how LLMs can support hardware

design by completing incomplete Verilog code, a language commonly used in digital circuit

design. Similarly, Mu (2024) [90] introduces a method that enhances code generation,

by allowing an LLM to detect unclear requirements and ask follow-up questions before

producing code. Their results show that this approach improves the code generation

capabilities of online LLMs, like GPT-4, across several benchmark datasets.

Rau (2024) [91] revisits the Unsupervised Passage Retrieval (UPR) method initially

proposed by Sachan (2022) [92], aiming to validate and extend its eﬀectiveness. UPR

leverages the generative capabilities of large language models to create zero-shot ques-

tions, which are then used to re-rank text passages and improve retrieval accuracy. Rau’s

study conﬁrms the method’s strong performance on the BEIR benchmark [93], while also

broadening the evaluation to include the 2019 and 2020 TREC Deep Learning tracks [94].

In a related line of work, Qu (2024) [95] explores the use of large language models for

generating personalized product reviews in e-commerce. The proposed method introduces

a graph-enhanced prompt learning strategy that improves the diversity and relevance of

generated reviews, by utilizing a pre-trained language model (PLM). Meanwhile, Zhou

(2024) [95] presents a framework that applies LLMs to optimize database systems. This

work addresses key challenges, such as constructing eﬀective prompts, modeling both

logical and physical aspects of databases, and preserving data privacy during optimiza-

tion.

In addition to existing research and published approaches, AI-powered tools for data

proﬁling and analysis are increasingly being integrated into modern data workﬂows.

These tools are also becoming available as deployment options within major platforms.

One example is OpenAI’s Codex [96], a ﬁne-tuned version of GPT-3 that powers GitHub

Copilot [36]. Codex demonstrates how large language models can assist in generating code

for data proﬁling and analysis, particularly in cloud-based environments. It operates by

processing user inputs on remote servers equipped with high-performance computing

resources, oﬀering code suggestions that can enhance productivity.

However, this cloud-based approach may raise privacy concerns, especially when

sensitive data is transmitted to external servers for processing, potentially compromising

data conﬁdentiality. Beyond Codex, other tools built on GPT models — most of which

also rely on cloud infrastructure — oﬀer natural language processing capabilities that

support various data analysis tasks. Although these tools provide valuable functionality,

their reliance on cloud processing introduces potential risks related to data privacy and

security.

In contrast to cloud-only solutions, several data analytics platforms that incorporate AI-powered features offer greater deployment flexibility, which can help address data privacy concerns. Platforms such as DataRobot [97], ThoughtSpot [98], Tableau [99], and Microsoft Power BI [100] provide options for both cloud-based and on-premises deployment. For instance, DataRobot supports on-premises installations, allowing organizations that handle sensitive information to benefit from AI capabilities while keeping their data within their own infrastructure. ThoughtSpot and Tableau also offer on-premises solutions, giving users more control over their data by enabling local processing. Likewise, Microsoft Power BI supports a hybrid model, allowing users to choose between cloud and on-premises environments based on their specific privacy and security requirements.

2.3.4 Prompt Engineering

Prompt engineering is a key component of the system proposed in this thesis and

deserves particular attention. It involves designing and reﬁning the inputs given to a gen-

erative AI model, in order to ensure that its outputs are accurate, relevant, and aligned

with the user’s intent [40] [41]. By carefully shaping prompts, developers can guide the

model to better understand not only the language used, but also the context and pur-

pose of each query. This approach has proven to be a practical method for improving

the quality of model responses in various applications, including software development

and customer support systems. Eﬀective prompt engineering can signiﬁcantly reduce the

need for extensive post-processing later in the workﬂow, contributing to more eﬃcient

and streamlined AI operations [42] [43]. Well-constructed prompts can help AI systems

produce more accurate insights, generate useful code fragments, or even simulate cyber-

security scenarios, depending on the task at hand.

Various prompting techniques have been developed to enable more advanced inter-

actions with generative AI models. Zero-shot prompting involves assigning a task to the

model without providing examples, relying entirely on its pre-trained knowledge. Few-

shot prompting enhances accuracy by supplying a small number of examples, aiming

to guide the model’s responses when handling similar tasks. Chain-of-thought prompting

encourages the model to break down complex problems into a sequence of logical

steps, supporting reasoning and multi-step problem-solving. These techniques allow AI

systems to perform tasks beyond their original training and to follow more intricate rea-

soning processes. Ongoing research continues to expand the ﬁeld of prompt engineering,

with recent studies proposing new strategies and frameworks to improve its eﬀectiveness

[44] [45] [46]. As generative AI models become more powerful and versatile, prompt engi-

neering is expected to play an increasingly important role in maximizing their capabilities.
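To make the three prompting styles concrete, the following minimal sketch builds a zero-shot, a few-shot, and a chain-of-thought prompt as plain strings. The function names, the example task, and the wording of each prompt are illustrative assumptions, not prompts taken from this thesis, and no model is called.

```python
# Illustrative prompt builders; names and wording are assumptions for this sketch.

def zero_shot(task: str) -> str:
    """Zero-shot: the task alone, with no worked examples."""
    return f"Task: {task}\nAnswer:"


def few_shot(task: str, examples: list) -> str:
    """Few-shot: a handful of solved (task, answer) pairs precede the new task."""
    shots = "\n".join(f"Task: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\nTask: {task}\nAnswer:"


def chain_of_thought(task: str) -> str:
    """Chain-of-thought: ask for intermediate reasoning steps before the answer."""
    return (
        f"Task: {task}\n"
        "Work through the problem step by step, then state the final answer.\n"
        "Reasoning:"
    )


task = "Write a filter that keeps rows where temperature is below 50."
examples = [
    ("Write a filter that keeps rows where humidity is above 30.",
     "df.filter(df.humidity > 30)"),
]
print(zero_shot(task))
print(few_shot(task, examples))
print(chain_of_thought(task))
```

The only difference between the three builders is how much guidance surrounds the task text, which is the distinction the paragraph above draws.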

In summary, the ﬁeld of LLMs is rapidly advancing, with ongoing research exploring

a wide range of capabilities. Current studies address topics such as general-purpose

code generation, compiler testing, and the integration of hybrid model architectures,

highlighting the dynamic nature of AI-assisted software development. Surveys in the

literature also underline the potential of LLMs in software engineering tasks. Beyond

programming, applications extend to areas like hardware design automation and data

analytics powered by AI, supported by commercial tools that oﬀer both cloud-based and

on-premises deployment options. Additionally, prompt engineering techniques continue

to enhance the performance of LLMs by improving their accuracy and eﬃciency. As this

field progresses, future research is expected to investigate

new directions and expand the practical use of LLMs across various domains.

Chapter 3

Study Design, Methodology, and Implementation: A Complete Analysis of the Framework

3.1 Initial Generic Overview

3.1.1 Outline

The proposed software framework is conceived to offer data professionals meaningful insights into the quality of a chosen dataset. Users would be able to select the dataset they wish to analyze, after which the system would perform a series of operations to generate quality-related metrics. While data quality assessment is a common task in data science, this study distinguishes itself from most existing tools by enabling end-users to submit their own analysis and profiling queries in natural language. Additionally, the framework is designed to handle datasets of varying sizes efficiently, offering a modern approach to traditional data profiling.

Figure 3.1 illustrates the conceptual architecture of this thesis’s framework. The process would begin when the data recipient or end-user selects a dataset from a designated data lake for quality analysis. Based on the user’s preference, the system would retrieve either the full dataset or a sample from the backend. Once the dataset is successfully loaded, the first processing stage, known as the “Data Profiler,” would be activated. This component aims to perform three key tasks: (i) apply a set of predefined quality rules to the dataset, (ii) generate quality-related metrics based on those rules, and (iii) deliver the resulting analytics to the end-user.

At this point, the user gains an overview of the dataset’s quality through the application of general rules. If a more tailored analysis is desired, the user can define custom quality queries using a simple prompt interface, expressing each quality metric in natural language. These user-defined rules, once configured, would be sent back to the middleware, triggering the system’s second processing stage, the “Data Evaluator.” This component aims to extend the initial profile generated by the Data Profiler by incorporating the newly provided, user-specific quality queries. If necessary, the Data Evaluator would reprocess the dataset and then: (i) apply the custom rules, (ii) perform additional filtering based on those rules, and (iii) return the enhanced quality insights to the frontend in the form of a structured output.

Figure 3.1. The initial architecture of the proposed scalable data management and analysis framework, along with its two operational flows, “Data Profiler” and “Data Evaluator” [2].

The initial objective during the conceptual phase of this study was to design a software asset as a lightweight and easily deployable tool for end-users. To achieve this, the system should operate efficiently while requiring minimal computational resources, such as those typically available on consumer-grade machines. The development target includes support for systems with an average CPU of up to six cores and 5 GiB of RAM, hardware commonly found in budget-friendly personal computers.

A performance benchmark has been established, setting the goal for the software to process and generate a quality profile for big data at a rate of 1 GB per minute (1 GB/min). This benchmark, when met under the defined hardware constraints, will demonstrate the tool’s capacity to efficiently analyze complete large-scale datasets, reinforcing its suitability for widespread deployment. To meet this performance target, the study expanded to leverage insights from a scalable processing framework that was tested on critical infrastructure datasets [1], as presented in the next section. As mentioned before, both the present work and the implemented data management system are conceptually grounded in previous research [3].
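As a quick sanity check on what the 1 GB/min target implies, the sketch below converts it into a per-second rate and estimates the processing time for a dataset of a given size. It assumes decimal units (1 GB = 1000 MB), which the text does not specify, and the function names and the 25 GB dataset are purely illustrative.

```python
# Assumption: decimal units (1 GB = 1000 MB); the text does not state which convention applies.
TARGET_GB_PER_MIN = 1.0


def target_mb_per_second(gb_per_min: float = TARGET_GB_PER_MIN) -> float:
    """Convert a GB/min rate into MB/s."""
    return gb_per_min * 1000 / 60


def minutes_to_profile(dataset_gb: float, gb_per_min: float = TARGET_GB_PER_MIN) -> float:
    """Time needed to profile a dataset if the target rate is sustained throughout."""
    return dataset_gb / gb_per_min


print(round(target_mb_per_second(), 1))  # sustained rate in MB/s
print(minutes_to_profile(25))            # minutes for a hypothetical 25 GB dataset
```

Under these assumptions the target corresponds to a sustained rate of roughly 16.7 MB/s on the stated consumer-grade hardware.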

3.1.2 Framework Operational Flows

Data Proﬁler

The Data Profiler aims to serve as the initial component in the proposed framework’s operational workflow. Its execution would begin as soon as the end-user selects a dataset for profiling. Depending on the user’s preference (whether to analyze the complete dataset or a sample), the system would retrieve the corresponding data from a designated repository. The Data Profiler would then initiate its process, following a series of structured steps, as illustrated in Figure 3.1:

• Perform basic pre-processing on the incoming dataset to prepare it for analysis.

• Load a predefined set of general quality rules or key performance indicators (KPIs) by parsing a configuration file containing these definitions.

• Apply each rule to the appropriate columns in the dataset. For example, timestamp-related rules are applied to timestamp fields, while rules intended for all columns are applied to every column.

• Generate profiling metrics based on the applied rules, compiling the necessary information to construct the dataset’s quality profile.

• Assemble the profile template using the extracted quality analytics.

After generating and forwarding the profile template to the end-user, the Data Profiler would conclude its operation. It is designed to run once per dataset selection during a given session, and would be reactivated only when the user initiates profiling on a new dataset.
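The Data Profiler steps listed above can be sketched in plain Python as follows. This is only an illustration of the control flow under stated assumptions: the rule names (`completeness`, `in_range`), the configuration layout, and the in-memory list-of-dicts dataset are all hypothetical, the pre-processing step is omitted, and the framework itself is intended to run these steps as PySpark jobs rather than with the standard library.

```python
import json

# Hypothetical rule configuration; the real file format is not given in the text.
CONFIG = json.loads("""
{
  "rules": [
    {"name": "completeness", "columns": "*"},
    {"name": "in_range", "columns": ["temperature"], "min": -50, "max": 60}
  ]
}
""")


def completeness(values, rule):
    """Share of non-missing values in a column."""
    present = [v for v in values if v is not None]
    return len(present) / len(values) if values else 0.0


def in_range(values, rule):
    """Share of non-missing values that fall inside [min, max]."""
    present = [v for v in values if v is not None]
    ok = [v for v in present if rule["min"] <= v <= rule["max"]]
    return len(ok) / len(present) if present else 0.0


RULES = {"completeness": completeness, "in_range": in_range}


def profile(rows, config):
    """Apply every configured rule to its target columns and assemble a profile."""
    columns = list(rows[0].keys()) if rows else []
    result = {col: {} for col in columns}
    for rule in config["rules"]:
        # "*" marks a rule intended for all columns (an assumption of this sketch).
        targets = columns if rule["columns"] == "*" else rule["columns"]
        for col in targets:
            if col not in result:
                continue  # the rule names a column this dataset does not have
            values = [row.get(col) for row in rows]
            result[col][rule["name"]] = RULES[rule["name"]](values, rule)
    return result


rows = [
    {"sensor": "a", "temperature": 21.5},
    {"sensor": "b", "temperature": None},
    {"sensor": None, "temperature": 120.0},
    {"sensor": "d", "temperature": 19.0},
]
print(profile(rows, CONFIG))
```

Keeping the rules in a lookup table keyed by name mirrors the idea of parsing rule definitions from a configuration file: adding a rule means adding one function and one configuration entry.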

Data Evaluator

The Data Evaluator aims to serve as the second component in the proposed framework’s processing workflow. It would be activated only after the Data Profiler has completed its operation and the end-user has opted to apply custom, user-defined queries in order to enhance the dataset’s quality analysis. When the user decides to gain deeper insights through personalized criteria, the Data Evaluator would assume control. Upon receiving analysis queries, expressed in natural language, from the end-user, the Data Evaluator would carry out the following tasks, as illustrated in Figure 3.1:

• Determine whether the dataset requires re-processing. If so, perform a basic pre-processing step, similar to that of the Data Profiler.

• Retrieve the new quality queries defined by the user, which are sent to the component via the platform’s frontend.

• Apply each rule to the relevant fields in the dataset. As with the Data Profiler, if a rule is applicable to all columns, it will be applied universally. However, the queries would first have to be transformed from their natural language form into an executable format.

• Perform additional filtering based on the user-defined queries, and extract new quality metrics to extend the existing dataset profile.

• Update the previously generated profile template by incorporating the new analytics.

Unlike the Data Proﬁler, which would execute only once per session when the user

selected a dataset, the Data Evaluator could be invoked multiple times within the same

session. This ﬂexibility would allow users to continuously reﬁne the quality analysis, by

submitting new rules based on feedback received after each evaluation.
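The difference between the two flows (profile once, evaluate repeatedly) can be illustrated with a small sketch. Everything here is hypothetical: `translate` is a stand-in lookup table for the step that turns natural language into executable form, which this section does not specify, and the query strings and profile layout are invented for the example.

```python
# Stand-in for the natural-language-to-executable step (an assumption of this sketch;
# the section above does not specify how the transformation is performed).
TRANSLATIONS = {
    "temperature below 50": lambda row: row["temperature"] < 50,
    "sensor is not missing": lambda row: row["sensor"] is not None,
}


def translate(query):
    """Return an executable predicate for a known query, or None if unsupported."""
    return TRANSLATIONS.get(query)


def evaluate(rows, profile, queries):
    """Extend an existing profile with one metric per user-defined query."""
    updated = dict(profile)
    custom = dict(updated.get("custom", {}))
    for query in queries:
        predicate = translate(query)
        if predicate is None:
            continue  # untranslatable queries are skipped in this sketch
        matching = [row for row in rows if predicate(row)]
        custom[query] = {"matching_rows": len(matching), "total_rows": len(rows)}
    updated["custom"] = custom
    return updated


rows = [
    {"sensor": "a", "temperature": 21.5},
    {"sensor": None, "temperature": 120.0},
    {"sensor": "c", "temperature": 19.0},
]

# The profiler output is produced once per dataset selection...
profile = {"row_count": len(rows)}
# ...while the evaluator can be invoked repeatedly within the same session.
profile = evaluate(rows, profile, ["temperature below 50"])
profile = evaluate(rows, profile, ["sensor is not missing", "unknown query"])
print(profile)
```

Each call returns a new profile that keeps the earlier metrics, which is what allows the analysis to be refined step by step.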

3.1.3 Technologies To Use

To ensure efficient and reliable management of the entire process, the proposed data management and analysis asset should be developed on top of the Apache Spark framework. This decision supports one of the study’s objectives: to perform profiling operations across a wide range of data volumes, from small datasets and samples to large-scale data. The Python programming language should be used to implement both the Data Profiler and the Data Evaluator, leveraging its extensive ecosystem of data science libraries and processing tools.

The implementation should involve setting up a standalone, but scalable, Apache

Spark instance on a single physical machine. Within this environment, the Data Proﬁler

and Data Evaluator will be executed as PySpark jobs, enabling seamless communication

with end-users, and maintaining high availability. This setup is designed to provide

fast processing times, even for larger datasets. Importantly, the software infrastructure

should be developed as a general-purpose module, capable of supporting tabular datasets

of various formats and structures.

During the selection of the underlying processing framework, the study considered

Apache Spark [47], Apache Flink [101], and Apache Storm [102]. All three are well-

established tools in the data processing domain. After evaluating their capabilities in

relation to the proposed system’s requirements, Spark was selected as the most appro-

priate choice. While Storm excels in real-time data processing [103], the current system

is intended for proﬁling existing or historical datasets, where batch processing is more

suitable.

Both Spark and Flink oﬀer strong performance for batch processing tasks, with their

eﬀectiveness varying depending on factors such as dataset size, data type, and workload

patterns [104] [105]. However, Spark stands out due to its better scalability, broader

analytical support, faster execution in many scenarios, and strong community and documentation

support [48]. For these reasons, Spark should ultimately be chosen as the

foundation for the future system’s development.

In the development phase of the system, Python [106] should be chosen as the primary

programming language. Python’s clean and concise syntax promotes rapid development

and easy maintenance. Its extensive collection of libraries and frameworks — such as

NumPy, Pandas, and SciPy — oﬀers powerful tools for data manipulation and analysis,

which are essential for Big Data processing and other related tasks [107]. Additionally,

Python’s compatibility with Apache Spark, via PySpark [108] (Python’s API for Spark),

enhances its suitability for parallel and distributed computing. This integration allows

developers to utilize Spark’s powerful capabilities for large-scale data processing, while

beneﬁting from Python’s simplicity and ease of use. Consequently, Python should be

selected as the preferred language for the development of the proposed system.

3.2 The Scalable Data Management Software Infrastructure

3.2.1 Data Processing and Virtualization

Once the desired data have entered the organization’s database, the next step is to

make these data accessible as a service. This capability should allow developers to build

cognitive and data-driven applications. To support this functionality, a data processing

and virtualization middleware is required. This middleware should serve as an inter-

mediary layer that connects data providers with data consumers, enabling eﬃcient and

seamless interaction. This role is fulfilled by the scalable data management layer proposed in this thesis.

Data virtualization is a method of data integration that provides access to information

through a virtual service layer, independent of the physical location of the underlying data

sources. This approach enables applications to retrieve data from multiple heterogeneous

systems through a single access point, oﬀering a uniﬁed and abstracted view for querying.

In addition to unifying access, the virtualization layer also handles data transformation

and processing, ensuring that the data are appropriately prepared for use. One of the

key challenges in data virtualization involves managing the integration of diverse storage

systems, such as key-value stores, document databases, and relational databases. Each

system has distinct characteristics, making seamless integration complex. Furthermore,

data-intensive applications that rely on virtualized sources require guarantees regarding

performance, availability, and data quality, placing additional demands on the virtualiza-

tion system.

The proposed scalable data management layer is designed to address these challenges

and support the platform’s data interoperability goals. Its primary responsibility is to sat-

isfy the data quality requirements speciﬁc to each source, whether a critical infrastructure

or a business organization. The layer ensures proper preparation of data inputs from

various sources within the project’s architecture, maintains relevant metadata for all in-

coming data feeds, and makes the cleaned and processed datasets available to consumers.

Most of the layer’s workload originates from persistent data streams, which include pre-

viously collected and stored data. The complete architecture and data ﬂow of the scalable

data management layer are illustrated in Figure 3.2.

Figure 3.2. Architectural view of the scalable data management layer, with its three

subcomponents [1].

For the data processing component of the layer, Apache Spark [47] was selected as

the core framework, in line with the initial planning. Spark supports a variety

of programming languages, including Java, Scala, Python, and R, and is backed by ex-

tensive documentation and a large user community. These characteristics contribute to

its popularity and ease of adoption. Moreover, Spark is capable of delivering advanced

analytical outcomes, making it suitable for complex data processing tasks. While other

frameworks oﬀer distinct strengths and limitations, Spark distinguishes itself through its

ﬂexibility, scalability, and eﬃcient execution times [48], as previously stated.

The scalable data management layer is structured around three main subcomponents,

which work in coordination to fulﬁll its objectives: the Pre-Processing and Filtering Tool,

the Virtual Data Repository, and the Virtual Data Container.

3.2.2 Pre-Processing and Filtering Tool

The Pre-Processing and Filtering Tool is the ﬁrst subcomponent of the scalable data

management layer, responsible for preparing datasets received from external sources.

Designed with a generic and adaptable structure, this tool can handle a wide range of

data types eﬀectively. Upon receiving the complete dataset, it constructs a dataframe,

which serves as a tabular representation of all collected data. To maintain data consis-

tency, the tool examines the dataset’s metadata to identify the appropriate data types for

each column. When inconsistencies are detected, it performs data type corrections, an

essential step to ensure the reliability of subsequent data processing tasks that depend

on the structural integrity of the dataset. After aligning data types, the tool initiates the

cleaning and ﬁltering process, applying a set of standard pre-processing techniques to

prepare the dataset for further use:

• Removal of whitespace from all cells containing string-type values, ensuring consistent formatting.

• Conversion of empty cells and cells containing the literal string “NULL” to NaN (Not a Number) values across all columns.

• Deletion of records (rows) that either lack datetime values or contain invalid datetime entries.

• Transformation of valid datetime values to the UTC format, to ensure consistency and support standardization.
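The four cleaning steps listed above can be sketched with the Python standard library. This is an illustration under assumptions, not the tool’s PySpark code: the `timestamp` column name is hypothetical, `None` stands in for NaN, whitespace removal is interpreted as trimming leading and trailing whitespace, timestamps are assumed to be ISO 8601 strings, and naive timestamps are assumed to be in UTC already.

```python
from datetime import datetime, timezone


def clean_cell(value):
    """Trim strings and map empty strings or the literal 'NULL' to a missing value."""
    if isinstance(value, str):
        value = value.strip()
        if value == "" or value == "NULL":
            return None  # None stands in for NaN in this sketch
    return value


def to_utc(value):
    """Parse an ISO 8601 string and convert it to UTC; return None if invalid."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def preprocess(rows, datetime_column="timestamp"):
    """Clean every cell, drop rows without a valid datetime, normalize the rest to UTC."""
    cleaned = []
    for row in rows:
        row = {key: clean_cell(value) for key, value in row.items()}
        stamp = to_utc(row.get(datetime_column))
        if stamp is None:
            continue  # missing or invalid datetime: the record is removed
        row[datetime_column] = stamp
        cleaned.append(row)
    return cleaned


rows = [
    {"timestamp": "2024-01-01T12:00:00+02:00", "site": "  plant A ", "load": "NULL"},
    {"timestamp": "not a date", "site": "plant B", "load": "3.2"},
    {"timestamp": "", "site": "plant C", "load": "1.1"},
]
for row in preprocess(rows):
    print(row)
```

Only the first record survives: its site name is trimmed, its `NULL` load becomes a missing value, and its timestamp is shifted from UTC+2 to UTC.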

Before completing its execution, the Pre-Processing and Filtering Tool performs two

additional key operations to enrich the dataset with analytical insights. First, it generates

a correlation matrix, which evaluates the statistical relationships among the dataset’s

columns. This correlation data is then stored alongside the cleaned dataset in the Virtual

Data Repository (described in a subsequent subsection), enabling further analysis at later

stages. In addition, the tool carries out outlier detection on all numerical columns. For

each column assessed, a new auxiliary column is created, containing "yes" or "no" values

to indicate whether the corresponding data point is classiﬁed as an outlier. The default

method for detecting outliers is based on the three standard deviations from the mean

rule, which serves as a threshold for identifying anomalous values:

A value $x$ is considered an outlier if $|x - \mu| > 3\sigma$,

where:

$x$: the data point (observation)

$\mu$: the mean of the dataset (for the column)

$\sigma$: the standard deviation of the dataset (for the column)

However, this threshold can be modiﬁed to suit speciﬁc requirements deﬁned by users

or data consumers. It is important to note that the primary implementation of the Pre-

Processing and Filtering Tool is developed in Python, as a PySpark job, allowing seamless

integration with the Apache Spark framework while leveraging Python’s data processing

capabilities.
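A minimal sketch of the two enrichment steps, using only the standard library, is given below. Assumptions to note: the correlation measure is taken to be the Pearson coefficient (the text does not name one), the sample standard deviation is used for σ (the text does not say whether sample or population), and the column names and the `_outlier` suffix are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of numbers."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


def outlier_flags(values, k=3.0):
    """Flag x as 'yes' when |x - mean| > k * standard deviation, otherwise 'no'."""
    mu, sigma = mean(values), stdev(values)  # sample standard deviation (assumption)
    return ["yes" if abs(x - mu) > k * sigma else "no" for x in values]


columns = {
    "load": [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 10, 10, 12, 9, 10, 11, 10, 100],
}
columns["double_load"] = [2 * x for x in columns["load"]]

# One auxiliary "yes"/"no" column per numeric column that is assessed.
columns_with_flags = dict(columns)
columns_with_flags["load_outlier"] = outlier_flags(columns["load"])

matrix = correlation_matrix(columns)
print(matrix["load"]["double_load"])
print(columns_with_flags["load_outlier"])
```

The `k` parameter corresponds to the adjustable threshold mentioned above: raising it from 3 to 5 in this example would leave the value 100 unflagged.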

In summary, the Pre-Processing and Filtering Tool carries out thorough pre-processing,

cleaning, and ﬁltering operations for each incoming dataset. During the initial phase,

the entire dataset is ingested and transformed into a structure compatible with Python,

adopting a tabular column-row format suitable for further processing. The cleaning stage

focuses on detecting and resolving data inconsistencies, including NULL values, empty

ﬁelds, outliers, and other invalid entries. Following this, a comprehensive ﬁltering pro-

cess is applied to address all identiﬁed issues, either by replacing erroneous values or by

removing the aﬀected records from the dataset or dataframe entirely.

3.2.3 Virtual Data Repository

The second subcomponent of the scalable data management layer is the Virtual Data

Repository (VDR), which functions as the temporary storage unit for all datasets that

have been pre-processed, cleaned, and ﬁltered by the Pre-Processing and Filtering Tool.

Once a dataset has completed these operations, it is stored in the VDR together with its

corresponding correlation matrix, which captures inter-column relationships.

To meet the performance and scalability demands of the layer, the VDR is imple-

mented using MongoDB [49], a widely adopted document-oriented database. MongoDB

was selected for its robust support of auto-scaling, sharding, and ﬂexible conﬁguration,

all of which align well with the virtualization goals of the system. Furthermore, Mon-

goDB’s compatibility with Kubernetes [50], a powerful container orchestration platform,

enhances its ability to operate in dynamic, distributed environments. As a result, the VDR

is deployed within a Kubernetes cluster, beneﬁting from the platform’s advanced load bal-

ancing, replication, scaling, and scheduling capabilities, thereby ensuring consistent and

eﬃcient data management.

The choice of MongoDB as the database for the Virtual Data Repository (VDR) was

made with relative ease. A document-based NoSQL database was deemed the most suit-

able solution for implementing temporary data storage within the scalable data man-

agement layer, given its alignment with the inherent characteristics of the data being

handled. This type of database is particularly well-suited for semi-structured data that

does not adhere to a rigid schema, yet follows speciﬁc formatting conventions, such as

XML, JSON, and BSON. In contrast, a relational database would require a comprehensive

understanding of the dataset structures in advance, limiting its ﬂexibility to accommo-

date datasets with varied schemas. Given that the data used for testing is formatted in

JSON or CSV, MongoDB was chosen for its superior performance, flexibility, and scalability.

Compared to key-value NoSQL databases, document-based databases

like MongoDB are more capable of supporting complex queries and diverse entity types,

critical features for the functionality of this study’s software infrastructure.

The resulting system is a modiﬁed MongoDB setup, conﬁgured with multiple replicas

to improve its resilience and ensure continuity in the face of system failures. The overall

robustness of the system depends on key factors, such as the number of replicas and

their distribution within the cluster. Since it utilizes more than one replica, the Virtual

Data Repository (VDR) can continue to function seamlessly even if one of the MongoDB

replicas fails. In such instances, the remaining replicas maintain uninterrupted opera-

tion, protecting the data and mitigating the risk of data loss or temporary unavailability.

All MongoDB replicas are treated as a uniﬁed database within the system.

The Kubernetes platform can play an essential role in this architecture, handling load

balancing and distributing data eﬃciently according to optimal conﬁguration strategies.

As a result, from the user’s perspective, the exact location of the queried data, whether

stored on a speciﬁc node or replica, remains abstracted, ensuring a smooth and consis-

tent user experience. The implementation of VDR has been thoroughly examined in a

publication from 2022 [109].

3.2.4 Virtual Data Container

The third and ﬁnal subcomponent is the Virtual Data Container (VDC), which plays

a pivotal role in enabling data recipients to access the stored data in the Virtual Data

Repository (VDR). The VDC is a ﬂexible and general-purpose subcomponent, tasked with

further processing and ﬁltering the data according to speciﬁc queries deﬁned by data

consumers. These ﬁltering rules serve two main functions. First, they enable the selection

of relevant data for a particular user, eﬀectively creating a tailored "data pool" that aligns

with the user’s speciﬁc requirements. Second, the user-based queries help identify and

remove erroneous data that are known to the users from experience, such as extreme

outliers (e.g., outdoor temperatures of minus 100 degrees Celsius), which could indicate

sensor malfunctions.

When a data consumer submits a request to the VDC, they can specify both the

ﬁltering criteria and the desired data format, allowing for customized data transformation.

In addition, the VDC is responsible for providing essential metadata for each dataset

stored in VDR. This metadata includes details such as the dataset’s size, the number of

rows and variables, the timestamp of the last update, and more. The metadata is made

available through a RESTful API, ensuring easy access and integration for data recipients.

Figure 3.3. The information flow of the Virtual Data Container [3].

Regarding the foundation of the Virtual Data Container (VDC), Apache NiFi [110] has

been chosen as the preferred platform for building data ﬂows, allowing the scalable data

management layer to eﬃciently deliver the cleaned and processed data to data consumers.

The flow responsible for implementing the VDC Rules System (which manages data processing

and filtering according to user-defined queries) is also introduced to the user

through NiFi. The actual tasks of processing, ﬁltering, and transforming the data are

carried out by Apache Spark.

When selecting the most suitable solution, several factors were considered. The three

primary tools examined were Apache Flume [111], Kafka [112], and NiFi, all of which are

known for their strong performance, capacity for horizontal scalability, and support for

plug-in frameworks that allow functionality extensions via custom elements. Ultimately,

the decision came down to Apache Flume and Apache NiFi. While Flume is conﬁgured

through conﬁguration ﬁles, NiFi oﬀers a more user-friendly experience with its graphical

interface for conﬁguring and monitoring data ﬂows. Given the advantage of having a

visual interface and NiFi’s proven eﬃciency and scalability, it emerged as the ﬁnal choice

for the VDC foundation.

As for the VDC’s rule structure, it has been implemented with two different approaches. The first, which was selected for the final version of the framework, leverages offline large language models that interpret natural language quality queries and convert them to executable code; it is analysed in the next subsection. The second approach is designed to be simpler and more straightforward, and is presented below. It consists of three key elements for each rule or query: a “subject column”, an “operator”, and an “object”. The expected format of the rules list is a JSON array containing rules represented as JSON objects, each including these specific string values. The VDC interprets and applies the rules from this list to the requested dataset. The structure of the incoming rules JSON file is as follows:

• A JSON object, which includes:
  – A string field denoting the dataset’s name, and another one for the dataset’s ID.
  – A JSON array containing the rules as individual JSON objects.

• Each JSON object (rule) in the array contains:
  – A string field for its name.
  – Another JSON object representing the rule itself, comprising the “subject_column”, “operator”, and “object” fields.
  – If the “operator” is a disjunction (using the “or” expression), the “object” field should be a JSON array containing two or more objects, each with single string “operator” and “object” fields.
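To make the structure above concrete, the following is a minimal sketch of a rules document and of how such rules might be evaluated row by row. The key names `dataset_name`, `dataset_id`, `rules`, `name`, and `rule`, the operator names (`equals`, `greater_than`, `less_than`, `or`), and the sample values are assumptions chosen for illustration; the operators actually supported are those listed in Figure 3.4, and the real VDC applies rules through Spark rather than plain Python. Combining several rules with a logical AND is likewise an assumption. The sketch also drops rules whose operator is not recognized, which is the behaviour this section describes for unrecognized operators.

```python
import json

# Hypothetical operator table; the real list is the one given in Figure 3.4.
OPERATORS = {
    "equals": lambda value, obj: str(value) == obj,
    "greater_than": lambda value, obj: float(value) > float(obj),
    "less_than": lambda value, obj: float(value) < float(obj),
}

# Illustrative rules document: key names other than subject_column,
# operator, and object are assumptions.
RULES_JSON = """
{
  "dataset_name": "supermarket_sales",
  "dataset_id": "ds-001",
  "rules": [
    {"name": "members_only",
     "rule": {"subject_column": "customer_type",
              "operator": "equals", "object": "Member"}},
    {"name": "cheap_or_expensive",
     "rule": {"subject_column": "unit_price", "operator": "or",
              "object": [{"operator": "less_than", "object": "20"},
                         {"operator": "greater_than", "object": "90"}]}},
    {"name": "unsupported",
     "rule": {"subject_column": "city",
              "operator": "sounds_like", "object": "Athens"}}
  ]
}
"""


def is_supported(rule):
    """Return True if the rule, and every branch of an 'or' rule, uses a known operator."""
    if rule["operator"] == "or":
        branches = rule["object"]
        return (
            isinstance(branches, list)
            and len(branches) >= 2
            and all(b["operator"] in OPERATORS for b in branches)
        )
    return rule["operator"] in OPERATORS


def matches(row, rule):
    """Evaluate one subject-operator-object rule against a single row (a dict)."""
    value = row[rule["subject_column"]]
    if rule["operator"] == "or":
        return any(OPERATORS[b["operator"]](value, b["object"]) for b in rule["object"])
    return OPERATORS[rule["operator"]](value, rule["object"])


def apply_rules(rows, document):
    """Drop unsupported rules, then keep the rows that satisfy all remaining rules."""
    active = [r["rule"] for r in document["rules"] if is_supported(r["rule"])]
    return [row for row in rows if all(matches(row, rule) for rule in active)]


rows = [
    {"customer_type": "Member", "unit_price": "15.5", "city": "Yangon"},
    {"customer_type": "Member", "unit_price": "50.0", "city": "Yangon"},
    {"customer_type": "Normal", "unit_price": "95.0", "city": "Mandalay"},
]
filtered = apply_rules(rows, json.loads(RULES_JSON))
print(filtered)
```

Because every rule names exactly one subject column, the evaluator never needs to look at more than one field per rule, which keeps it independent of any particular dataset.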

With this rule-based procedure (the alternative to the main LLM-based approach presented

later on), the Virtual Data Container is able to accurately interpret and

apply the speciﬁed rules to the dataset, ensuring eﬃcient ﬁltering and transformation of

the data according to each data consumer’s requirements. Figure 3.4 presents the list of

supported operators that rule authors — including data scientists, developers, and end-

users — can utilize. Any rule object containing unrecognized operators is automatically

excluded from the processing pipeline. The rules system is based on two core principles,

designed to guide users in understanding the intended structure and purpose of the rules:

1. Each rule is designed to ﬁlter a single subject (i.e., one dataset column) at a time,

without combining multiple columns within the same rule. If a rule targets more

than one column, it is better categorized as a pre-processing operation rather than

a straightforward ﬁltering task. Likewise, low-level operations, such as altering

individual row values or removing rows based on speciﬁc value types, fall outside

the scope of this ﬁltering-focused system.

2. To maintain the generic nature of the scalable data management layer, the rules

must conform to a standard subject–operator–object format. Introducing complex

rules that deviate from this structure would undermine the layer’s reusability across

diverse datasets. While dataset-speciﬁc pre-processing logic can be implemented

with minimal eﬀort, such conditional handling (e.g., “if dataset X, apply logic Y”)

would compromise the layer’s core design principle of broad applicability and mod-

ularity.

Figure 3.4. Accepted operators by VDC’s Rules System [1].

3.3 The Final System with the Utilization of Oﬄine LLMs

As previously mentioned, one main objective is to design a scalable framework for data

proﬁling and analysis that can eﬀectively handle datasets of various sizes. This thesis’s

ﬁnal system implementation aims to allow end-users to submit analytical queries using

natural language. These queries are then interpreted and converted into executable code

by an oﬄine large language model (LLM). The resulting code is passed to the aforemen-

tioned data management and processing layer for execution. This process enables users

to perform analytical tasks simply by describing them in plain language, enhancing ac-

cessibility and ease of use. While the underlying data platform can be securely deployed,

the integration of an LLM in this context requires further validation. Although tradi-

tional statistical techniques are not applied in this study, methodological rigor is ensured

through a well-structured experimental design.

In brief, the ﬁnal part of this study begins by establishing its primary aim: to assess

the capability of an oﬄine LLM in generating valid code for data analysis tasks. The next

step involves selecting suitable datasets for testing and evaluation. This is followed by the

formulation of ﬁve unique queries per dataset. An execution plan is then developed, which

outlines the number of test iterations and deﬁnes the communication process between

the LLM and the data processing system. Evaluation metrics are also speciﬁed, with

justiﬁcation provided for their use. Following this, results are collected and analyzed to

evaluate the LLM’s performance. Finally, conclusions are drawn based on the ﬁndings,

and suggestions for future improvements are presented.

In detail, the methodology followed in this study is outlined as follows:

1. Objective Deﬁnition: This ﬁnal and main part of this study aims to evaluate the

eﬀectiveness and eﬃciency of an oﬄine large language model in generating code for

data analysis tasks. The LLMs used in this evaluation are Codestral by Mistral AI

and Qwen 2.5 Coder by Alibaba. These models are tested on their ability to produce

Python code using the PySpark framework, focusing on data proﬁling and analytical

operations. The reasons behind the selection of Codestral, Qwen, and PySpark are

discussed in subsection 3.4.1.

2. Dataset Selection: Five publicly available datasets were chosen to test and validate

the proposed approach. These datasets cover a range of domains, including social

media analytics, weather data, and supermarket sales records. All selected datasets

are introduced in subsection 3.3.2. Using multiple datasets instead of a single

one helps evaluate the LLMs’ ability to generalize across diﬀerent data types and

contexts, ensuring that their performance is not limited to one domain.

3. Query Design: For each of the ﬁve datasets, a set of ﬁve queries has been created,

resulting in a total of 25 queries used for testing. These queries are grouped into

three categories based on their complexity: “Basic”, “Intermediate”, and “Advanced”.

This classiﬁcation reﬂects the level of diﬃculty in the code generation tasks required

from the LLM. Their deﬁnition is as follows:

• Basic: These queries focus on simple tasks such as filtering data, counting records, or extracting distinct values. They require fundamental operations that help reveal the basic structure and contents of the dataset.

• Intermediate: These queries involve moderately complex tasks, including grouping data, calculating aggregates, or performing basic arithmetic functions. Although more involved than basic queries, they still rely on standard data processing methods.

• Advanced: These queries address more sophisticated operations, such as multi-level grouping, column transformations (e.g., merging or exploding), and performing statistical analyses. They are designed to assess the LLM’s capability to generate code for advanced data manipulation, and to extract meaningful insights from the data.

The full set of 25 authored queries is presented in subsection 3.4.3.

4. Execution Plan: Each of the 25 queries will be submitted to the oﬄine LLM ten times.

During each iteration, the entire process will be carried out, and the outcomes will

be documented. In total, ten tests will be performed for each query. This repetition

aims to evaluate the model’s consistency by observing variations in the generated

code, as well as the stability and reproducibility of the outputs. This approach is in

line with established experimental practices, where repeated testing enhances the

reliability of ﬁndings [51]. Overall, 250 tests will be conducted—ten for each of the

25 queries—based on the datasets selected for this study.

5. Evaluation Metrics: The outcome of each test will be assessed using a set of deﬁned

evaluation metrics. Once all 250 tests are completed, the results will be compiled

into a uniﬁed evaluation dataset. As the study is carried out independently for two

oﬄine LLMs, the total number of tests will reach 500. Each LLM’s performance will

be analyzed based on the following evaluation criteria:

• Functional Correctness: This metric evaluates whether the generated code

produces the expected output. In other words, it assesses if the code success-

fully delivers the result that the user intended, based on the original natural

language query.

• Readability: This metric reflects how easily a human can read and understand

the generated code. The score is calculated using a custom function imple-

mented within the data processing platform, described in Appendix A.1. The

function considers three key aspects: line length, the depth of method call

chains, and the level of code nesting. Penalties are applied to code with lines

exceeding 80 characters, method chains with more than three calls, and nest-

ing deeper than two levels. The function returns a score from 1 to 3, where a

score of 3 indicates high readability.

• Efficiency: This refers to how efficiently the code generation process utilizes

computational resources. Key performance indicators include response time,

GPU and CPU usage, and memory consumption for both processing units.

These are monitored using a Python-based resource tracking tool. This metric

helps evaluate the computational cost associated with each test.

• Contextual Performance: This metric assesses how well the LLM handles

queries of varying complexity levels—Basic, Intermediate, and Advanced—as

previously deﬁned. The results provide insight into the model’s ability to adapt

to diﬀerent contextual demands, and allow for performance comparisons across

these categories.

• Automation: This metric evaluates whether the code produced by the LLM

can be executed automatically or with minimal manual adjustments. In some

instances, the model may include natural language explanations outside of

Python comments, despite explicit instructions to avoid such outputs. Ad-

ditionally, the LLM may use diﬀerent variable names than those speciﬁed in

the prompt. When such deviations occur, manual intervention is required to

modify or clarify the generated code before proceeding.

• Error Handling: This metric checks whether the generated code runs without

errors. It identiﬁes any issues encountered during execution, ranging from

minor warnings to critical errors that could interrupt or terminate the process.

6. Data Collection and Analysis: The final step of the methodology involves collecting

the results from all 250 tests of each LLM, combining them into a unified dataset of 250

rows per model, and conducting an exploratory analysis. This analysis will offer valuable

insights into each LLM’s performance across the various evaluation metrics. Ultimately,

the primary objective of this part of the thesis will be assessed: whether offline LLMs

can effectively support data science by generating code for data analysis tasks.
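The Readability metric described above can be illustrated with a short sketch. The thresholds (80 characters, three chained calls, two nesting levels) and the 1 to 3 range come from the text; everything else is an assumption, in particular the aggregation rule (one point lost per violated aspect, never going below 1) and the use of Python’s `ast` module to measure chains and nesting. This is not the function from Appendix A.1, only one plausible way to realize the same idea.

```python
import ast

MAX_LINE_LENGTH = 80   # thresholds taken from the metric's description
MAX_CHAIN_CALLS = 3
MAX_NESTING = 2

# Statements that open a new nesting level (an assumed, non-exhaustive set).
BLOCK_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try, ast.FunctionDef)


def chain_length(node):
    """Count the calls in a chain such as df.filter(...).groupBy(...).count()."""
    length = 0
    while isinstance(node, ast.Call):
        length += 1
        func = node.func
        node = func.value if isinstance(func, ast.Attribute) else None
    return length


def max_nesting(node, depth=0):
    """Return the deepest level of nested block statements below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, BLOCK_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


def readability_score(code):
    """Score from 1 (hard to read) to 3 (easy to read).

    Assumed aggregation: one point is lost per violated aspect, floor of 1.
    """
    tree = ast.parse(code)
    long_lines = any(len(line) > MAX_LINE_LENGTH for line in code.splitlines())
    long_chains = any(
        chain_length(n) > MAX_CHAIN_CALLS
        for n in ast.walk(tree)
        if isinstance(n, ast.Call)
    )
    deep_nesting = max_nesting(tree) > MAX_NESTING
    penalties = sum([long_lines, long_chains, deep_nesting])
    return max(1, 3 - penalties)


simple = 'result = df.filter(df["type"] == "Movie").count()'
chained = (
    'result = df.filter(df["type"] == "Movie").groupBy("country")'
    '.count().orderBy("count").limit(10)'
)
nested = "for a in x:\n    for b in a:\n        if b:\n            print(b)\n"

print(readability_score(simple), readability_score(chained), readability_score(nested))
```

Under this aggregation rule, code that violates two aspects and code that violates all three both receive the lowest score, so a finer-grained scale would need a different rule.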

3.3.1 System Architecture

Figure 3.5. The proposed system’s ﬁnal architecture overview [4].

Figure 3.5 provides a high-level overview of the ﬁnal architecture utilized in this the-

sis’s study. The process begins when the end-user submits a query in natural language. A

relevant dataset is pre-selected, allowing for the determination of the appropriate dataset

summary, as well as the main dataset file necessary for code execution within the

data processing platform. The user’s query is then combined with the dataset summary

into a single message, which is forwarded to the LLM during the prompt engineering

phase, as detailed in subsection 3.4.2. This prompt not only merges the query and the

summary, but also supplies crucial instructions to the Codestral and Qwen models.
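The merging step described above can be sketched as follows. The message layout, the `DATASET_SUMMARIES` structure, the column names, and the abbreviated `INSTRUCTIONS` string are all illustrative assumptions; the prompt actually used in the study is the one reproduced in Listing 3.2.

```python
# Hypothetical dataset summaries; column names and wording are illustrative only.
DATASET_SUMMARIES = {
    "netflix": {
        "format": "CSV with a header row",
        "columns": {
            "type": "either 'Movie' or 'TV Show'",
            "country": "country or countries of production",
            "release_year": "original release year of the title",
        },
    },
}

# Abbreviated stand-in for the instructions; see Listing 3.2 for the real text.
INSTRUCTIONS = (
    "You are a virtual assistant specialized in generating Python Spark "
    "commands. Output only Python Spark commands."
)


def summarize(dataset):
    """Render one dataset summary as plain text, one line per column."""
    info = DATASET_SUMMARIES[dataset]
    lines = [f"Dataset: {dataset} ({info['format']})", "Columns:"]
    lines += [f"- {name}: {desc}" for name, desc in info["columns"].items()]
    return "\n".join(lines)


def build_message(dataset, user_query):
    """Combine instructions, dataset summary, and user query into one message."""
    return "\n\n".join([INSTRUCTIONS, summarize(dataset), f"Query: {user_query}"])


message = build_message("netflix", "How many movies were released in 2019?")
print(message)
```

Keeping the summary separate from the instructions means the same instruction text can be reused for every dataset, with only the summary swapped out.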

Once the LLM receives the message, the response generation process begins. Dur-

ing this phase, a simple Python script monitors the computational resources of the LLM

server. After the model generates its response, it is sent back through the communication

pipeline, along with the monitoring data. The end-user then reviews the response to determine

whether the code can proceed directly to the next phase (signifying full automation),

or whether it requires minor adjustments, such as the removal of extraneous text that the

LLM did not mark as a comment. In the latter case, the test is classified as semi-automated.

Following this step, the generated code and monitoring results are consolidated into a

single JSON object (see Listing 3.1). The LLM communication pipeline then concludes by

transmitting the newly formed object to the data processing platform, which triggers the

execution of a new Python Spark job. Subsequently, the pre-deﬁned dataset is retrieved

and processed accordingly, while the received object is parsed to extract the generated

code, along with the performance monitoring results.
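The monitoring step mentioned in this paragraph can be sketched with the standard library alone. The thesis does not name the monitoring tool, so `sample_fn` below is a placeholder for whatever probe reads CPU, memory, or GPU usage on the LLM server, and `generate` stands in for the call to the model; both names, the sampling interval, and the use of a background thread are assumptions.

```python
import threading
import time


def monitor_call(generate, sample_fn, interval=0.05):
    """Run `generate()` while a background thread collects `sample_fn()` readings.

    Returns the response, the elapsed response time in seconds, and the
    average of the collected readings.
    """
    samples = []
    done = threading.Event()

    def sampler():
        # Take one reading immediately, then one per interval until stopped.
        while True:
            samples.append(sample_fn())
            if done.wait(interval):
                break

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    thread.start()
    try:
        response = generate()
    finally:
        done.set()
        thread.join()
    response_time = time.perf_counter() - start
    average = sum(samples) / len(samples)
    return response, response_time, average


def fake_generate():
    time.sleep(0.2)  # stands in for the LLM producing its answer
    return "result = df.count()"


response, response_time, avg_cpu = monitor_call(fake_generate, lambda: 42.0)
print(response, round(response_time, 2), avg_cpu)
```

Sampling in a separate thread lets the measurements cover exactly the interval in which the response is generated, which is what the response-time and average-usage fields forwarded in Listing 3.1 appear to record.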

Listing 3.1: Code from the LLM communication pipeline, preparing to forward the gener-

ated code and performance metrics to the data processing platform.

submit_spark_job_kpi(
    spark_master_ip="<spark-ip-here>",
    code_repetition_id=crid,
    dataset="netflix",
    user_query=user_message,
    generated_code=llm_response,
    # Resource usage recorded on the LLM server during response generation
    llm_response_cpu=absolute_cpu,
    llm_response_memory=absolute_memory,
    llm_response_gpu=avg_gpu,
    llm_response_response_time=response_time,
    llm_response_gpu_mem=avg_gpu_mem,
    # Uncomment the line that matches the model under test
    # llm_version="Codestral V0 1 22B Q6_K",
    # llm_version="Qwen 2.5 Coder 14B Q6_K",
    output_format="csv",
    output_directory="<provided-output-directory>" + str(repetition_number),
    # Outcome of the end-user review: "true" for full automation
    automated="true",
    # automated="false",
)
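The `automated` argument in Listing 3.1 records the outcome of the end-user’s review. In the study this review is manual; the sketch below shows how a first-pass check could be approximated in code, under the assumption that the prompt asks the model to operate on a DataFrame variable named `df` (a hypothetical name). It relies on `ast.parse` rejecting free-form prose and markdown fences, which is only a heuristic, since short fragments of prose can occasionally be valid Python.

```python
import ast


def automation_flag(llm_response, required_names=("df",)):
    """Return "true" if the response looks directly executable, else "false".

    The check is heuristic: the text must parse as Python and must refer to
    every variable name the prompt asked the model to use.
    """
    try:
        tree = ast.parse(llm_response)
    except SyntaxError:
        # Prose outside comments, or markdown fences, makes the text unparsable.
        return "false"
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return "true" if all(name in used for name in required_names) else "false"


clean = 'result = df.groupBy("country").count()'
chatty = "Here is the code you asked for:\nresult = df.count()"
renamed = "result = data.count()"

flags = [automation_flag(r) for r in (clean, chatty, renamed)]
print(flags)
```

A response that fails this check is not necessarily wrong; it only indicates that manual adjustment would be needed before execution, which corresponds to the semi-automated case.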

The next step involves the application of the generated data analysis code, which

executes the speciﬁed commands on the loaded dataset, in order to produce the desired

analysis results. Once this step is completed, the analysis results are extracted. A final

JSON object, which serves as the result template for each test, is then created. It includes the LLM server’s

monitoring metrics, the generated code, and other relevant attributes of the process. This

object, along with the analysis results from the code application step, is stored locally.

Both ﬁles are reviewed by the end-user, who assesses the analysis results to determine

whether they are accurate and align with the intended outcome, based on the context of

the original query. In either case, a functional-correctness flag is added

to the test result object: ‘True’ for correct results and ‘False’ for incorrect

ones. This step concludes the test flow, as shown in Figure 3.5. As previously mentioned,

ﬁve queries will be tested for each of the ﬁve datasets, with ten iterations per query. This

process will be repeated twice, once for each oﬄine LLM, resulting in a total of 500 tests:

250 for each model.
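The test count follows directly from the design: 5 datasets × 5 queries × 10 iterations = 250 tests per model, and 2 models give 500. A small sketch that enumerates this plan is shown below. Apart from `netflix`, which appears in Listing 3.1, the dataset identifiers and the record field names are hypothetical.

```python
from itertools import product

# Only "netflix" is taken from Listing 3.1; the other identifiers are assumed.
DATASETS = [
    "netflix",
    "covid19_twitter",
    "shared_cars",
    "madrid_weather",
    "supermarket_sales",
]
MODELS = ["Codestral 22B", "Qwen 2.5 Coder 14B"]
QUERIES_PER_DATASET = 5
REPETITIONS = 10


def build_test_plan():
    """Enumerate every (model, dataset, query, repetition) combination."""
    combos = product(
        MODELS,
        DATASETS,
        range(1, QUERIES_PER_DATASET + 1),
        range(1, REPETITIONS + 1),
    )
    return [
        {
            "llm": model,
            "dataset": dataset,
            "query_no": query_no,
            "repetition": repetition,
            # Filled in after the end-user has reviewed the analysis results.
            "functional_correctness": None,
        }
        for model, dataset, query_no, repetition in combos
    ]


plan = build_test_plan()
per_model = {m: sum(1 for t in plan if t["llm"] == m) for m in MODELS}
print(len(plan), per_model)
```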

3.3.2 Dataset Selection

In this study, ﬁve diﬀerent datasets were selected for the evaluation process. Each

dataset is brieﬂy described in the prompt message provided to the LLM, as discussed in

subsection 3.4.2. The code produced by the LLM is then executed through a PySpark job,

which loads the corresponding dataset and runs the generated code. Although Apache

Spark (the underlying framework of the proposed platform) is designed to manage large-scale

data, the size of the datasets was not a selection criterion for this part of the thesis.

The focus here is on the LLM’s ability to generate code for data analytics tasks, rather

than on processing entire datasets. The datasets are

not loaded into the LLM itself, which makes their size largely irrelevant to the evaluation.

The datasets were sourced from Kaggle and Maven Analytics’ Data Playground, both of

which provide free and publicly accessible data.

The ﬁrst dataset used is titled “Netﬂix: Movies and TV Shows”, available on Kaggle

[113]. It includes detailed records of nearly 9,000 titles available on Netﬂix, covering both

movies and television shows. Each entry is identiﬁed by a unique ID and categorized as

either a ‘Movie’ or a ‘TV Show’. The dataset includes the title, director, and cast, oﬀering

insights into the creative contributors behind each work. It also speciﬁes the country or

countries of production for each entry. The dataset provides the date each title was added

to the platform, as well as its original release year. A rating ﬁeld classiﬁes the content

based on age suitability. The duration is noted in minutes for movies and in seasons

for TV shows. Furthermore, a genre column groups titles by content type, and a short

description summarizes the plot or main theme of each entry. This dataset can support

the exploration of patterns and trends in Netﬂix’s content oﬀerings.

The second dataset selected is titled “COVID-19 Twitter” and is available on Kaggle

[114]. It contains a large collection of tweets related to the COVID-19 pandemic, gathered

during three time periods: April–June 2020, August–October 2020, and April–June 2021.

The dataset includes nearly one million tweets, each identiﬁed by a unique ID and times-

tamp. A column labeled ‘source’ indicates whether the tweet was posted using an Android

or iPhone device. For each tweet, both the original and a cleaned version of the text are

provided, along with the language in which it was written. The dataset also includes

engagement metrics, such as the number of likes (favorites) and retweets. Additional

ﬁelds identify the author of the tweet, as well as any hashtags or user mentions. A ‘place’

column records the location associated with the tweet. Furthermore, the dataset includes

sentiment analysis scores — such as compound, negative, neutral, and positive values

— which contribute to an overall sentiment classiﬁcation. This dataset is well-suited for

testing data analysis queries in the context of social media content.

The third dataset, named “Shared Cars Locations,” is also sourced from Kaggle [115].

It contains over 20 million records related to the AutoTel shared car project, which was

implemented in Tel Aviv [116]. Each entry includes geographic coordinates (latitude and

longitude) that show the exact location of shared vehicles across the city. The dataset also

indicates how many cars are available at each location, and lists the speciﬁc car identiﬁers

present. A timestamp is included for every entry, showing when the data was collected.

This dataset enables location-based analysis, and can be used to examine patterns in car

availability and usage density across diﬀerent urban areas.

The fourth dataset, titled “Madrid Daily Weather”, was obtained from Maven Analyt-

ics’ Data Playground [117]. It contains nearly 7,000 records of daily weather data for

the city of Madrid, covering the years 1997 to 2015. Each entry corresponds to a speciﬁc

date, recorded in Central European Time (CET), and includes various meteorological mea-

surements. These include the maximum, mean, and minimum temperatures in degrees

Celsius, as well as dew point values and their ﬂuctuations. Humidity levels—maximum,

mean, and minimum—are also recorded. Additional variables include sea level pressure,

visibility, and wind speed data, with maximum gust speeds noted as well. The dataset

also reports precipitation amounts, cloud cover, and the presence of weather events such

as rain, snow, or fog. Wind direction is provided in degrees. This dataset oﬀers a valuable

resource for studying weather patterns and trends in a speciﬁc urban region, over an

extended period.

The ﬁfth dataset, named “Supermarket Sales”, is available on Kaggle [118]. It contains

1,000 records of individual sales transactions from a supermarket chain, collected over

a three-month period across three diﬀerent branches. Each transaction is identiﬁed by

a unique invoice ID, and includes information about the branch and the city where the

sale took place. The dataset also records the customer type (‘Member’ or ‘Normal’) and

the gender of the customer. Products are grouped by product line, and each sale includes

the unit price, quantity purchased, and the amount of tax applied. The total amount of

each transaction is also calculated. Additionally, the dataset provides the date and time

of each purchase, the payment method used, and ﬁnancial indicators such as cost of

goods sold (COGS), gross margin percentage, and gross income. Customer satisfaction is

represented through a rating score for each transaction. This dataset supports analysis

of consumer behavior, sales trends, and ﬁnancial performance within the retail sector.

To strengthen the diversity and reliability of the evaluation process, the ﬁve datasets

were selected based on two main criteria. First, they needed to be open-source and

publicly accessible to support reproducibility. Second, they were chosen to represent

different domains, ensuring a varied dataset collection. All five datasets meet the first

criterion, and their domains cover a wide range of fields: entertainment

(Netﬂix: Movies and TV Shows), social media (COVID-19 Twitter), transportation (Shared

Cars Locations), weather (Madrid Daily Weather), and retail (Supermarket Sales). This

diverse set of datasets reﬂects the complexity of real-world data analysis tasks, and aims

to help reduce potential bias that may arise from using a narrow or homogeneous data

selection [119]. Such an approach is essential for fairly evaluating the performance of

oﬄine LLMs across diﬀerent application areas, while maintaining the reproducibility and

objectivity of the study.

These datasets form the foundation of the evaluation process for the proposed sys-

tem. From each dataset, ﬁve natural language queries will be selected. These queries,

presented in subsection 3.4.3, will be used to assess the ability of the oﬄine LLM to

generate code for data proﬁling and analysis tasks.

3.4 Technical Components

As briefly discussed in section 3.3, the study presented in this part of the thesis

involves the use of an offline large language model (LLM), selected for its ability to

generate code. Alongside the model, a data processing platform has been developed to

manage datasets eﬀectively and execute the code produced by the LLM. This platform

is built on the scalable data management layer, which was presented in

section 3.2. The interaction between the language model and the processing platform is

facilitated through a carefully constructed prompt, designed as part of the study’s prompt

engineering process. The evaluation includes a set of dataset-speciﬁc queries, which are

organized by dataset and assessed based on their level of complexity.

3.4.1 LLMs and Data Processing Platform

The oﬄine LLMs chosen for this study are Codestral by Mistral AI [52] and Qwen 2.5

Coder by Alibaba [53]. Codestral is a 22-billion parameter model developed speciﬁcally

for generating code. It has been trained on more than 80 programming languages,

ranging from widely used languages such as Python, Java, and C++, to more specialized

ones like Fortran and Swift. This broad training enables it to handle a wide range of

programming tasks across various environments. Codestral has demonstrated strong

performance on several standard benchmarks, including HumanEval [120], RepoBench

[121], and CruxEval [122], excelling in both code completion and error reduction. It

currently outperforms other major models, such as CodeLlama 70B [123], DeepSeek

Coder 33B [124], and LLaMA 3 70B [125], on most evaluation tasks [126]. Additionally,

Codestral oﬀers a large context window of 32,000 tokens, which allows it to process longer

and more complex codebases. This feature is especially valuable for tasks involving full-

code repositories and long-range dependencies.

Qwen 2.5 Coder, developed by Alibaba, is a specialized model family for code-related

tasks. The 14-billion parameter version used in this study delivers competitive perfor-

mance across many code generation benchmarks, even outperforming larger models like

this study’s Codestral 22B and DeepSeek Coder 33B [124] in several areas [127]. Based

on the Qwen 2.5 architecture, this model has been trained on over 5.5 trillion tokens, col-

lected from public code repositories and web sources. It supports various programming

languages, and shows strong capabilities in both general reasoning and mathematical

problem-solving. Qwen 2.5 Coder 14B has achieved notable results in evaluations such

as HumanEval, MBPP, and BigCodeBench, demonstrating strengths in code generation,

completion, and repair. Its extensive training and instruction tuning make it suitable for

use in both coding assistants and practical applications. It is also worth mentioning that

the larger 32B version of Qwen 2.5 Coder has outperformed even GPT-4o in some benchmark

tests [128]. However, the 14B variant was selected for this study, based on the hardware

limitations of the system used.

Codestral and Qwen exhibit strong performance characteristics, establishing a high

standard in code generation, while maintaining low response times. This is an essen-

tial factor for developers who rely on quick feedback during the development process.

Benchmark results highlight the eﬀectiveness of both models in tasks such as Python

and SQL code generation, as well as ﬁll-in-the-middle (FIM) operations, which are crucial

for eﬃciently completing partial code segments. Their open-weight availability, under the

Mistral AI Non-Production and Apache 2.0 licenses (respectively), enables straightforward

oﬄine deployment, making them suitable for research and non-commercial applications

without requiring constant internet access. This is particularly advantageous for devel-

opers and institutions with data privacy concerns, or limited online connectivity. The

models’ combination of speed, precision, multilingual support, and oﬄine accessibility,

makes them well-suited for this study’s focus on oﬄine code-oriented language models.

Continuing from this analysis, despite having fewer parameters compared to some

larger models, both Codestral and Qwen consistently achieve strong results among oﬄine

LLMs. Speciﬁcally, the Qwen 2.5 Coder 7B model frequently surpasses the performance

of CodeLlama 70B and, in some cases, outperforms Codestral 22B. Given these out-

comes, comparing them with smaller versions of already outperformed models — such as

lower-parameter variants of CodeLlama — would oﬀer limited scientiﬁc value and could

introduce methodological inaccuracies. Furthermore, the hardware limitations in this

study restrict the evaluation of models exceeding 30 billion parameters. These factors

collectively justify the selection of Codestral 22B and Qwen 2.5 Coder 14B, as their pa-

rameter sizes and demonstrated capabilities enable a thorough and balanced evaluation

within the scope of the available computational resources.

It is important to acknowledge that, given the rapid advancements in LLM research,

newer models may eventually outperform Codestral and Qwen. As a result, ongoing

evaluation of emerging architectures remains essential. This thesis’s study investigates

the broader capabilities of oﬄine LLMs in the context of data science, using Codestral

and Qwen as representative examples. Since the methodology is not tied to any speciﬁc

model, the ﬁndings and approaches presented here can be extended to future models as

well. The decision to focus on oﬄine LLMs, as opposed to online alternatives, is driven

by several factors. Primarily, oﬄine models allow local data processing, which enhances

privacy and security, while supporting compliance with data protection regulations. This

setup also reduces the risks associated with transmitting sensitive information to external

servers.

In addition, oﬄine LLMs oﬀer greater ﬂexibility and control, as they can be ﬁne-tuned

for speciﬁc tasks and seamlessly integrated into tailored software systems. These capabil-

ities are often restricted in commercial, cloud-based solutions. From a cost perspective,

oﬄine models can also be more economical over time, as they do not incur recurring

subscription fees or usage-based charges. Their independence from internet access im-

proves reliability and ensures low latency in time-sensitive applications, while also insu-

lating users from potential disruptions caused by changes in third-party service oﬀerings.

Moreover, oﬄine deployment supports ethical and legal alignment, allowing institutions

to adapt models to comply with regional laws and standards. Finally, oﬄine LLMs create

an open environment for experimentation and innovation, enabling researchers to explore

new ideas without the constraints imposed by proprietary platforms.

Although online LLMs are often recognized for their high accuracy and quick response

times — beneﬁts largely due to their cloud-based, dynamically updated infrastructure —

these aspects were not the primary concern of this research. As previously discussed, a

key objective of this study is to ensure that both data and associated metadata remain

within a locally controlled environment. This approach prioritizes privacy, regulatory

compliance, and full control over the data processing pipeline. While cloud-hosted models

may oﬀer performance beneﬁts in certain scenarios, their dependence on external servers

and stable internet connectivity introduces uncontrolled variables that are beyond the

intended scope of this research.

As outlined earlier, the offline LLM in this framework is designed to return code-based

responses, which are then executed within a data processing platform built on the

scalable data management layer presented in section 3.2. For the purposes of this study,

Apache Spark [47] is selected as the most suitable tool for managing and executing data

operations. As justified earlier, Apache Spark is a powerful

platform optimized for high-performance data processing. It supports in-memory compu-

tation, signiﬁcantly improving execution speed, particularly for tasks involving repeated

iterations. Spark is highly scalable, capable of eﬃciently handling large datasets across

distributed computing clusters. Its ﬂexibility is another major advantage, as it supports

a broad range of processing tasks, including batch processing, real-time data streaming,

machine learning, and graph analytics, making it a comprehensive solution for scalable

data workﬂows.

When integrated with Python through the PySpark interface, Apache Spark becomes

more user-friendly. PySpark allows users to leverage the simplicity and ﬂexibility of

Python, while utilizing Spark’s powerful distributed computing capabilities. This integration

enables efficient execution of a variety of tasks, including complex data transformations,

machine learning workflows, and the processing of large-scale datasets, all within

the familiar Python ecosystem, which beneﬁts from extensive libraries and strong com-

munity support. In the context of this thesis’s study, the code generated by the language

model is executed through PySpark jobs, running directly within the Spark environment.

Apache Spark has been widely adopted in numerous big data management frameworks,

due to its scalability, eﬃcient resource allocation, and straightforward deployment pro-

cess. These features make it a reliable choice for enabling eﬀective data management and

analysis across diﬀerent domains [129] [1].

3.4.2 Prompt Formulation

In the oﬄine setting of this study, the main optimization is achieved through the

careful design of prompts, which provides clear and detailed instructions to the language

model. Since the models used are Instruct variants, they are speciﬁcally built to respond

more eﬀectively to well-structured prompts, which helps improve their performance and

the accuracy of the generated commands. Because a well-designed prompt clearly states the task and offers

the necessary context, this approach helps ensure that the model’s output is both relevant

and suited to the dataset being used. As such, the emphasis on prompt design is a

strategic choice for enhancing performance in an oﬄine environment, where conventional

optimization methods are limited.

Beyond improving performance, the structured prompt design also helps reduce the

likelihood of biased outputs. This is achieved by supplying explicit guidance and suﬃcient

context to the model. Combined with the use of a varied dataset — covering a range

of domains — this approach reduces ambiguity and lowers the chances of unintended

results. Additionally, the current study’s repeated testing process allows for consistent

evaluation of the outputs, making it easier to detect and correct any deviations or potential

biases.

Based on this context, in order to improve the code generation capabilities of the

Codestral and Qwen language models, a speciﬁcally designed prompt message has been

developed to initiate interaction with each model, as shown in Listing 3.2. Each time

a user submits a natural language query, it is combined with this prompt message to

guide the model in producing accurate and relevant code. This method follows the ‘few-shot’

prompt engineering strategy, in which the prompt supplies a worked input-output example together with

key details about the dataset being analyzed. The prompt includes a concise overview of the dataset’s

structure, format, and columns, along with descriptions of each column. This ensures

that the code generated is eﬀectively tailored to the speciﬁc characteristics of the dataset.

The full prompt message is presented below:

Listing 3.2: The prompt message authored for this study’s interaction with each LLM.

system_message = (
    "<s>[INST]You are a virtual assistant specialized in generating Python Spark commands based "
    "on the context provided below. Your task is to output only Python Spark commands. Do not "
    "add additional text or explanation. Strictly use only commands that can be applied to a Spark "
    "Dataframe. Refer to the Spark DataFrame as 'df', and store the results in the Spark Dataframe "
    "named 'processeddf'. Context:[/INST]"
    # dataset_summary holds the summary of the dataset currently in use
    + str(dataset_summary)
    + "[INST]Example input: 'filter the data where temperature is in the bottom 20% of the total "
    "temperature values'. Example output: 'processeddf = df.filter(df[\"Temperature\"] <= "
    "df.approxQuantile(\"Temperature\", [0.2], 0.0)[0])'. Ensure that no backslashes or escape "
    "characters are included in the generated commands. Generate efficient commands, using as "
    "few steps as possible. Separate commands with a semicolon or newline. If you provide text or "
    "explanation, do it using python commenting, with '#'. If the input is completely "
    "incomprehensible, respond with 'LLMERROR:'. This should be rare. Otherwise, generate "
    "only the required Python Spark commands.[/INST]</s>"
)
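As a minimal sketch of how a user query might be combined with this prompt message, the snippet below assembles the full model input. The helper name `build_prompt`, the shortened instruction text, the toy `dataset_summary`, and the choice to wrap the query in a final `[INST]` block are all assumptions made for illustration; the exact concatenation used in the study is not specified here.

```python
import json


def build_prompt(system_message: str, user_query: str) -> str:
    """Append the user's natural-language query to the fixed prompt message.

    Assumption: the query is wrapped in its own [INST] ... [/INST] block.
    """
    query = user_query.strip()
    if not query:
        raise ValueError("user_query must not be empty")
    return f"{system_message}[INST]{query}[/INST]"


# Toy stand-in for one of the dataset summaries (hypothetical content).
dataset_summary = {
    "dataset": "Madrid Daily Weather",
    "columns": [{"name": "Temperature", "type": "int64"}],
}

# Shortened version of the prompt message, kept brief for readability.
system_message = (
    "<s>[INST]You are a virtual assistant specialized in generating Python Spark "
    "commands based on the context provided below. Context:[/INST]"
    + json.dumps(dataset_summary)
    + "[INST]Generate only the required Python Spark commands.[/INST]</s>"
)

prompt = build_prompt(system_message, "Filter days with temperature above 30")
print(prompt)
```

Keeping the fixed instructions and the per-request query in separate arguments makes it straightforward to reuse the same prompt message across all queries of a dataset.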

The prompt message presented includes both a series of instructions directed at the

language model, as well as contextual information (clearly speciﬁed as ‘context’) related

to the dataset for which the generated code is intended. The main components of the

prompt’s structure can be summarized as follows:

• Objective Definition: The prompt should explicitly state the intended goal for the

language model. In this case, the objective is to generate Python Spark commands

without any explanatory text. This helps the model recognize the speciﬁc scope and

purpose of its task. It is important to highlight that generating PySpark code intro-

duces additional complexity, as PySpark DataFrame operations diﬀer in subtle ways

from those used in Pandas. The model is also assessed on its ability to distinguish

between the two, and produce valid PySpark DataFrame code exclusively.

• Context Provision: The prompt must supply the language model with adequate context

to perform the task accurately. This involves summarizing the structure and content of the

dataset that the model will use. Through the inclusion of this background

information, the prompt equips the model to generate suitable outputs, specifically

targeting a Spark DataFrame named ‘df’.

• Instruction Clarity: The prompt must include clear and specific instructions, outlining

both what the model is expected to do and what it should avoid. For instance,

it is instructed to generate only Python Spark commands, without any accompany-

ing explanations. The prompt also directs the model to separate commands using

either semicolons or newlines. Additionally, it provides guidance on how to handle

potential errors and emphasizes the need for concise and eﬃcient code generation.

• Examples Inclusion: Including illustrative examples within the prompt helps demonstrate

the expected input-output format. These examples assist the model in understanding

the style, structure, and type of commands required. With the inclusion

of representative samples, the prompt helps align the model’s responses with the

intended output format and content.

• Error Handling: The prompt should include clear instructions for situations where

the input cannot be understood. This provides the model with a predeﬁned strat-

egy for managing unexpected or ambiguous input. By addressing error handling

within the prompt, the model is better equipped to respond appropriately when it

encounters input it cannot interpret, thus improving the reliability of its responses.

• Message Formatting: The prompt should comply with any formatting rules required

by the speciﬁc language model or system in use. In this case, the prompt includes

formatting elements, such as the ‘[INST]’ tag, to ensure proper interpretation by the

model. Adhering to these conventions supports accurate parsing of the instructions,

and contributes to more consistent and reliable code generation.
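The output rules stated in the prompt (commands separated by semicolons or newlines, explanations only as ‘#’ comments, and an ‘LLMERROR:’ marker for incomprehensible input) suggest a simple post-processing step on the model’s response. The following is a minimal sketch of such a step; the function name `parse_model_output` and the decision to raise an exception on the error marker are assumptions, not a description of the study’s implementation.

```python
from typing import List

ERROR_PREFIX = "LLMERROR:"  # error marker exactly as written in the prompt message


def parse_model_output(raw: str) -> List[str]:
    """Split raw model output into individual commands.

    Deliberately naive: a semicolon inside a string literal would also split.
    """
    text = raw.strip()
    if text.startswith(ERROR_PREFIX):
        raise ValueError(f"model reported incomprehensible input: {text}")
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        commands.extend(part.strip() for part in line.split(";") if part.strip())
    return commands


raw_output = (
    "# keep only the hottest days\n"
    "hot = df.filter(df['Temperature'] > 30); processeddf = hot.limit(10)"
)
commands = parse_model_output(raw_output)
print(commands)
```

A production version would need a tokenizer-aware split so that semicolons inside string literals are not treated as separators.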

As shown in the middle section of Listing 3.2, the prompt message also includes a

variable named ‘dataset_summary’. This variable holds a summary of the dataset being

used at each instance. To support this, ﬁve dataset summaries have been carefully

prepared, ensuring they accurately capture the key characteristics of each dataset. Each

summary follows a uniform format, beginning with an introductory statement about the

dataset, followed by a detailed description of each column. This includes the column

name, data type, a brief explanation, and representative sample values. Additionally,

a note is included to clarify that the sample data is synthetic and provided solely for

illustrative purposes. This structured approach is intended to help the language models

better understand the dataset’s format and content, facilitating more accurate analysis.

An example of a column from the “Shared Cars Locations” dataset is provided in Listing

3.3 below:

Listing 3.3: The ‘timestamp’ column of the “Shared Cars Locations” dataset, part of the

dataset’s summary.

{
    "name": "timestamp",
    "type": "object",
    "simplified_type": "datetime",
    "description": "Timestamp of the data record in UTC.",
    "sample_values": [
        "2019-12-06 21:51:02 UTC",
        "2019-11-29 14:00:02 UTC",
        "2019-09-20 13:21:03 UTC"
    ],
    "datetime_format": "%Y-%m-%d %H:%M:%S %Z"
}

In the example presented, the ‘timestamp’ column records the exact date and time

at which each data entry was captured. It is stored as an object type, with a simpliﬁed

type label of ‘datetime’ to indicate that it contains timestamp values. The description

notes that all timestamps follow Coordinated Universal Time (UTC), and the format used

is “%Y-%m-%d %H:%M:%S %Z” (year-month-day hour:minute:second timezone). The full

dataset summary for the “Shared Cars Locations” dataset is available in Appendix A.2.

This level of detail is particularly important in an oﬄine setting, as it allows the language

model to interpret the dataset structure accurately. When used as part of the overall

prompt, these summaries help the model generate code that is appropriately adapted to

the speciﬁc attributes and data formats involved.
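To illustrate why the declared format matters, the sketch below checks that the sample values of the ‘timestamp’ column can be parsed with the stated ‘datetime_format’. It uses only Python’s standard library; the helper name `parse_samples` is an assumption, and the dictionary simply mirrors the column entry described above.

```python
from datetime import datetime

column_summary = {
    "name": "timestamp",
    "type": "object",
    "simplified_type": "datetime",
    "description": "Timestamp of the data record in UTC.",
    "sample_values": [
        "2019-12-06 21:51:02 UTC",
        "2019-11-29 14:00:02 UTC",
        "2019-09-20 13:21:03 UTC",
    ],
    "datetime_format": "%Y-%m-%d %H:%M:%S %Z",
}


def parse_samples(summary: dict) -> list:
    """Parse every sample value with the format declared in the summary.

    strptime raises ValueError if a sample does not match the format.
    With %Z alone, the returned datetime objects are naive (no tzinfo).
    """
    fmt = summary["datetime_format"]
    return [datetime.strptime(value, fmt) for value in summary["sample_values"]]


parsed = parse_samples(column_summary)
for value in parsed:
    print(value.isoformat())
```

A mismatch between the samples and the declared format would surface here as an exception, before the summary is ever placed into a prompt.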

In summary, the prompt is designed to guide the model toward producing only Python

Spark commands, ensuring that the output is directly executable within the Spark en-

vironment for data processing tasks. The focus is placed on clarity, precision, and the

exclusion of unnecessary text. Additional instructions are included to help the model

handle errors or ambiguous input eﬀectively. This strategy enhances the quality of inter-

action with the language model, contributing to more eﬃcient and reliable code genera-

tion. The described prompt message functions as a key component in the communication

ﬂow between the language model and the Spark-based data management and analysis

platform.
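Because the prompt fixes the names ‘df’ and ‘processeddf’, the generated commands can in principle be run in a namespace that exposes the input under ‘df’ and reads the result back from ‘processeddf’. The sketch below shows this idea with a plain Python list standing in for a Spark DataFrame, so that it runs without a Spark installation. The function `run_generated` and the use of `exec` are assumptions for illustration only; they are not presented as the mechanism used in this study.

```python
def run_generated(code: str, df):
    """Execute generated commands with `df` in scope and return `processeddf`.

    Caution: exec() runs arbitrary code, so a real deployment would need
    sandboxing and validation of the generated commands.
    """
    namespace = {"df": df}
    exec(code, namespace)
    if "processeddf" not in namespace:
        raise KeyError("generated code did not define 'processeddf'")
    return namespace["processeddf"]


# Plain-Python stand-in for a Spark DataFrame, used only to keep the sketch runnable.
rows = [{"Temperature": 12}, {"Temperature": 31}, {"Temperature": 35}]
generated = "processeddf = [r for r in df if r['Temperature'] > 30]"

result = run_generated(generated, rows)
print(result)
```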

3.4.3 Dataset Queries

As described in section 3.3, the queries developed for this study are grouped into

three categories: Basic, Intermediate, and Advanced. A total of 25 queries were created,

ﬁve for each of the ﬁve datasets. Each dataset includes one Basic, two Intermediate, and

two Advanced queries. This classiﬁcation is based on the complexity of the operations

and data transformations required to answer each query. Due to the lack of a universally

accepted standard for deﬁning query diﬃculty, existing research was reviewed to support

the development of a safe and practical three-level classiﬁcation system [130] [131].

While this categorization involves some degree of subjectivity, it was reﬁned through an

iterative process to ensure that each level accurately reﬂects a distinct degree of challenge

for the language model. As noted earlier, to mitigate potential bias, each query is executed

ten times, and its performance is assessed using a set of evaluation metrics. These

include functional correctness, readability, eﬃciency, contextual relevance, automation

capabilities, and error handling. This structured methodology aligns with established

experimental practices in the ﬁeld, where tasks are grouped by complexity to allow for a

more comprehensive evaluation of system performance [132] [133]. The complete list of

queries, organized by dataset, is presented below:

For the “Netﬂix: Movies and TV Shows” dataset, the following queries were designed

to evaluate diﬀerent aspects of the LLM’s code generation capabilities:

1. “Count the Number of TV Shows per Country.” (Basic): This query evaluates whether

the LLM can correctly produce code for group-by operations and basic aggregation.

2. “Find the Average Duration of only the Movies, in minutes.” (Intermediate): This

task checks the LLM’s ability to perform arithmetic calculations and manage mixed

data types, such as extracting numerical values from text.

3. “Return the Top 5 Most Frequent Genres for Movies only, Released After 2010.”

(Intermediate): This query involves ﬁltering, counting, and sorting operations, along

with managing multiple values stored within a single column, adding moderate

complexity.

4. “Identify the top ﬁve directors who have worked with the greatest number of diﬀerent

actors.” (Advanced): This query tests the LLM’s ability to handle more complex data

processing, including data transformation, aggregation, and sorting, to ﬁnd the

most collaborative directors.

5. “Provide the top 10 most busy actors in solely American Movies, from 1995 on-

wards.” (Advanced): This query involves ﬁltering American movies from 1995 on-

wards, transforming the cast column to separate actor names, and aggregating the

data to identify the ten most frequently appearing actors.

As for the “COVID-19 Twitter” dataset, the following queries were formulated to assess

diﬀerent capabilities of the LLM in code generation:

1. “Determine the Number of Tweets Containing User Mentions.” (Basic): This query

evaluates the LLM’s ability to generate code that ﬁlters rows based on non-null

values in the ‘user_mentions’ column, and performs a count.

2. “Calculate the top 7 authors with the highest retweet count.” (Intermediate): This

query involves grouping data by the original author, summing retweet counts, and

identifying the top 7 authors with the highest totals.

3. “Analyze the Daily Tweet Volume Over Time.” (Intermediate): This query checks

whether the LLM can produce code that groups tweets by date, and counts the

number of tweets per day (thus performing temporal analysis).

4. “Provide the names of the top 5 users mentioned in tweets, every month (also include

the corresponding month and year).” (Advanced): This query requires code that

converts timestamps to extract month and year, expands the list of user mentions,

groups by month and mentioned user, counts the mentions, and selects the top 5

users for each month.

5. “Provide the top 5 weeks (and their years) with the most dense tweets posted, in

terms of total clean words included” (Advanced): This query tests the LLM’s ability

to extract week and year from timestamps, compute word counts for each tweet,

group the data accordingly, and identify the top 5 weeks with the highest total word

count.

Regarding the “Shared Cars Locations” dataset, the queries crafted are the following:

1. “Filter Locations with More Than 3 Cars.” (Basic): This query evaluates the LLM’s

ability to generate code for simple ﬁltering, based on numerical thresholds.

2. “Find the Top 5 Locations with the Most Cars Recorded.” (Intermediate): This query

involves computing the total number of cars per location and ranking them to iden-

tify the top ﬁve.

3. “List 100 Records at most, from December 2019 to January 2020.” (Intermediate):

This query checks the LLM’s handling of date-based ﬁltering using a deﬁned time

range, along with limiting the output to 100 entries.

4. “Provide the number of the most dense week, and year, in terms of total cars parked,

along with the number of total cars.” (Advanced): This query assesses the LLM’s

ability to process timestamps, extract week and year values, perform weekly aggre-

gations of parked cars, and identify the week with the highest total.

5. “Find the total number of (unique) cars that visited each location.” (Advanced): This

query tests the LLM’s capacity to preprocess data, expand lists of car identiﬁers,

group data by location, and calculate the number of unique cars associated with

each one.

For the “Madrid Daily Weather” dataset, the following queries were used:

1. “Filter Days of 2006 with Max Temperature Above 30 °C.” (Basic): This query tests

the LLM’s ability to generate code for basic ﬁltering, based on numerical temperature

values.

2. “Count the Number of Foggy Days per Year” (Intermediate): This query evaluates

the LLM’s capacity to extract the year from a date column, identify foggy days by

checking for the term “Fog” in the events column, and group the data by year to

count the occurrences.

3. “Calculate the monthly average of mean wind speed and mean sea level pressure,

per year” (Intermediate): This query involves extracting month and year from the

date, grouping the data accordingly, and calculating average values for wind speed

and sea level pressure.

4. “Analyze the Yearly Variation in Average Humidity and Identify the Top 5 Years

with the Highest Increase in Average Humidity Compared to the Previous Year.”

(Advanced): This query requires code that extracts the year, computes the yearly

average humidity, calculates the change from one year to the next, and identiﬁes

the ﬁve years with the greatest increase.

5. “Analyze the Monthly Variation in Temperature Range and Identify the Top 3 Months

with the Highest Average Range.” (Advanced): This query tests the LLM’s ability

to calculate the daily temperature range, group the data by month, compute the

monthly variations, and determine the three months with the highest average range.

It thus tests the LLM’s ability to produce code for handling complex calculations,

group-by operations, and sorting.

When it comes to the “Supermarket Sales” dataset, the ﬁve queries crafted are pre-

sented below:

1. “Count the Number of Sales per Product Line” (Basic): This query tests the LLM’s

ability to generate code for grouping by product line, and applying an aggregate

function to count the total number of sales.

2. “Find the Average Rating for Each Payment Method” (Intermediate): This query

involves grouping the data by payment method and calculating the average rating

for each group, introducing slightly more complexity.

3. “Find the average quantity of Electronic accessories purchased by card, for each city”

(Intermediate): This query assesses the LLM’s capability to ﬁlter data for electronic

accessories bought with a credit card, group it by city, and compute the average

purchase quantity per group.

4. “Analyze the Sales Performance by Branch and Payment Method” (Advanced): This

query requires the generation of code that groups data by both branch and payment

method, calculates total sales, and computes average ratings, thereby testing multi-

level grouping and multiple aggregation operations.

5. “Calculate the Correlation Between Unit Price and Quantity Sold for Each Prod-

uct Line” (Advanced): This query evaluates the LLM’s ability to perform statistical

analysis within grouped data, by calculating the correlation between unit price and

quantity for each product line.

By reviewing the outlined set of queries, it becomes evident that task complexity

increases progressively, moving from basic data proﬁling to more advanced analytical

operations. The initial queries require the LLMs to perform simple code generation tasks,

while later ones demand more intricate logic and processing. Consequently, Codestral

and Qwen will be assessed based on their capability to generate code for a range of

data analysis tasks, all provided in natural language. Each model will be evaluated

independently. This approach aims to examine not only their eﬀectiveness in handling

various levels of task complexity, but also their ability to adapt to diﬀerent analytical

requirements.
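The structure of the evaluation can be summarized in a short sketch: five datasets, five queries per dataset (one Basic, two Intermediate, two Advanced), ten executions per query, and two models assessed independently. The code below enumerates this plan. The variable names are illustrative, and the total of 500 generation runs is derived under the assumption that every query is executed ten times for each model.

```python
from collections import Counter
from itertools import product

DATASETS = [
    "Netflix: Movies and TV Shows",
    "COVID-19 Twitter",
    "Shared Cars Locations",
    "Madrid Daily Weather",
    "Supermarket Sales",
]
# Difficulty of queries 1 to 5, identical for every dataset.
LEVELS_PER_DATASET = ["Basic", "Intermediate", "Intermediate", "Advanced", "Advanced"]
MODELS = ["Codestral", "Qwen"]
RUNS_PER_QUERY = 10

# One (dataset, query number, level) tuple per query.
queries = [
    (dataset, number, level)
    for dataset in DATASETS
    for number, level in enumerate(LEVELS_PER_DATASET, start=1)
]
level_counts = Counter(level for _, _, level in queries)

# Every model runs every query RUNS_PER_QUERY times (assumption, see above).
plan = list(product(MODELS, queries, range(1, RUNS_PER_QUERY + 1)))

print(len(queries), dict(level_counts), len(plan))
```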

Chapter 4

Testing and Results: Assessing the Study’s Per-

formance

4.1 Validating the Scalable Capabilities of the Underlying In-

frastructure

The scalable data management layer of this thesis has been carefully structured

to complete the entire workflow within a short timeframe, which depends

on the size of the incoming datasets. The framework has been validated using the architecture

previously described in section 3.2. Three main datasets were employed for

the evaluation process. The ﬁrst is a small tabular dataset consisting of approximately

31,200 records, selected to demonstrate the layer’s eﬃciency in terms of processing speed

and data storage. It is important to note, however, that this dataset does not qualify as

“big data”. To address this, a second main dataset was utilized, comprising 64.1 million

JSON objects and totaling 55.4 GB in size. This dataset, titled “urn-ngsi-ld-ITI-Customs”,

contains records of items processed through the customs oﬃce at the port of Valencia,

Spain over speciﬁc time periods. In essence, each JSON object in the dataset represents

information about a particular item that passed through customs. The structure of a

typical JSON object from this dataset is illustrated in Figure 4.1 below:

Figure 4.1. A JSON object from the “urn-ngsi-ld-ITI-Customs” dataset, containing informa-

tion for a given port’s customs item [1].

To thoroughly assess the performance of the scalable data management layer, exper-

iments are carried out using both the full Customs dataset and three of its subsets. The

ﬁrst subset includes data only from November 2022, and is approximately 1.4 GB in size.

The second subset consists of 6 million records, totaling 5.2 GB, while the third contains

11 million records with a size of 9.5 GB. Finally, the layer is evaluated using the complete

Customs dataset, which comprises 64.1 million records and has a total size of 55.4 GB

(as previously outlined). An overview of these Customs datasets used in the evaluation is

presented in the screenshots below:

To further evaluate the scalable data management layer, an additional main dataset

from the critical infrastructure Communications sector was used. Speciﬁcally, the telecom-

munications provider OTE Group [134] supplied three subsets, each covering a full year

from 2019 to 2021, related to user mobility within the cellular network in the region

of Thessaloniki, Greece. As a result, the proposed framework is tested across both the

Transportation sector (ports) and the Communications sector (telecommunications). All

three subsets underwent thorough anonymization by OTE’s data engineering team. Each

JSON document in the subsets represents a speciﬁc instance of a user’s movement from

one network cell to another. Along with mobility data, the records include the number of

incoming and outgoing text messages and phone calls. They also include contextual de-

tails, such as whether the timestamp falls on a public holiday, and the number of distinct

users present in the current cell at that moment. An example JSON document is shown

in Figure 4.2.

Each of the three original subsets is labeled as “urn-ngsi-ld-OTE-MobilityData-XXXX”,

where “XXXX” corresponds to the respective year (2019, 2020, and 2021). The 2019

dataset includes 6.9 million documents and occupies 2.8 GB of disk space. The 2020

dataset contains 10.1 million documents and has a size of 4.1 GB, while the 2021 dataset

consists of 5.7 million documents totaling 2.4 GB. The main dataset has been created

by merging the three annual subsets. This combined dataset is used for performance

evaluation, allowing the scalable data management layer to be tested on a larger data

volume. It contains 22.8 million documents and requires 9.2 GB of disk space. Details

regarding the size and structure of each dataset are illustrated in the screenshots below,

following Figure 4.2.

Figure 4.2. A JSON object that contains information for a random, anonymized OTE cellular

network user in the area of Thessaloniki, Greece [1].
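As a quick arithmetic check on the merged Mobility dataset, the sketch below sums the reported per-year figures and compares them with the reported totals. Since every figure is quoted to one decimal place, the sums are only expected to agree with the totals to within rounding; the tolerance of 0.2 used here is an assumption that reflects four rounded values.

```python
# (documents in millions, size in GB) as reported for each yearly subset
subsets = {2019: (6.9, 2.8), 2020: (10.1, 4.1), 2021: (5.7, 2.4)}
reported_total = (22.8, 9.2)  # merged dataset, as reported

docs_sum = round(sum(docs for docs, _ in subsets.values()), 1)
size_sum = round(sum(size for _, size in subsets.values()), 1)

# Three rounded addends plus one rounded total: at most 4 * 0.05 of drift.
TOLERANCE = 0.2
consistent = (
    abs(docs_sum - reported_total[0]) <= TOLERANCE
    and abs(size_sum - reported_total[1]) <= TOLERANCE
)

print(docs_sum, size_sum, consistent)
```

The small differences between the summed and reported values would be consistent with each figure having been rounded independently.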

The testing process began with the first main dataset, consisting of 31,200 documents.

Since the layer operates within an Apache Spark Cluster environment, the computing

resources it utilizes and the corresponding results can be monitored through the

cluster’s WebUI. Figure 4.3 presents an overview of the Spark driver’s ID, the execution

timestamp, and the resources allocated to the driver during the run of the Pre-Processing

and Filtering Tool. As explained in section 3.2, this tool serves as the initial subcomponent

of the layer, responsible for processing, ﬁltering, cleaning, and storing the dataset.

Figure 4.3. Screenshot from scalable data management layer’s Apache Spark Cluster

WebUI [1].

As shown in Figure 4.3, the driver identiﬁed as “driver-20230301132908-0011” was

responsible for loading and executing the Pre-Processing and Filtering Tool, which han-

dled the processing, ﬁltering, cleaning, and storing of the dataset. The driver is automat-

ically initiated through an API call, which can be triggered either by a software engineer,

or by an automated software agent, enabling a seamless and continuous workﬂow. Each

subcomponent or software module subsequently activates the next step in the pipeline.

For this speciﬁc task, the Spark cluster driver utilized 6 CPU cores and 5 GB of RAM. The

total execution time is available in the driver’s logs, as shown in Figure 4.4.

Figure 4.4. Screenshot from Spark cluster driver’s logs [1].
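The driver IDs quoted in this chapter appear to embed a submission timestamp followed by a counter. Assuming the pattern `driver-<yyyyMMddHHmmss>-<counter>`, which matches the IDs shown here, the sketch below extracts both parts. The function name `parse_driver_id` is hypothetical, and the time zone of the embedded timestamp is not known from the ID itself.

```python
import re
from datetime import datetime

# Assumed pattern: "driver-", 14 digits of timestamp, "-", 4 digits of counter.
DRIVER_ID = re.compile(r"^driver-(\d{14})-(\d{4})$")


def parse_driver_id(driver_id: str):
    """Return (submission time, counter) extracted from a driver ID."""
    match = DRIVER_ID.match(driver_id)
    if match is None:
        raise ValueError(f"unexpected driver ID: {driver_id!r}")
    submitted = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return submitted, int(match.group(2))


submitted, sequence = parse_driver_id("driver-20230301132908-0011")
print(submitted.isoformat(), sequence)
```

Parsing the IDs in this way makes it easy to order runs chronologically when comparing log files from several drivers.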

The log ﬁle’s title conﬁrms that the information pertains to the relevant Spark driver.

Near the end of the log, an entry indicates the total execution time required by the Pre-

Processing and Filtering Tool to complete its tasks (processing, ﬁltering, cleaning, and

storing the dataset). The entire process was completed in 0.76 minutes, approximately

45 seconds. This brief duration is notable given the range of operations performed by the

tool. Following this process, the dataset is securely stored and fully cleaned within the

Virtual Data Repository, making it readily accessible for any authorized data consumer.

Subsequently, testing continued with the Customs datasets, starting with the smaller

subset containing data from November 2022. As in the previous case, the Spark clus-

ter’s WebUI is used to monitor the operation of the layer during execution. As illus-

trated in Figure 4.5, the system automatically generated a driver with the ID “driver-

20230901183325-0000” to load the Pre-Processing and Filtering Tool and carry out the

processing, ﬁltering, cleaning, and storage of the Customs dataset’s subset. According

to Figure 4.6, the tool successfully completed the entire procedure and stored the subset

in the Virtual Data Repository within 2.1 minutes. Considering the subset’s size (1.4 GB,

approximately 1.4 million records), this indicates that the layer’s subcomponent

performed the task eﬃciently and reliably.

Figure 4.5. A screenshot from the Apache Spark Cluster’s WebUI, for the November 2022

subset [1].

Figure 4.6. Screenshot from the logs of the Spark driver “driver-20230901183325-0000”,

for the November 2022 subset [1].

The next evaluation involves the subset containing 6 million records, from the com-

plete 64 million-object Customs dataset. As before, the process is monitored via the

Spark cluster WebUI. In this case, the Pre-Processing and Filtering Tool is tasked with

applying all previously described operations to a subset of 5.2 GB in size. The Spark

driver responsible for executing this job is identiﬁed as “driver-20230901184317-0001”,

as shown in Figure 4.7 below.

Figure 4.7. A screenshot from the Spark Cluster’s WebUI, for the Customs subset contain-

ing 6 million objects [1].

The entire process, executed by the driver responsible for the operation of the Pre-

Processing and Filtering Tool, was completed in 5.24 minutes, with the subset success-

fully stored in the Virtual Data Repository, as illustrated in Figure 4.8. Similar to the

previous test involving the November 2022 subset, the framework demonstrates a no-

tably short execution time, highlighting its eﬃciency in handling larger datasets, given

the computational resource constraints.

Figure 4.8. A screenshot from “driver-20230901184317-0001” Spark cluster driver’s logs,

for the 6 million subset [1].

Before proceeding to the full Customs dataset, the ﬁnal subset — comprising 11 mil-

lion JSON objects with a total size of 9.5 GB — is evaluated. To process this subset,

the Spark cluster automatically generated the driver “driver-20230901185350-0002”, as

shown in Figure 4.9. Assuming the system operates without performance bottlenecks or

disruptions, the Pre-Processing and Filtering Tool is expected to complete the task within

a relatively short time frame.

Figure 4.9. A screenshot from Spark Cluster’s WebUI, for the Customs subset comprising

11 million objects [1].

As shown in the Spark driver’s logs in Figure 4.10, the Pre-Processing and Filtering

Tool successfully completed all required operations on the subset and stored the results

within approximately 10.5 minutes. This outcome further supports the observation that

the framework’s execution time increases approximately linearly with the size and

volume of the input dataset.

Figure 4.10. A screenshot from the logs of the Spark driver “driver-20230901185350-

0002”, for the 11 million subset [1].

The next evaluation of the scalable data management layer involved the complete

Customs dataset, consisting of 64.1 million JSON objects and totaling 55.4 GB in size.

Upon activation through the previously described API call, the Spark cluster initiated

the process by generating the driver “driver-20230901191216-0003”, as shown in Figure

4.11. The WebUI conﬁrms that the processing task was successfully launched and is

actively running.

Figure 4.11. A screenshot from the Spark Cluster’s WebUI, for the complete Customs

dataset [1].

As in previous cases, the Pre-Processing and Filtering Tool is responsible for process-

ing, ﬁltering, cleaning, and ultimately storing the dataset in the Virtual Data Repository.

The progress of this operation, along with the ﬁnal completion time, is available in the

driver’s logs, as illustrated in Figure 4.12.

Figure 4.12. A screenshot from the logs of the Spark driver “driver-20230901191216-0003”,

for the complete Customs dataset [1].

The Pre-Processing and Filtering Tool successfully completed the entire process in

under one hour, requiring 58.48 minutes to receive, process, and store the full Customs

dataset in the Virtual Data Repository. Once again, the results support the observation

that processing time scales approximately linearly with the size of the input data,

at roughly one minute per gigabyte for the larger Customs runs.
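The scaling behaviour can be examined directly from the four reported Customs runs. The sketch below computes the minutes required per gigabyte for each run and fits a straight line by ordinary least squares. The input pairs are the sizes and runtimes reported in this section (with 5.2 GB taken as the size of the 6 million record subset); the fitting helper is a plain-Python illustration rather than part of the framework.

```python
# (size in GB, runtime in minutes) as reported for the Customs runs
runs = [(1.4, 2.1), (5.2, 5.24), (9.5, 10.5), (55.4, 58.48)]


def least_squares(points):
    """Fit runtime = slope * size + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(runs)
minutes_per_gb = [round(minutes / size, 2) for size, minutes in runs]

print(f"fit: {slope:.2f} min/GB, intercept {intercept:.2f} min")
print(minutes_per_gb)
```

The smallest subset shows a noticeably higher cost per gigabyte than the others, which may reflect a fixed start-up overhead that matters less as the input grows.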

Following this, the layer was evaluated using the OTE Group Mobility datasets from

2019 to 2021, covering the region of Thessaloniki, Greece. The evaluation began with the

2019 dataset. As shown in Figure 4.13, the driver “driver-20231004114021-0000” was

responsible for executing the Pre-Processing and Filtering Tool on this dataset.

Figure 4.13. A screenshot from the Spark Cluster’s WebUI, for the Mobility subset of 2019

with 6.9 million objects [1].

The layer completed the processing, ﬁltering, cleaning, and storage of the dataset in

approximately 3.7 minutes, as indicated in the driver’s logs (Figure 4.14). This means that

2.8 GB of data, corresponding to 6.9 million documents, was successfully parsed, pro-

cessed, and stored in under four minutes, which is a notably short duration considering

the complexity and number of tasks involved.

Figure 4.14. A screenshot from the logs of the Spark driver “driver-20231004114021-

0000”, regarding the Mobility 2019 subset [1].

Next, the layer was tested using the 2020 Mobility subset. The driver assigned to

execute the entire process was “driver-20231004114951-0001”. As in previous cases,

the driver was launched automatically via an API call, which triggers the Spark

job and initiates the framework’s operation. Figure 4.15 presents the corresponding status

of the Spark cluster’s WebUI.

Figure 4.15. The Spark Cluster’s WebUI, depicting the driver “driver-20231004114951-

0001” in running state, for the 2020 Mobility subset [1].

According to the driver’s logs (Figure 4.16), the 2020 Mobility subset — comprising

10.1 million documents and totaling 4.1 GB — was successfully preprocessed, ﬁltered,

cleaned, and stored within 5.3 minutes. This is considered a highly reasonable processing

time. The results suggest that the scalable data management layer maintains consistent

eﬃciency regardless of the dataset’s speciﬁc content. However, it is important to note

that this observation currently applies to tabular-style datasets, which are commonly

produced across various critical infrastructure sectors and business organizations.

Figure 4.16. A screenshot from the “driver-20231004114951-0001” Spark driver’s logs,

for the Mobility 2020 subset [1].

Regarding the 2021 Mobility subset, it is the smallest among the Mobility sets, com-

prising approximately 5.7 million records and occupying 2.4 GB of storage space, as

previously mentioned. As usual, the set was processed using the Pre-Processing and

Filtering Tool through the execution of a speciﬁc Spark driver, identiﬁed as “driver-

20231004115659-0002”, which is illustrated in Figure 4.17.


Figure 4.17. A screenshot from the Spark Cluster’s WebUI, where the driver “driver-

20231004115659-0002” is depicted in active state, for the Mobility 2021 subset [1].

The process was completed in approximately 3.5 minutes, once again demonstrating

the eﬃciency of the scalable data management layer (Figure 4.18). For the Mobility

subsets, testing followed a chronological order rather than an order based on dataset

size, similar to the approach taken with the Customs subsets. This sequence was chosen

to reﬂect the order in which the datasets were provided by OTE Group.

Figure 4.18. A screenshot from the “driver-20231004115659-0002” Spark driver’s logs,

which handled the Mobility subset of 2021 [1].

Finally, the scalable data management layer was evaluated using the custom fused

Mobility dataset, which combines all three previously utilized datasets. As a reminder,

this combined dataset contains 22.8 million records and occupies 9.2 GB of disk space.

The Pre-Processing and Filtering Tool was applied to the dataset via a Spark driver named

“driver-20231004120403-0003”, as shown in Figure 4.19.


Figure 4.19. A screenshot from the Spark Cluster’s WebUI, where the driver “driver-

20231004120403-0003” is depicted in active state, for the complete Mobility dataset [1].

According to the driver’s logs shown in Figure 4.20, the entire process was completed

in less than 11 minutes (speciﬁcally, 10.78 minutes). This outcome further underscores

the operational eﬃciency of the proposed scalable layer, and suggests that the framework

remains eﬀective regardless of the source of the critical infrastructure or business domain

dataset.

Figure 4.20. A screenshot from the Spark driver’s “driver-20231004120403-0003” logs,

as it ceased its operation on the complete Mobility dataset [1].

With this, the performance evaluation phase is concluded. The results clearly indicate

that processing time increases linearly with the size of the input data. This trend is

illustrated in Figures 4.21, 4.22, and 4.23, which present the ﬁndings using three scatter

plots. The ﬁrst plot displays the processing time (in minutes) for each of the ITI Customs

and OTE Mobility datasets, represented by separate lines, in relation to their respective

sizes (in GBs). The second plot consolidates all tests into a single line. It is important to

note that the ﬁnal Customs dataset, which is 55.4 GB in size, is excluded from the ﬁrst

and second plots, since its inclusion would drastically aﬀect the depiction of the other


sets’ smaller values. The third and ﬁnal plot includes the ﬁnal Customs dataset as well.

All three visualizations are provided below.

Figure 4.21. Scatter plot with the tests of ITI Customs and OTE Mobility subsets, in

diﬀerent plot lines [1].

Figure 4.22. Scatter plot including all tests, in the same plot line [1].

Figure 4.23. Scatter plot including the ﬁnal ITI Customs dataset [1].


The linear relationship between completion time and input size is evident in all three plots. More specifically, approximately 1 GB of data is processed, filtered, cleaned, and stored for every minute that elapses, so the completion time (in minutes) increases in proportion to the dataset size (in GB). This linear growth had also been set as the “acceptance threshold” for a successful implementation: the framework was expected to operate at a rate of “1 GB per minute”, a goal that was achieved. Furthermore, the scalable data management layer showed no performance degradation and completed all tasks and tests; the system remained operational regardless of the input size. Additionally, each Spark driver utilized only 5 GB of RAM and 6 CPU cores from the system, underscoring the efficiency of the proposed architecture.
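As a rough cross-check of the reported rate, the sketch below recomputes the throughput of the four Mobility runs quoted earlier in this section. The sizes and durations are taken from the text; the Customs runs are not repeated here, and the helper names are illustrative rather than part of the framework's code. For these four runs the ratio comes out at roughly 0.7 to 0.85 GB per minute, which is of the same order as the nominal target of 1 GB per minute.

```python
# (size in GB, duration in minutes) of the Mobility runs, as reported in the text.
MOBILITY_RUNS = {
    "2019": (2.8, 3.7),
    "2020": (4.1, 5.3),
    "2021": (2.4, 3.5),
    "fused": (9.2, 10.78),
}


def throughput_gb_per_min(size_gb, minutes):
    """Average processing rate of a single run."""
    return size_gb / minutes


def slope_through_origin(runs):
    """Least-squares slope (minutes per GB) of a line forced through the origin."""
    runs = list(runs)
    numerator = sum(size * minutes for size, minutes in runs)
    denominator = sum(size * size for size, _ in runs)
    return numerator / denominator


if __name__ == "__main__":
    for name, (size, minutes) in MOBILITY_RUNS.items():
        rate = throughput_gb_per_min(size, minutes)
        print(f"Mobility {name}: {rate:.2f} GB/min")
    slope = slope_through_origin(MOBILITY_RUNS.values())
    print(f"Fitted slope: {slope:.2f} min/GB")
```

Forcing the fitted line through the origin is a simplification; a fit with an intercept would separate fixed start-up cost from the per-GB cost.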

4.2 Evaluating the Final System

Testing of the complete architecture was performed on a single physical machine to ensure consistency and eliminate any potential hardware variability. Each of the 250 tests was closely monitored, and the generated results were carefully evaluated. Each query was executed exactly 10 times, with a few exceptions caused by minor human errors that led to the premature termination of some tests. For instance, on some occasions the author of this thesis, who was responsible for the experiments, inadvertently failed to change the dataset in the test’s initial settings and thus applied a natural language query intended for a different dataset. Additionally, some tests were repeated without updating the test iteration number, so the new results overwrote the previous ones. The generated code was executed on the retrieved dataset using Python’s exec() function, with the Processing Platform’s code specifically looking for a “processeddf” variable, where the LLM was instructed to store the final analysis results. Intervention in the LLM-generated code occurred only when the model’s response included text not formatted as Python comments, or when the model failed to store the final analysis results in the designated “processeddf” variable, despite the rest of the code being correct and functional.
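The execution step described above can be sketched as follows. This is a minimal illustration and not the Processing Platform's actual code: the function name `run_generated_code` and the error message format are assumptions, while the `processeddf` variable name and the use of exec() come from the text. Since exec() runs arbitrary code, a real deployment would need some form of isolation.

```python
def run_generated_code(code, df, result_name="processeddf"):
    """Run LLM-generated code against `df` and return (result, error).

    `error` is the string "None" on success, mirroring the
    "code_execution_errors" field of the collected results.
    """
    namespace = {"df": df}  # the generated code is expected to read `df`
    try:
        exec(code, namespace)  # untrusted code: isolate this in practice
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    if result_name not in namespace:
        # The code ran, but the model did not store its result as instructed.
        return None, f"variable '{result_name}' was not defined"
    return namespace[result_name], "None"


if __name__ == "__main__":
    # A plain list of dicts stands in for the retrieved dataset here.
    rows = [{"type": "TV Show"}, {"type": "Movie"}]
    code = "processeddf = [r for r in df if r['type'] == 'TV Show']"
    print(run_generated_code(code, rows))
```

The missing-variable branch corresponds to the second intervention case mentioned above, where otherwise correct code fails only because the result was stored under a different name.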

4.2.1 Physical Machine Speciﬁcations

The complete system specifications of the physical machine used to run the LLM server are provided below. While the machine is powerful, it has certain limitations when it comes to offline LLM testing, particularly with models like Codestral and Qwen. Specifically, these models cannot be fully loaded onto the GPU, meaning that both the GPU and CPU are utilized during the code generation process. Although these hardware limitations affect speed and efficiency, the primary goal of this section of the thesis is to assess the capabilities of Codestral and Qwen in code generation as offline LLMs. Upgrading to more powerful hardware could speed up the code generation process, but this consideration is beyond the scope of this thesis and could be addressed in future research.


Server Hardware Specifications

System Information:
  – Motherboard: TUF GAMING X570-PLUS
  – Processor: AMD Ryzen 7 5800X 8-Core Processor
  – Memory: 32 GiB DDR4
  – Storage:
      * 1TB NVMe SSD (Samsung SSD 970 EVO Plus)
      * 4TB HDD (ST4000DM004-2U91)
  – GPU: NVIDIA GeForce RTX 3080 with 10 GiB of dedicated memory

Software Information:
  – LLM Server: LM Studio 0.2.21
  – Large Language Models:
      * Codestral v01 22B Q6_K
      * Qwen 2.5 Coder Instruct 14B Q6_K
  – Partial GPU Offload:
      * 17 layers for Codestral
      * 25 layers for Qwen
  – LLM Response Temperature: 0.7

The Codestral v01 22B Q6_K and Qwen 2.5 Coder 14B Q6_K models were selected from the Hugging Face hub [135] for use in this study. Both models were deployed using the LM Studio Server [136], which supports the deployment of multiple offline LLMs. This capability, along with the fact that the architecture of this study is not tied to a specific model, supports the claim that the proposed methodology is model-agnostic and can be generalized across various offline LLM platforms, as demonstrated through the evaluations using both Codestral and Qwen (see subsection 4.2.3). The Q6_K model versions were chosen for their efficiency, which is enabled by 6-bit quantization. Quantized models reduce memory usage and computational demands without substantially compromising accuracy, making them well-suited for offline environments where computational resources are often limited. Because the Q6_K versions maintain high performance while remaining resource-efficient, they are a suitable choice for the development context of this study.
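A back-of-the-envelope estimate helps explain why neither model fits entirely in the 10 GiB of GPU memory. The sketch below is an approximation under stated assumptions: the parameter counts are the nominal 22B and 14B from the model names, the figure of about 6.56 bits per weight for Q6_K is an assumed nominal value and not a measurement from this study, and the KV cache and runtime overhead are ignored.

```python
GIB = 1024 ** 3

BITS_PER_WEIGHT_Q6_K = 6.5625  # assumption: nominal bits per weight for Q6_K
VRAM_GIB = 10                  # dedicated GPU memory, from the hardware list

NOMINAL_PARAMS_BILLION = {     # assumption: nominal sizes from the model names
    "Codestral 22B": 22,
    "Qwen 2.5 Coder 14B": 14,
}


def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS_BILLION.items():
        size = weights_gib(params, BITS_PER_WEIGHT_Q6_K)
        fits = "fits" if size < VRAM_GIB else "does not fit"
        print(f"{name}: about {size:.1f} GiB of weights, {fits} in {VRAM_GIB} GiB")
```

Under these assumptions both sets of weights exceed the available GPU memory, with the 14B model exceeding it by a much smaller margin, which is consistent with the partial GPU offload settings listed above.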


Regarding the LLMs’ response temperature setting, it was set to 0.7. This choice was

made to strike a balance between creativity and reliability, which is crucial when assess-

ing a model’s code generation capabilities. Moderate temperature settings, such as 0.7,

allow the model to produce diverse outputs, while still maintaining suﬃcient coherence to

ensure the generated code is functional and relevant. This setting avoids overly determin-

istic responses that could limit the exploration of alternative coding solutions, enabling

more ﬂexibility in the generated results [137].
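For illustration, a request with this temperature setting could be built as shown below. LM Studio exposes an OpenAI-compatible local server, but the base URL, the model identifier, the system prompt, and the function names here are assumptions made for the sketch and are not taken from the thesis implementation.

```python
import json
import urllib.request

# Hypothetical instruction; the prompt actually used in the study may differ.
SYSTEM_PROMPT = (
    "Write Python code that operates on a DataFrame named df and stores "
    "the final result in a variable named processeddf. Return code only."
)


def build_request(user_query, model, temperature=0.7,
                  base_url="http://localhost:1234/v1"):
    """Build a chat-completion request for an OpenAI-compatible server."""
    payload = {
        "model": model,              # assumed identifier of the loaded model
        "temperature": temperature,  # 0.7, as used for all tests
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_code(request):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `generate_code(build_request(...))` requires a running local server; building the request alone does not touch the network.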

4.2.2 Testing Results Collection

Each test produced a set of data, which was structured into a single result object,

as shown in Listing 4.1. For each of the two LLMs, 250 of these objects were created

and compiled into a dataset with 250 entries, where each entry represents one test case.

Every entry includes detailed information about the test, covering aspects such as the

accuracy, clarity, and execution performance of the generated code, along with system

resource usage metrics. The following bullet points describe each attribute in detail,

providing a thorough summary of the collected data during testing.

•“correctness”: This column records either ‘True’ or ‘False’, indicating whether the

generated code successfully produced the intended output based on the given nat-

ural language query.

•“readability”: This column presents numerical values that reﬂect how easily a hu-

man can understand the generated code. The scores are calculated using a custom

function, ranging from 1 to 3, where 3 represents the highest readability.

•“code_execution_errors”: This column captures any errors encountered during the

execution of the generated code. If no error occurred, the value is ‘None’. If an

error is present, the column includes the corresponding error message to explain

the issue.

•“executed_command”: This column contains the complete code that was executed

for each test case. It may also include Python comments.

•“code_repetition_id”: This column indicates the repetition number for each test case.

Since each query was tested ten times, this ﬁeld takes values from 1 to 10, showing

which iteration the entry corresponds to.

•“dataset”: This column lists one of the ﬁve datasets used in the study. Each entry

names a single dataset: ‘supermarket’, ‘netﬂix’, ‘shared-cars-locations’, ‘covid19-

twitter’, or ‘madrid-daily-weather’. The dataset named in each entry identiﬁes which

dataset was used for the corresponding test.

•“user_query”: This column holds the exact query entered by the user and forwarded

to the oﬄine LLM for code generation. The corresponding generated code for each

query is available in the “executed_command” column.


•“llm_response_cpu”: This column shows the CPU usage percentages recorded dur-

ing the LLM’s code generation process.

•“llm_response_memory”: This column reports the memory usage of the LLM server

while generating code. The values are expressed as percentages.

•“llm_response_gpu”: This column reﬂects the GPU usage percentages observed on

the LLM server during code generation.

•“llm_response_gpu_mem”: This column displays the percentage of GPU memory

used by the LLM server during the code generation phase.

•“llm_response_response_time”: This column records the response times of the LLM

server, measured from when the query was received to when the generated code

was returned. The times are given in seconds.

•“automated”: This column indicates whether the code for each test was executed

automatically or with minimal human intervention. A value of ‘True’ means full

automation, while ‘False’ signiﬁes semi-automated execution.

•“query_no”: This column identiﬁes the query number for each test. Since ﬁve

queries were created for each dataset, the possible values are ‘q1’, ‘q2’, ‘q3’, ‘q4’,

and ‘q5’.

•“query_level”: This column reﬂects the complexity level of the query in each test.

The values are ‘basic’, ‘intermediate’, or ‘advanced’, based on the query’s diﬃculty.

Listing 4.1: A testing result (JSON object) from the ‘Netflix’ dataset, and the basic query ‘Count the Number of TV Shows per Country’.

{
  "correctness": true,
  "readability": 3,
  "code_execution_errors": "None",
  "executed_command": "# Count the number of TV Shows per country\n\nprocesseddf = df.filter(df['type'] == \"TV Show\").groupBy('country').count()",
  "code_repetition_id": "7",
  "dataset": "netflix",
  "user_query": "Count the Number of TV Shows per Country",
  "llm_response_cpu": 34.45,
  "llm_response_memory": 0.01,
  "llm_response_gpu": 20.25,
  "llm_response_gpu_mem": 89.03,
  "llm_response_response_time": 19.41,
  "automated": "true"
}


4.2.3 Evaluation

The evaluation was carried out using the collected results and the dataset constructed

from them, based on the main criteria described in subsection 3.3. The analysis produced

a set of key insights and recommendations. A summary of these ﬁndings, along with a

discussion on potential future directions, is provided in section ??. It is worth noting

that the ‘Contextual Complexity’ metric was examined in relation to all other evaluation

metrics. This combined analysis oﬀered a more complete understanding of query com-

plexity in relation to other metrics, along with its impact, leading to more meaningful

conclusions. Therefore, a separate subsection dedicated to Contextual Complexity was

not included.

Functional Correctness

This metric aims to evaluate how eﬀectively the tested LLMs generate correct code in

response to user queries. It measures whether the output code produces the expected

results. The ﬁrst step involves calculating the overall success rate, by comparing the

number of correct and incorrect outputs. Next, the analysis focuses on correctness across

diﬀerent datasets and query complexity levels. These two additional analyses provide a

more detailed view of correctness, oﬀering insight into how success rates vary depending

on speciﬁc aspects of the study. The results of all three tasks are presented below.

Figure 4.24. Functional correctness of the LLM’s generated code plot [4].

Figure 4.24 shows that the number of correct outputs from the code generation and execution process clearly exceeds the number of incorrect ones. In particular, Codestral produced correct results in 228 out of 250 tests (91.2%), with only 22 tests failing to meet expectations, which indicates strong performance in generating accurate code. Similarly, Qwen 2.5 Coder produced correct outputs in 223 tests (89.2%), while 27 tests yielded incorrect results. This comparison indicates that both LLMs can effectively interpret user queries and generate functioning code, with Codestral achieving a slightly higher success rate. To ensure the reliability of the results, manual validation was performed throughout the process: after each test, the generated output was checked against the intended result of the original query. For each case, Python scripts using the Pandas library were written to reproduce the expected results, which were then compared with the outputs generated by the LLMs.

Figure 4.25 presents a comparison of correctness scores across all five datasets. The models showed broadly consistent performance, suggesting that they can produce accurate code regardless of the dataset context. For Codestral, the lowest correctness rate was observed with the ‘Shared Cars Locations’ dataset at 0.86 (43 out of 50 tests), followed by the ‘COVID19-Twitter’ dataset at 0.88 (44 out of 50). The ‘Madrid Daily Weather’, ‘Netflix’, and ‘Supermarket Sales’ datasets each achieved a rate of 0.94 (47 correct outputs). In contrast, Qwen 2.5 Coder displayed a slightly different distribution: the ‘Netflix’ dataset had the lowest rate at 0.82 (41 out of 50 correct tests), followed by ‘COVID19-Twitter’ at 0.84 (42 correct). The ‘Madrid Daily Weather’ dataset reached a score of 0.92 (46 out of 50 correct), while both the ‘Shared Cars Locations’ and ‘Supermarket Sales’ datasets recorded the highest rate of 0.94 (47 correct tests each). These findings suggest that both LLMs can generate correct code across different datasets, with some variation in performance depending on the dataset.

Figure 4.25. Plot depicting the functional correctness, by dataset [4].

Figure 4.26 presents the relationship between correctness and query complexity. For Codestral, the lowest correctness rate is observed with advanced queries at 0.84 (84 out of 100 correct outputs), while intermediate queries reach a rate of 0.95 (95 out of 100), and basic queries achieve the highest performance at 0.98 (with only one incorrect output in 50 tests). Qwen 2.5 Coder follows a similar pattern: correctness stands at 0.83 for advanced queries (83 out of 100), at 0.91 for intermediate queries (91 correct outputs, slightly below Codestral’s 0.95), and at 0.98 for basic queries (49 out of 50 correct). Note that the number of tests for basic queries is half that for intermediate or advanced queries. Nevertheless, both models demonstrate a consistent trend: as query complexity increases, the likelihood of incorrect outputs also rises. This indicates that both LLMs face more challenges when handling advanced queries, which is an expected outcome given the higher level of contextual and logical complexity involved.
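The per-dataset and per-level counts quoted in this subsection can be cross-checked against the overall totals. The sketch below only re-adds numbers stated in the text; the dictionary layout and function names are illustrative.

```python
TESTS_PER_MODEL = 250

# Correct outputs per dataset (50 tests each), as reported for Figure 4.25.
CORRECT_BY_DATASET = {
    "Codestral": {
        "shared-cars-locations": 43, "covid19-twitter": 44,
        "madrid-daily-weather": 47, "netflix": 47, "supermarket": 47,
    },
    "Qwen 2.5 Coder": {
        "netflix": 41, "covid19-twitter": 42, "madrid-daily-weather": 46,
        "shared-cars-locations": 47, "supermarket": 47,
    },
}

# (correct, total) per complexity level, as reported for Figure 4.26.
CORRECT_BY_LEVEL = {
    "Codestral": {"basic": (49, 50), "intermediate": (95, 100), "advanced": (84, 100)},
    "Qwen 2.5 Coder": {"basic": (49, 50), "intermediate": (91, 100), "advanced": (83, 100)},
}


def total_correct_by_dataset(model):
    return sum(CORRECT_BY_DATASET[model].values())


def total_correct_by_level(model):
    return sum(correct for correct, _ in CORRECT_BY_LEVEL[model].values())


if __name__ == "__main__":
    for model in CORRECT_BY_DATASET:
        by_dataset = total_correct_by_dataset(model)
        by_level = total_correct_by_level(model)
        rate = by_dataset / TESTS_PER_MODEL
        print(f"{model}: {by_dataset} (datasets), {by_level} (levels), rate {rate:.3f}")
```

Both breakdowns add up to the same totals per model, so the three views of functional correctness are mutually consistent.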

Figure 4.26. Plot for the functional correctness scores by the queries’ complexity levels [4].

Readability

The objective of this evaluation metric is to examine the readability of code generated

by each LLM, as this can inﬂuence the ease with which a human can understand the

code. The analysis consists of three tasks, starting with the distribution of readability

scores across all tests. Also, the average readability scores are provided, categorized

by dataset and query complexity. The approach to this analysis is analogous to the

one used in assessing Functional Correctness, as previously examined. As described in

subsection 3.3, the readability score is derived from a custom function, which produces

scores ranging from ‘1’ to ‘3’, where a score of ‘3’ indicates highly readable code. This

function evaluates the code based on factors such as line length, method call chains,

and nested structures, assigning penalties for lines exceeding 80 characters, method call

chains longer than three calls, and nesting depths greater than two levels.
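The custom function itself is defined in subsection 3.3 and is not reproduced here. The sketch below is only one possible reading of the three rules, under loudly stated assumptions: one point is deducted per violated rule with a floor of 1, nesting depth is estimated from four-space indentation, and a method call chain is approximated by counting `.name(` patterns on a line. None of these choices is confirmed by the text.

```python
import re

MAX_LINE_LENGTH = 80     # from the text
MAX_CHAIN_CALLS = 3      # from the text
MAX_NESTING_DEPTH = 2    # from the text
INDENT_WIDTH = 4         # assumption: four spaces per nesting level

CALL_PATTERN = re.compile(r"\.\w+\(")  # crude proxy for a chained method call


def readability_score(code):
    """Return a score from 1 to 3; 3 means no rule was violated."""
    lines = [
        line for line in code.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    long_line = any(len(line) > MAX_LINE_LENGTH for line in lines)
    long_chain = any(
        len(CALL_PATTERN.findall(line)) > MAX_CHAIN_CALLS for line in lines
    )
    deep_nesting = any(
        (len(line) - len(line.lstrip(" "))) // INDENT_WIDTH > MAX_NESTING_DEPTH
        for line in lines
    )
    # Assumption: each violated rule costs one point, never dropping below 1.
    penalties = sum([long_line, long_chain, deep_nesting])
    return max(1, 3 - penalties)


if __name__ == "__main__":
    sample = (
        "# Count the number of TV Shows per country\n\n"
        "processeddf = df.filter(df['type'] == \"TV Show\")"
        ".groupBy('country').count()"
    )
    print(readability_score(sample))
```

A parser-based implementation (for example, one built on Python's `ast` module) would measure chains and nesting more reliably than this line-based heuristic.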

The results shown in Fig. 4.27 indicate that the majority of tests using Codestral

received a readability score of ‘2’. Speciﬁcally, 176 out of 250 tests were assigned this

score, while the remaining 74 tests earned the highest score of ‘3’. No tests were rated with

a score of ‘1’. In a similar evaluation, Qwen 2.5 Coder displayed a comparable pattern,

with 177 out of 250 tests scoring ‘2’ and 71 tests scoring ‘3’. However, two tests generated

by Qwen were assigned a score of ‘1’. While this is a small and potentially insigniﬁcant

number, it suggests that, on rare occasions, the code produced by Qwen may exhibit more


noticeable readability issues. For both models, the primary reason for not achieving the

highest score was the presence of long lines of code. This indicates that the generated code

often exceeds the recommended 80-character limit, a factor penalized by the readability

function, as such lines are considered harder to read and maintain. While longer lines

can sometimes improve code eﬃciency by consolidating commands, they often detract

from the readability for humans, which is the main reason most generated code did not

attain the highest readability score of ‘3’.

Figure 4.27. Plot for the distribution of readability scores across the tests conducted [4].

Fig. 4.28 presents the average readability scores for each dataset. For Codestral,

the ‘Madrid Daily Weather’ dataset shows the lowest average readability score at 2.08,

with 46 out of 50 tests receiving a score of ‘2’. In comparison, Qwen 2.5 Coder achieves

a slightly higher average of 2.16 on the same dataset, with 42 out of 50 tests scoring

‘2’, suggesting a marginally better format optimization. Similarly, for the ‘Shared Cars

Locations’ dataset, Codestral’s average score is 2.28, with 36 out of 50 tests rated ‘2’,

whereas Qwen records a slightly higher average of 2.38, with 31 out of 50 tests receiving

a ‘2’ rating. In the case of the ‘Netflix’ dataset, Codestral attains an average score of 2.30,

with 35 out of 50 tests rated ‘2’, while Qwen has a lower average of 2.16, with 38 out of

50 tests rated ‘2’, but also includes 2 tests with a score of ‘1’. The ‘Supermarket Sales’

dataset yields similar averages for both models, with Codestral scoring 2.38 and Qwen

slightly ahead at 2.40. For the ‘COVID19-Twitter’ dataset, Codestral surpasses Qwen

with an average score of 2.44, compared to Qwen’s 2.28. While the results do not lead to

deﬁnitive conclusions or recommendations, they may suggest that certain datasets allow

for more eﬃcient operations (in terms of the number of commands used), potentially

reﬂecting the speciﬁc nature and context of the data in these datasets.

Figure 4.28. Plot depicting the average readability scores by dataset [4].

Fig. 4.29 displays the average readability scores categorized by query complexity levels. For Codestral, basic queries achieve the highest average readability score of 2.8, with 40 out of 50 tests generating code that received a readability score of ‘3’. Intermediate queries have an average score of 2.26, with 74 out of 100 tests scoring ‘2’, while advanced queries yield the lowest average of 2.08, with 92 out of 100 tests rated ‘2’ and only 8 tests scoring ‘3’. For Qwen 2.5 Coder, basic queries achieve a lower average readability score of 2.58, with 29 out of 50 tests receiving a ‘3’ rating. Intermediate queries have an average of 2.29, with 69 out of 100 tests rated ‘2’ and one test scoring ‘1’. Advanced queries average 2.11, with 87 out of 100 tests rated ‘2’, 12 scoring ‘3’, and one test rated ‘1’. These results suggest that both models generally produce less readable code as query complexity increases, which is expected. It is worth noting that Codestral tends to generate concise, single-line code for basic queries, whereas Qwen’s code for simpler queries appears to be slightly less concise. For intermediate and advanced queries, both models predominantly generate code with longer lines, which contributes to the higher occurrence of readability scores of ‘2’.
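The averages by complexity level follow directly from the score counts. In the sketch below, the counts that the text does not state explicitly (for example, the number of basic Codestral tests scoring ‘2’) are inferred by subtraction from the number of tests per level, which is an assumption of this illustration.

```python
def average_score(counts):
    """Weighted mean of readability scores, given {score: number_of_tests}."""
    total_tests = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total_tests


# Score counts per complexity level; some entries are inferred by subtraction.
CODESTRAL = {
    "basic": {2: 10, 3: 40},
    "intermediate": {2: 74, 3: 26},
    "advanced": {2: 92, 3: 8},
}
QWEN = {
    "basic": {2: 21, 3: 29},
    "intermediate": {1: 1, 2: 69, 3: 30},
    "advanced": {1: 1, 2: 87, 3: 12},
}

if __name__ == "__main__":
    for model, levels in (("Codestral", CODESTRAL), ("Qwen 2.5 Coder", QWEN)):
        for level, counts in levels.items():
            print(f"{model}, {level}: {average_score(counts):.2f}")
```

The resulting values match the averages reported for both models, and the per-level counts of ‘3’ scores add up to the overall totals of 74 and 71.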

Figure 4.29. Plot displaying the average readability by query complexity level [4].


Eﬃciency

This metric is designed to evaluate the performance of the LLMs in terms of their

computational resource usage during the code generation process. As detailed in subsec-

tion 4.2.1, the Codestral and Qwen 2.5 Coder models are not fully loaded onto the GPU.

This results in both the GPU and CPU being utilized during the code generation process,

particularly the GPU’s memory. The ﬁrst task in the analysis is to examine the distribu-

tion of GPU usage, CPU usage, GPU memory, system memory (RAM), and response time

across the tests conducted for each LLM. The second task is to analyze computational

resource usage based on query complexity and readability, in order to understand how

performance varies across diﬀerent query complexity levels and readability scores. It is

important to note that readability scores of ‘2’ generally correspond to longer lines of code,

which tend to require longer periods for code generation.

Fig. 4.30 shows the distribution of response times (in seconds) during the code gen-

eration process for both LLMs. Response time is measured from the moment the natural

language query is sent to the LLM, until the complete response is received. As depicted

in the ﬁgure, most code generation processes had response times under 100 seconds

for both models. For Codestral, the average response time was 79.64 seconds, with a

median of 67.90 seconds, and some instances exceeding 200 seconds, reaching a maxi-

mum of 319.53 seconds. On the other hand, Qwen 2.5 Coder demonstrated signiﬁcantly

better eﬃciency, with an average response time of 27.66 seconds, a median of 22.98

seconds, and a maximum of 81.38 seconds. This considerable diﬀerence suggests that

Codestral’s slower performance is largely due to the partial oﬄoading of operations to the

GPU, whereas Qwen beneﬁts from a larger proportion of operations being oﬄoaded to the

GPU. Since Qwen has fewer parameters (14B compared to Codestral’s 22B), its smaller

total size allows it to achieve response times closer to real-time standards.

Figure 4.30. Distribution of the LLM server’s response time during code generation [4].

Figures 4.31 and 4.32 illustrate the distribution of CPU and memory usage (as percentages) for the LLM server during the code generation process. For Codestral, the average CPU usage was approximately 28.5%, and only a small number of processes exceeded 50%, indicating that the CPU demand during code generation was generally moderate. In contrast, Qwen 2.5 Coder displayed an average CPU usage of around 37.6%, with its interquartile range indicating that many processes consumed between 27% and 47.8% of the available CPU resources. Regarding system memory, both models showed minimal RAM usage. Codestral’s average memory usage was about 0.44%, with nearly all queries falling within the 0–2% range. Qwen’s average was slightly higher at around 0.45%, though its distribution was somewhat narrower, with a median of 0.48% and a maximum of 4.65%. These findings suggest that, despite the moderate CPU usage (particularly for Qwen), the workload also drew on other resources, as expected, since both LLMs were deployed across both GPU and CPU.

Figure 4.31. Distribution of the LLM server’s CPU usage during code generation [4].

Figure 4.32. Distribution of the LLM server’s memory usage during code generation [4].


Figures 4.33 and 4.34 oﬀer additional insights into resource usage, by displaying the

overall GPU utilization (as percentages) during the code generation processes across all

tests. In terms of GPU usage, most processes utilized approximately 20% of the GPU’s

capacity. For Codestral, the average GPU usage was around 21.3%, with the interquartile

range spanning from roughly 18.7% to 21.8%. Qwen 2.5 Coder demonstrated a slightly

lower average of 18.0%, with most tests falling between 14.4% and 19.3%. These mod-

erate GPU usage levels are consistent with the observed CPU utilization, reﬂecting the

shared workload between CPU and GPU for both models. In contrast, GPU memory us-

age was consistently high for both LLMs. Codestral used an average of approximately

88.8% of the available GPU memory, while Qwen averaged around 90.1%. This suggests

that a signiﬁcant portion of each model’s parameters was loaded into GPU memory. The

combination of moderate GPU and CPU usage with high GPU memory consumption im-

plies that the code generation tasks likely involved short bursts of computation, rather

than sustained, high-load processing. The system appears to have maximized the use of

GPU memory, while distributing the computational workload between the GPU and CPU.

Furthermore, the relatively moderate levels of GPU usage may indicate that the generated

code was not particularly complex for the models to handle, resulting in lower overall

computational demands.

Figure 4.33. Distribution of the LLM server’s GPU usage during code generation [4].

Figure 4.34. Distribution of the LLM server’s GPU memory usage during code generation [4].

Figures 4.35 and 4.36 show the average response times based on the readability of the generated code and the query complexity level. In terms of readability, Codestral required significantly more time to generate code rated with a readability score of ‘2’, averaging around 93.7 seconds per query. In contrast, code with a readability score of ‘3’ was generated in approximately 46.2 seconds. A similar pattern is observed for Qwen 2.5 Coder, though with generally lower response times. Specifically, tests rated ‘2’ took about 31.6 seconds on average, while those rated ‘3’ completed in roughly 18.1 seconds. This difference is expected, as a score of ‘2’ typically reflects longer lines of code, which require more processing, whereas a score of ‘3’ indicates shorter, more concise code that is quicker to generate. Qwen’s faster overall response times are consistent with its smaller size (14B parameters) and greater capability to offload operations to the GPU, compared to Codestral.

Figure 4.35. The LLM server’s average response time, by code readability [4].


Figure 4.36. The LLM server’s average response time, by query complexity level [4].

The response time results based on query complexity follow a predictable pattern.

For Codestral, basic queries had the shortest average response time at approximately

26.5 seconds. Intermediate queries took around 67.5 seconds, while advanced queries

required close to 118.3 seconds per query. Qwen 2.5 Coder showed a similar progression,

with basic queries averaging 10.3 seconds, intermediate ones at about 23.6 seconds,

and advanced queries taking roughly 40.5 seconds. These results are consistent with

expectations, as simpler queries typically lead to shorter, less complex code, while more

advanced queries involve greater processing and longer generation times.

Figures 4.37 and 4.38 present the average CPU and GPU usage levels based on code

readability and query complexity. A notable observation from both ﬁgures is that CPU and

GPU usage levels are relatively similar between the two readability scores. For CPU usage,

Codestral’s values for readability score ‘2’ fall between approximately 27.5% and 32.3%,

while those for score ‘3’ range from around 23.3% to 32.6%. A similar pattern is seen

with Qwen 2.5 Coder: For readability score ‘2’, CPU usage ranges from roughly 34.0% to

37.6%, whereas for score ‘3’, it ranges from 41.6% to 42.6%. The two Qwen cases with

a readability score of ‘1’ were excluded from analysis, due to their limited representation.

As for GPU usage, Codestral’s values for readability score ‘2’ range from about 20.2% to

21.5%, and for score ‘3’ from 20.3% to 22.4%. Qwen’s GPU usage under readability score

‘2’ ranges between 15.9% and 18.4%, while under score ‘3’, it increases slightly, ranging

from 17.4% to 20.98%. Overall, the CPU and GPU utilization appear consistent across

readability levels.


Figure 4.37. Average CPU usage of the LLM server, by readability and query complexity

[4].

Figure 4.38. The LLM server’s average GPU usage, by readability and query complexity

[4].

These results can be explained by the system’s operational behavior during code gen-

eration. As the process progresses, the system tends to reach a steady state in terms of

CPU and GPU usage. Once this stable condition is established, the level of resource

consumption generally remains consistent until the response is fully generated, regardless

of the code’s complexity or length. This behavior may also help explain another ﬁnding

related to query complexity: In certain cases, basic queries show higher average CPU

and GPU usage than intermediate or advanced queries, and occasionally even record the

highest usage among the three.

Figure 4.39. Average response times of the LLM server, by readability and query complexity [4].

Since basic queries typically require much shorter generation times (as shown in Fig.

4.39), their CPU and GPU usage is more aﬀected by the system’s initial performance

surge. In the case of intermediate queries, this initial spike tends to have less impact, as

the longer duration of the process allows these early values to be smoothed out. The same

reasoning applies to advanced queries. As time passes, the system enters a more stable

phase where CPU and GPU usage levels oﬀ, while GPU memory usage remains consistently

high throughout. This interpretation is supported by the average GPU memory usage

shown in Fig. 4.40, where all readability scores and query complexity levels exhibit

similarly high and steady memory demand. Nonetheless, it is important to highlight that

the diﬀerences in usage values across all cases are relatively minor, so these observations

should be considered with caution.
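The averaging effect described above can be illustrated with a small numerical sketch. The surge duration, surge level, and steady-state level below are illustrative assumptions, not measurements from this study; only the three run durations are taken from Codestral's reported average response times per complexity level.

```python
def average_usage(duration_s, surge_s=5.0, surge_level=60.0, steady_level=25.0):
    """Time-weighted mean utilisation (%) of a run that starts with a surge.

    surge_s, surge_level and steady_level are assumed values chosen only
    to illustrate the effect; they are not taken from the experiments.
    """
    surge_time = min(duration_s, surge_s)           # seconds spent in the surge
    steady_time = max(duration_s - surge_s, 0.0)    # seconds spent in steady state
    return (surge_time * surge_level + steady_time * steady_level) / duration_s


# Durations follow Codestral's reported average response times (seconds).
for label, duration in [("basic", 26.5), ("intermediate", 67.5), ("advanced", 118.3)]:
    print(f"{label:>12}: {average_usage(duration):.1f}% mean usage")
```

Under these assumed values, the shortest run reports the highest mean usage even though all three runs share the same steady state, which is the pattern the text attributes to basic queries.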


Figure 4.40. Average GPU memory usage of the LLM server, by readability and query

complexity [4].

Automation

This evaluation metric aims to measure how eﬀectively the generated code can be

executed automatically, without the need for manual adjustments. After each response

is generated by the LLM, it is reviewed by the end-user. If the code is immediately usable

(executable) for data processing, it is labeled as True for automation. Conversely, if minor

manual modiﬁcations are required, such as slight adjustments or isolating a functional

part of it, the code is labeled as False, even if it ultimately produces the correct result.

The analysis begins by examining how many tests fall into each of the two categories.

Subsequently, the automation status is evaluated in relation to functional correctness

and query complexity. These steps are intended to assess each LLM’s ability to deliver

code that directly fulﬁlls the prompt instructions, and could also be used as part of an

automated data analysis pipeline.

Figure 4.41 shows the automation categorization across all 250 tests. For Codestral,

202 responses were marked as True, indicating that the generated code could be used

without any manual intervention. The remaining 48 tests were labeled as False, suggesting

that some human involvement was necessary before execution. This corresponds

to an automation success rate of approximately 80.8%, meaning that roughly 1 in 5 responses still required

some adjustment. In contrast, Qwen 2.5 Coder produced 239 fully automated responses,

achieving a success rate of roughly 95.6%, with only 11 tests requiring minor edits. These

results suggest that Qwen may currently be more suitable for use in a fully automated

environment, while Codestral might beneﬁt from further reﬁnement or optimization of its

initial, pre-trained conﬁguration.


Figure 4.41. The number of automated and semi-automated tests, based on whether human

intervention in their code was required [4].

As previously mentioned, there are two main reasons why human intervention was

occasionally required for executing the generated code. First, despite explicit instructions

in the prompt to avoid it, the LLMs occasionally included additional explanatory text not

formatted as Python comments. In such cases, this extraneous text was manually re-

moved, retaining only Python-compliant comments where applicable. If this adjustment

had not been made, the code application step would have failed, as the exec() function

cannot interpret plain text that is not syntactically valid Python code. Second, the work-

ﬂow expected the ﬁnal output of the generated code to be assigned to a variable named

“processeddf”. If the LLM omitted this, or assigned the result to a diﬀerent variable, the

ﬁnal output was manually redirected to the “processeddf” variable. Without this variable

assignment, the system would be unable to continue with the subsequent result extrac-

tion step. Although the prompt explicitly instructed the LLMs to assign the ﬁnal result to

“processeddf”, this requirement was not always met.
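A minimal sketch of how these two checks could be automated before the code application step is given below. It is not the implementation used in this study: the function names are hypothetical, a plain Python list stands in for a Spark DataFrame, and the result variable is spelled as it appears in the text.

```python
import ast

RESULT_VAR = "processeddf"  # variable name as spelled in the text


def check_generated_code(response_text, result_var=RESULT_VAR):
    """Return (ok, reason): can the response be passed to exec() unchanged?"""
    try:
        tree = ast.parse(response_text)  # fails on explanatory plain text
    except SyntaxError as exc:
        return False, f"not valid Python: {exc.msg}"
    # Collect every simple name that is the target of an assignment.
    assigned = {
        target.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Assign)
        for target in node.targets
        if isinstance(target, ast.Name)
    }
    if result_var not in assigned:
        return False, f"result is never assigned to {result_var!r}"
    return True, "ok"


def run_generated_code(response_text, context, result_var=RESULT_VAR):
    """Execute the response with the given context and return the result."""
    ok, reason = check_generated_code(response_text, result_var)
    if not ok:
        raise ValueError(reason)
    namespace = dict(context)  # copy, so the caller's context is not mutated
    exec(response_text, namespace)
    return namespace[result_var]


# Three hypothetical responses: usable as-is, with extra text, wrong variable.
good = "# keep positive values\nprocesseddf = [x for x in df if x > 0]"
chatty = "Here is the code you asked for:\nprocesseddf = df"
wrong_var = "result = df"

for name, text in [("good", good), ("chatty", chatty), ("wrong_var", wrong_var)]:
    print(name, check_generated_code(text))
print(run_generated_code(good, {"df": [-1, 1, 2]}))
```

A response that fails either check corresponds to the False automation label. The check only covers simple assignments, and because exec() runs arbitrary code, a helper of this kind would presumably be used only in a controlled environment.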

Figures 4.42 and 4.43 examine the relationship between automation outcomes and

two additional metrics: functional correctness and query complexity. Concerning func-

tional correctness, the results demonstrate that automation status did not signiﬁcantly

aﬀect the correctness of the generated code. In other words, whether or not manual re-

ﬁnement was needed, the likelihood of producing a functionally correct result remained

relatively stable. For Codestral, 75% of all tests (187 out of 250) produced correct

outputs when the code was fully automated, while another 16.4% (41 out of 250) were correct

despite requiring minor human intervention. Qwen 2.5 Coder exhibited a similar trend,

achieving 213 correct outcomes from its automated responses, corresponding to 85%,

and an additional 4% (10 out of 250) correct outcomes after slight manual adjustments.

These ﬁndings suggest that human intervention was generally limited to formatting or

output variable compliance, rather than corrections to the underlying logic of the code.
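The relationship between automation and correctness can be recomputed from the counts quoted above. The sketch below uses only those counts; the dictionary layout and function name are illustrative choices, and the conditional rates (correctness within each automation category) are derived values rather than figures reported in the text.

```python
TOTAL = 250

# Counts quoted in the text: tests per automation category and how many
# of those produced a functionally correct result.
results = {
    "Codestral": {"automated": 202, "automated_correct": 187,
                  "edited": 48, "edited_correct": 41},
    "Qwen 2.5 Coder": {"automated": 239, "automated_correct": 213,
                       "edited": 11, "edited_correct": 10},
}


def summarize(r):
    """Overall correctness and correctness within each automation category."""
    return {
        "correct_overall_pct": 100 * (r["automated_correct"] + r["edited_correct"]) / TOTAL,
        "correct_given_automated_pct": 100 * r["automated_correct"] / r["automated"],
        "correct_given_edited_pct": 100 * r["edited_correct"] / r["edited"],
    }


for model, counts in results.items():
    summary = summarize(counts)
    print(model, {key: round(value, 1) for key, value in summary.items()})
```

Comparing the two conditional rates per model is one way to check the claim that correctness does not depend strongly on whether manual edits were needed.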


Figure 4.42. Comparing Functional Correctness results with Automation occurrences [4].

Figure 4.43. Presenting the automation occurrences by the query complexity levels [4].

As for the latter ﬁgure (4.43), the automation rate for Codestral was 86% for both

basic and intermediate queries (43 out of 50, and 86 out of 100 tests, respectively),

while advanced queries demonstrated a slightly lower automation rate of 73% (73 out

of 100 tests). In comparison, Qwen 2.5 Coder exhibited consistently higher automation

performance across all levels of query complexity, achieving 100% automation for basic

queries (50 out of 50), 91% for intermediate queries (91 out of 100), and 98% for advanced

queries (98 out of 100). While the diﬀerences in Codestral’s automation percentages


are relatively minor, they may indicate a tendency for the model to introduce additional

explanatory content in more complex scenarios, which occasionally necessitates minor

human intervention. Nonetheless, these variations are not pronounced enough to support

strong conclusions. Overall, both LLMs demonstrated high levels of automation across

all complexity levels, reinforcing their potential to generate executable code with little to

no manual reﬁnement.
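The per-level automation rates follow directly from the counts given above, and the counts can be cross-checked against the overall totals of 202 and 239 automated responses. The sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# Number of tests per query complexity level, as stated in the text.
tests_per_level = {"basic": 50, "intermediate": 100, "advanced": 100}

# Automated (True) responses per level, as stated in the text.
automated = {
    "Codestral": {"basic": 43, "intermediate": 86, "advanced": 73},
    "Qwen 2.5 Coder": {"basic": 50, "intermediate": 91, "advanced": 98},
}


def rates(counts):
    """Automation rate (%) for each complexity level."""
    return {level: 100 * counts[level] / tests_per_level[level]
            for level in tests_per_level}


for model, counts in automated.items():
    print(model, rates(counts), "total automated:", sum(counts.values()))
```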

Error Handling

This evaluation metric aims to assess the robustness of the generated code, specif-

ically focusing on the frequency and severity of errors, as well as their impact on the

functional correctness of the results. The three plots presented in this section illustrate

the distribution of errors across all tests, segmented by dataset, query complexity level,

and the potential inﬂuence of errors on output accuracy. As shown in Figure 4.44, only

20 out of 250 tests conducted with Codestral contained errors in the generated code,

yielding a success rate of 92%, which stands as a strong indication of the model’s relia-

bility. Furthermore, when considering both the presence of errors and the correctness of

results, 218 out of 250 tests featured error-free code that produced accurate outcomes,

corresponding to an overall success rate of 87%. These ﬁndings highlight Codestral’s ro-

bustness, and its capacity to consistently generate syntactically and functionally sound

code.

Figure 4.44. Comparing Functional Correctness results with error counts [4].

In comparison, Qwen 2.5 Coder demonstrated a somewhat higher incidence of errors,

with 41 out of 250 tests containing erroneous code, corresponding to a success rate of

roughly 84% (209 out of 250). When functional correctness is considered, 200 out of 250 tests involved error-free

code that produced the correct result, leading to an 80% success rate. These results

remain encouraging for both LLMs, particularly given the complexity of the task:

generating PySpark code using Spark DataFrames. The structural and functional differences

between Spark and the more commonly used pandas DataFrames could have introduced

considerable challenges. Consequently, the observed low error rates represent a notewor-

thy achievement. However, it is important to note that, in Qwen’s case, the prompt had

to be supplemented with an additional instruction to guide the model toward generating

valid PySpark commands, as it occasionally intermingled them with invalid pure Python

syntax. This point will be elaborated upon in subsection 5.3.1.

Figures 4.45 and 4.46 illustrate the distribution of error occurrences and their cor-

responding impact on functional correctness, across the ﬁve datasets used in testing, as

well as across the three levels of query complexity. While the relatively small number of

errors makes it premature to draw deﬁnitive conclusions, some preliminary observations

can be made to guide future research eﬀorts. Regarding the dataset-wise distribution,

Codestral exhibited the highest number of error-labeled tests in the ‘Shared Cars Lo-

cations’ dataset, with a total of 13 errors. This is markedly higher than in the other

datasets, where the ‘COVID-19 Twitter’ dataset yielded 3 errors, the ‘Supermarket Sales’

dataset 2 errors, and both the ‘Madrid Daily Weather’ and ‘Netﬂix’ datasets only 1 error

each. Similarly, Qwen 2.5 Coder recorded the highest error count in the ‘Shared Cars

Locations’ dataset, with 22 errors. As for the remaining ones, Qwen produced 7 errors in

the ‘Netﬂix’ dataset, 6 in the ‘COVID-19 Twitter’ dataset, 4 in the ‘Madrid Daily Weather’

dataset, and 2 in the ‘Supermarket Sales’ dataset.
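The per-dataset error counts listed above can be tallied to recover each model's total error count, its share of error-free tests, and the share of errors concentrated in the most affected dataset. The sketch below uses only the counts from the text; the helper name and output keys are illustrative.

```python
TOTAL_TESTS = 250

# Error-labeled tests per dataset, as listed in the text.
errors_per_dataset = {
    "Codestral": {
        "Shared Cars Locations": 13, "COVID-19 Twitter": 3,
        "Supermarket Sales": 2, "Madrid Daily Weather": 1, "Netflix": 1,
    },
    "Qwen 2.5 Coder": {
        "Shared Cars Locations": 22, "Netflix": 7, "COVID-19 Twitter": 6,
        "Madrid Daily Weather": 4, "Supermarket Sales": 2,
    },
}


def error_summary(per_dataset, total_tests=TOTAL_TESTS):
    """Total errors, error-free rate, and the dataset with the most errors."""
    n_errors = sum(per_dataset.values())
    worst = max(per_dataset, key=per_dataset.get)
    return {
        "errors": n_errors,
        "error_free_pct": 100 * (total_tests - n_errors) / total_tests,
        "worst_dataset": worst,
        "worst_share_pct": 100 * per_dataset[worst] / n_errors,
    }


for model, per_dataset in errors_per_dataset.items():
    print(model, error_summary(per_dataset))
```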

Figure 4.45. The number of errors found in the tests, per dataset [4].


Figure 4.46. The total count of errors, grouped by query complexity levels [4].

Regarding the distribution of functional correctness, it was observed that in most

cases, the presence of errors in the generated code led to incorrect outcomes, indicating

that such errors adversely aﬀected the desired output. However, an exception was noted

in the case of the ‘Shared Cars Locations’ dataset, where 10 tests from Codestral and 20

tests from Qwen 2.5 Coder included code that contained errors, but still produced correct

results. As shown in Fig. 4.46, the majority of these cases corresponded to the dataset’s

two intermediate queries. Nevertheless, Qwen also exhibited a noticeable number of such

instances in basic queries. The primary reason for this phenomenon lies in how both

LLMs interpret and handle the dataset’s timestamp ﬁelds. Although the models occa-

sionally failed to correctly modify, or explicitly deﬁne these ﬁelds as instructed, PySpark’s

internal handling of datetime data enabled the system to operate correctly, despite these

shortcomings. Consequently, the functional correctness of the results remained intact,

even when the code deviated from the expected format. For Codestral, these 10 instances

account for 20% of the total tests conducted on the ‘Shared Cars Locations’ dataset, while

for Qwen, the 20 successful (yet erroneous) tests represent 40% of the same dataset.

An early interpretation of this behavior suggests that both Codestral and Qwen may

require further reﬁnement in understanding the contextual and relational dynamics be-

tween columns in datasets involving location information, particularly in aligning times-

tamp data with their corresponding spatial records. To validate and expand upon this

observation, additional studies should be conducted using alternative forms of location-based

datasets, such as those involving indoor positioning data [5]. An indoor positioning

use case for a future application of this thesis’s study is presented in Chapter 6.


Chapter 5

Thesis Findings: Critical Assessment of the Proposed System

5.1 General Note

As a concept, this thesis’s scalable data management and AI-powered analytics pro-

posal can oﬀer a promising alternative approach in the domain of data analysis, proﬁling,

and quality assessment. Its capability to analyze entire datasets, support user-deﬁned

quality rules, and operate independently of data volume, can make it a useful framework

for both data analysts and engineers. In addition, the insights gained from its use are ex-

pected to highlight the beneﬁts of analyzing complete datasets with ease, making queries

in natural language and having an offline LLM translate them into executable analysis code.

As previously noted, the development of this framework consists of an implementation

introduced in a publication in 2023 [1], as well as a comprehensive study in 2025 [4].

5.2 Validating Scalable Data Management Frameworks

Regarding the thesis’s underlying software infrastructure, further research could be

conducted to enhance the capabilities of the scalable data management layer and demon-

strate its applicability across multiple business sectors, beyond its current tested environ-

ments. Preliminary testing with telecommunications data from the OTE Group supports

this broader potential. Evaluating the layer within industrial environments would pro-

vide valuable insights into its versatility. Future development eﬀorts should prioritize the

long-term integration of assets like the scalable data management layer in CI sectors and

business organizations, assessing their overall impact.

Continued support from the European Union in academic institutions and companies

(through participation in research initiatives) can prove to be important, particularly with

a shift towards solutions that are closer to real-world deployment. This recommendation

also extends to global research teams interested in eﬃcient and scalable data manage-

ment. Future frameworks should undergo comprehensive testing across diverse domains

to ensure they can handle large volumes of heterogeneous data that vary in structure,

type, and format. A current limitation of the proposed scalable data management layer

is its focus on tabular data, which, as of now, restricts its ability to process other data


types. This issue should be addressed by adapting the architecture to accommodate a

wider range of data formats. Lastly, while the layer has proven to be eﬀective within a set

of domain-speciﬁc datasets, its broader potential remains to be fully explored.

5.3 Entrusting Oﬄine LLMs for Future Data Analytics

5.3.1 Strengths and Limitations

As the main and ﬁnal component of this thesis’s framework, the AI-powered data ana-

lytics asset has demonstrated solid signs of operational eﬃciency. The results highlighted

the strong performance of the two models tested, mainly in generating functional code

scripts aligned with the given objectives. In the case of Codestral, 87% of the 250 test

cases were fully successful, indicating both functional correctness and error-free execu-

tion. Similarly, Qwen 2.5 Coder achieved a full success rate of 80% across the same test

set. Moreover, the generated code from both models generally received readability scores

between ‘2’ and ‘3’, suggesting it was easy for humans to interpret. Only 2 of Qwen’s

250 tests received the lowest readability score of ‘1’. In terms of automation, Codestral

successfully automated 80% of its outputs, while Qwen surpassed this with a 95.6%

automation rate, reducing the need for human intervention. Overall, when considering

correctness irrespective of minor errors or manual edits, Codestral reached a 91% success

rate, and Qwen achieved 90%. These ﬁndings demonstrate the reliable code-generation

capabilities of both models, even in the context of generating PySpark code, which is

typically more complex than standard Python.

Overall, both oﬄine LLMs proved capable of supporting data analysis workﬂows, with-

out requiring data to be uploaded into the models themselves. Instead, they generated

tailored code that was executed within the dedicated, scalable data management platform.

This outcome reinforces one of this thesis’s original aims: To enable secure and eﬀective

data analysis through end-user queries expressed in natural language. By facilitating

code generation in secure, on-premises environments, this oﬄine LLM-powered approach

addresses critical concerns related to data security, privacy, and integrity. The results

suggest that oﬄine LLMs can be further explored as reliable tools for code generation in

data analytics tasks, oﬀering a compelling alternative to online models, and potentially

replacing conventional data analysis frameworks.

Nonetheless, this thesis’s study also revealed certain limitations. Despite the robust

hardware speciﬁcations of the machine hosting the language models, performance was

not suﬃcient for fast, real-time code generation. On average, Codestral required approx-

imately 60 seconds to produce a response, while Qwen 2.5 Coder responded in about 25

seconds. These response times, particularly in the case of Codestral, are not ideal for

practical applications. While both CPU and GPU usage remained moderate, GPU memory

was consistently under heavy load. This indicates an early and clear need for a more

powerful GPU to achieve faster performance. As such, hardware limitations currently

represent the primary obstacle to scaling this approach further, based on the present

capabilities of oﬄine LLMs.


The ﬁeld of large language models is advancing rapidly. Under current technological

conditions, a major performance improvement would likely require a signiﬁcant GPU

upgrade, ideally one with suﬃcient dedicated memory to fully accommodate the LLM.

Achieving near-real-time performance would necessitate dedicating a high-performance

server exclusively to the model itself. However, the cost associated with such a system

would be substantial, potentially making it prohibitive for many organizations. Whether

such an investment is justiﬁed depends on each organization’s priorities and available

resources. Nevertheless, the ability to conduct data analysis securely — without exposing

sensitive data to external environments — remains a strong incentive for adopting oﬄine

LLMs in on-premise applications.

In addition to performance limitations, improvements in process automation, partic-

ularly for Codestral, are necessary. In 20% of its test cases, human intervention was

required to make minor adjustments to the generated code, such as removing extraneous

text or assigning the ﬁnal output to a speciﬁed variable, even when these instructions

were clearly stated in the prompt. Achieving full automation is essential for practical,

real-world use. By contrast, Qwen 2.5 Coder required manual edits in only 4.4% of its

test cases, indicating a higher level of readiness for deployment in automated environ-

ments.

Finally, while both Codestral and Qwen 2.5 Coder consistently followed the instruc-

tions provided in the main prompt, it was noted that Qwen initially tended to generate

partial solutions using standard Python, rather than valid PySpark operations. To miti-

gate this, the prompt for Qwen was modiﬁed to include the instruction: ‘Ensure the code

runs as valid PySpark DataFrame operations, not standard Python, and verify its execu-

tion within Spark.’ This adjustment highlights the importance of model-speciﬁc prompt

tuning. It should not be viewed as a shortcoming, but rather as a necessary reﬁnement

within the broader evaluation process.
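Model-specific prompt tuning of this kind can be kept separate from the shared prompt. The sketch below is an assumption about how that might be organized, not the study's code: the model keys, the placeholder base prompt, and the choice to append the extra instruction at the end are all hypothetical. Only the Qwen instruction itself is quoted from the text.

```python
# Placeholder for the main prompt shared by all models (not reproduced here).
BASE_PROMPT = "<main prompt shared by all models>"

# Extra instructions keyed by a hypothetical model identifier.
MODEL_SPECIFIC_INSTRUCTIONS = {
    "qwen2.5-coder": (
        "Ensure the code runs as valid PySpark DataFrame operations, "
        "not standard Python, and verify its execution within Spark."
    ),
}


def build_prompt(model_name, user_query, base_prompt=BASE_PROMPT):
    """Combine the shared prompt, the user query, and any model-specific text."""
    parts = [base_prompt, user_query]
    extra = MODEL_SPECIFIC_INSTRUCTIONS.get(model_name)
    if extra:
        parts.append(extra)  # appended last; the placement is an assumption
    return "\n\n".join(parts)


print(build_prompt("qwen2.5-coder", "Count the rows per category."))
```

Keeping the additions in a lookup table means that a newly evaluated model can receive its own instruction without changing the shared prompt.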

5.3.2 Future Work

Future work will focus on enhancing the code generation process, which lies at the

core of this thesis’s study, as a key to enabling eﬀortless end-user query submission.

An essential priority will be the ﬁne-tuning of Codestral and Qwen 2.5 Coder, with the

aim of achieving even higher performance. Should new large language models emerge

that demonstrate comparable or superior eﬃciency, they may also be considered for

evaluation. The primary objective of the ﬁne-tuning process will be to improve automation

levels by minimizing the generation of unnecessary text, and ensuring that ﬁnal results are

consistently stored in a designated variable recognized by the data processing platform,

as implemented in this study.

Although both models demonstrated high accuracy in producing error-free outputs

and correct ﬁnal results, ﬁne-tuning is expected to further improve these outcomes. Ul-

timately, a ﬁne-tuned oﬄine model, fully oﬄoaded to a more powerful GPU, could yield

better performance across nearly all evaluation metrics in a follow-up study of this thesis.

However, it is essential that appropriate safeguards are put in place during ﬁne-tuning to


prevent the introduction of bias, thereby preserving the model’s ability to generate reliable

and generalizable outputs across various data contexts.

Beyond the proposed improvements, future work may also investigate hybrid infras-

tructure conﬁgurations for the framework. Hybrid models combine oﬄine processing with

private cloud environments, allowing the system to take advantage of scalable computa-

tional resources while maintaining the privacy advantages inherent to oﬄine setups. This

approach could support more extensive model optimization and faster development iter-

ations. However, adopting such a hybrid architecture would introduce additional costs,

as it requires investment in both high-performance oﬄine hardware and scalable cloud

infrastructure, alongside the added complexity of ongoing system maintenance. This,

along with the necessity of constant integration with a scalable data management tool,

would make the ﬁnal system even more complex.

In parallel, expanding the proposed framework to include a dedicated user interface

(UI) could signiﬁcantly enhance its practical value. Currently, the research framework

(illustrated in Fig. 3.5) is designed primarily to evaluate model capabilities within a

controlled environment. Transitioning toward a production-ready solution will require the

development of a user-friendly interface that supports both technical and non-technical

users. Such an interface would promote real-world deployment and usability. Apart from

the development of a UI, future work should consider integrating additional features and

evaluation components needed to position this study as a comprehensive enterprise-ready

solution. The model-agnostic structure and extensible design of the framework can oﬀer

a clear path for evolving the current prototype into a fully functional platform, which will

ﬁnally be suitable for enterprise applications.


Chapter 6

Future Use Case: Real-Time Indoor Localization Frameworks

The following research has been published in [5]. It is included here with occasional

adaptations to maintain consistency within the context of this dissertation. It serves

as a potential future use case scenario, for the application of this thesis’s scalable and

AI-powered data management and analysis framework.

6.1 Indoor Localization in Today’s Applications

6.1.1 Location Information

Location information is a highly valuable asset, widely utilized across various busi-

nesses, organizations, and applications. Typically, a user’s location dataset is composed

of their geographic positions, commonly referred to as ‘coordinates’, which include latitude

and longitude values. These individual records can be aggregated into larger datasets,

containing the positions of multiple individuals, the coordinates of structures (such as

buildings), or notable locations (such as monuments) [138]. In addition, location datasets

may also include elevation and altitude details, oﬀering further context to end-users.

The sources of location data vary depending on the type and speciﬁc requirements

of each application. A primary source is the Global Positioning System (GPS), which

determines latitude and longitude by facilitating communication between satellites and

user devices. Other sources include Beacons, devices that transmit low-energy Bluetooth

signals, as well as Wi-Fi networks, where devices emit probes while searching for access

points. Beacons and Wi-Fi are particularly useful for determining location in indoor

environments, whereas GPS is predominantly employed for positioning across larger,

primarily outdoor areas.

Both types of location data, outdoor and indoor, are widely utilized by a range of ser-

vices. GPS, in particular, is well known for providing positioning information to millions

of users daily, assisting them in reaching destinations they are unfamiliar with. Modern

applications on smart devices frequently request user permission to collect location data,

and increasingly, websites also prompt users to share their location information. These

trends highlight the growing value and signiﬁcance of location data across multiple indus-

tries. Nevertheless, indoor localization and nearby environment positioning technologies


are not as widely recognized among consumers. Products such as Tracker Bluetooth

Beacons [139], including examples like Tiles and AirTags, have helped raise awareness

about Bluetooth technology’s role in generating and providing location data.

Despite this, a large portion of end-users remain unfamiliar with the potential beneﬁts

of indoor positioning data. The advantages of GPS coordinates are readily understood

through the widespread use of navigation applications and web-based maps. In contrast,

similar, easily accessible examples showcasing the beneﬁts of indoor geolocation data

are lacking from the consumer perspective. This gap can be attributed to the absence

of mainstream applications, comparable to navigation tools, that leverage indoor location

information to deliver practical services. Developing such applications would represent a

major advancement in location-based services and could oﬀer considerable value across

various sectors of modern society.

As the adoption of indoor positioning applications continues to expand, the volume of

generated data is also expected to increase. Eﬀectively retrieving and managing indoor

localization information is essential for conducting in-depth analysis and understand-

ing how this data can be utilized. Indoor positioning can assist critical infrastructures

and businesses in gaining insights into customer behavior, thereby enhancing marketing

strategies and optimizing resource allocation. For instance, the healthcare sector can

beneﬁt signiﬁcantly by improving the tracking of medical equipment and patients within

facilities, ultimately reducing delays in urgent situations. Indoor location services hold

the potential to transform a wide range of industries [140]. Previous research has fur-

ther explored the diverse applications of indoor positioning systems [141], emphasizing

their transformative impact across various sectors and the corresponding growth in data

generation.

6.1.2 Indoor Localization as an Area Network

To encourage broader adoption of indoor localization applications in today’s market

— and consequently enhance data generation and availability — the research community

should focus on promoting the concept of a simpliﬁed indoor localization architecture.

Speciﬁcally, indoor positioning can be viewed as a domain of location data exchange

between diﬀerent entities, operating similarly to area networks. Various deﬁnitions exist

for area networks, but generally, they are considered telecommunications networks that

connect devices within a speciﬁc geographic boundary. The scale of an area network can

vary widely, from a single room to an entire city, depending on its type and intended

purpose. Common types include Local Area Networks (LAN), Wide Area Networks (WAN),

Metropolitan Area Networks (MAN), and Personal Area Networks (PAN) [142, 143, 144].

Each type of area network is designed to optimize communication and resource shar-

ing within its respective scope, addressing the particular needs of connected devices and

users. This principle can similarly be applied to the ﬁelds of indoor localization and

positioning. Consequently, indoor localization can be conceptualized as a new form of

area network. This article proposes the introduction of the ‘Transactional Area Network’

(TAN). The term ‘Transactional’ has been selected to emphasize that data exchanges between

users can be interpreted as forms of transactions. By defining TAN and outlining its

key characteristics, future indoor localization systems could adopt a uniﬁed architectural

model. This, in turn, could accelerate the adoption of such applications and signiﬁcantly

boost data generation eﬀorts.

The term ‘Transactional Area Network’ does not currently have a deﬁnition in the

global scientiﬁc literature. It is introduced and conceptually framed within the context

of this article. The idea of Transactional Area Networks emphasizes the need to expand

localized networking solutions that address the unique challenges of indoor environments.

This approach seeks to encourage the development of new technologies and methodologies

aimed at achieving simple yet eﬀective implementations, with a particular focus on user

data exchange and generation. Through this proposal, the article aspires to stimulate

further research and innovation in the ﬁeld of indoor localization, ultimately enhancing

user experiences within indoor spaces.

As previously outlined, the proposed concept of a Transactional Area Network could

provide a foundation for future indoor localization frameworks and applications. The

following sections of the article review existing research in indoor positioning and lo-

calization (‘Literature Review’), introduce the conceptual framework of the Transactional

Area Network (‘TAN Conceptualization’), present a basic implementation of the TAN model

(‘Proof-of-Concept Implementation’), and conclude with a discussion of ﬁndings and key

insights (‘Conclusions’).

6.2 Literature Review

6.2.1 RSS, PDR and Filtering Techniques

To date, a substantial number of studies and publications have focused on indoor ge-

olocation, localization, and positioning systems. Numerous implementations have been

explored, and a variety of frameworks have been introduced, each addressing diﬀerent

aspects of proximity-based positioning services. Before delving into the speciﬁc contribu-

tions of this work, it is essential to review some of the most prominent existing solutions

and frameworks in the domain of indoor localization.

A widely recognized approach was presented by Zhuang et al., 2016 [140], which intro-

duced a smartphone-based indoor localization system supported by a Bluetooth Beacon

network deployed in enclosed environments. This method leveraged the features of Blue-

tooth Low Energy (BLE) and Received Signal Strength (RSS) measurements for location

detection, relying on a centralized architecture composed of a smart device, along with

multiple beacons placed throughout the area. The study proposed a hybrid localization

algorithm that integrated a channel-speciﬁc Polynomial Regression Model (PRM), channel-

speciﬁc Fingerprinting (FP), outlier detection techniques, and Extended Kalman Filtering

(EKF).

The outlined techniques are frequently employed in similar indoor localization ap-

proaches. Polynomial Regression [145] and Fingerprinting [146] are commonly combined

to estimate both the position of a user, and the distances between the user and surround-

ing Bluetooth Low Energy Beacons. Additionally, Outlier Detection and the Extended

Kalman Filter [147] are used to enhance the quality of localization data, particularly in

terms of distance and position accuracy. Although the proposed algorithm demonstrated

signiﬁcant improvements in localization precision, especially in environments equipped

with tracker BLE beacons, its reliance on hardware beyond standard consumer-grade

smart devices (i.e., requiring BLE beacon deployment) posed limitations for broad adop-

tion. Furthermore, the solution did not address the possibility of enabling users to access

or view the presence and information of others within the same system.
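The filtering step described above can be illustrated with a much simpler relative of the Extended Kalman Filter. The following is a minimal sketch of a linear, scalar Kalman filter that smooths a stream of RSSI readings; it is not the EKF formulation used in [140]. The function name and the values of `process_var` and `measurement_var` are illustrative assumptions.

```python
def kalman_smooth(measurements, process_var=0.05, measurement_var=4.0):
    """Smooth RSSI readings (dBm) with a scalar Kalman filter.

    process_var and measurement_var are assumed noise variances;
    in practice they would be tuned for the environment.
    """
    if not measurements:
        return []
    estimate = float(measurements[0])  # initial state: the first reading
    error_var = 1.0                    # initial uncertainty (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the signal is modeled as static, so only uncertainty grows.
        error_var += process_var
        # Update: blend the prediction and the new reading via the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed


if __name__ == "__main__":
    # The isolated -90 dBm spike is pulled back towards the -60 dBm level.
    print(kalman_smooth([-60, -60, -90, -60]))
```

A larger `measurement_var` makes the filter trust individual readings less, which suppresses outliers more strongly at the cost of a slower response to genuine movement.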

Another notable contribution in the ﬁeld is presented by Chen et al. [148], who

developed a smartphone-based indoor localization and tracking system, utilizing iBea-

cons. The study aimed to enhance several key aspects of Pedestrian Dead Reckoning

(PDR) on smartphones, including step detection, walking direction estimation, and initial

position determination. By combining smartphone-integrated sensors with BLE beacon

infrastructure, the proposed method achieved high-precision object localization. The sys-

tem was built upon Apple’s iBeacon technology and applied ﬁltering techniques (such

as Kalman Filtering) to optimize the accuracy of tracking data. User positions were up-

dated dynamically whenever they entered a deﬁned calibration zone with a three-meter

range. This work eﬀectively demonstrated the feasibility of using smartphones as central

components in indoor positioning systems. However, it is important to note that its core

functionality still depended on external BLE beacon hardware.
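The core of Pedestrian Dead Reckoning is a position update applied once per detected step. The sketch below shows that update in its simplest form, assuming a fixed step length and a heading measured clockwise from north; both conventions, as well as the function names, are illustrative assumptions and not taken from [148].

```python
import math


def pdr_step(position, heading_deg, step_length=0.7):
    """Advance an (x, y) position by one detected step.

    heading_deg is measured clockwise from north (the +y axis).
    step_length (metres) is an assumed constant.
    """
    x, y = position
    heading = math.radians(heading_deg)
    return (x + step_length * math.sin(heading),
            y + step_length * math.cos(heading))


def pdr_track(start, headings_deg, step_length=0.7):
    """Return the path produced by one step per heading sample."""
    path = [start]
    for heading in headings_deg:
        path.append(pdr_step(path[-1], heading, step_length))
    return path


if __name__ == "__main__":
    # Two steps north, then one step east.
    print(pdr_track((0.0, 0.0), [0, 0, 90], step_length=1.0))
```

Because each step adds its own error, the estimate drifts over time, which is why the reviewed systems periodically correct it with beacon-based calibration.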

In a study conducted by Zou et al. [149] in 2017, the authors proposed an Indoor

Positioning System (IPS) designed for navigation and tracking, utilizing the built-in sen-

sors of smartphones — such as accelerometers, gyroscopes, and magnetometers — to

estimate users’ movement and direction. The system was based on the Pedestrian Dead

Reckoning (PDR) method and incorporated a particle ﬁlter-based fusion technique. In

areas with limited Wi-Fi coverage, the weight of the particles was determined using iBea-

con measurements, while in areas with stronger Wi-Fi signals, the position estimates

were derived from Wi-Fi-based data. The proposed system demonstrated high accuracy

in estimating user movement and direction. However, it required external infrastructure,

specifically a modified Wi-Fi setup, beyond the use of standard consumer smartphones.

It is also worth noting that the study focused on tracking user movement within indoor

environments rather than enabling direct peer-to-peer positioning among users.

Another signiﬁcant contribution was made by Yadav et al. [150] in 2019, who ex-

plored the implementation of an IPS using a combination of iBeacon technology, Inertial

Measurement Units (IMUs), and Fingerprinting techniques. Instead of relying on the

commonly used k-Nearest Neighbors (kNN) algorithm in their map matching process, the

authors adopted Bayesian estimation to probabilistically determine the most likely ref-

erence points corresponding to the user’s position. The proposed framework integrated

both BLE beacon data and PDR to enhance localization accuracy. Additionally, the incor-

poration of a fuzzy logic-based Kalman Filter further improved the quality of positioning

results. Although the approach produced promising results in terms of indoor localization

accuracy, it required access to an external database, thereby introducing a dependency

on additional hardware beyond the smart device itself.

An important contribution was made by Dinh et al., 2020 [151], who proposed a

smartphone-based indoor positioning system that utilized BLE iBeacons and Fingerprint-

ing techniques. The system was built upon the principles of Pedestrian Dead Reckoning

[152], which, as already mentioned, estimates a user’s location by leveraging inertial

sensors such as accelerometers and gyroscopes. By continuously integrating step

counts and rotational movements, the system was able to compute the user’s displace-

ment and orientation from an initial reference point, enabling indoor navigation without

dependence on external signals like GPS. The authors structured their approach around

two core components. Firstly, they introduced a regression-based method for estimating

distances using Bluetooth Low Energy Received Signal Strength (BLE-RSS). This model

was designed to correlate the Received Signal Strength Indicator (RSSI) values with the

approximate distance between the beacon and the user’s device, providing a practical

means of enhancing localization accuracy. Specifically, the model is based on the logarithmic

least squares method, where the RSSI value (RSSI) at a distance λ is given by the equation:

$$\mathrm{RSSI} = \alpha + \beta \ln(\lambda) \tag{6.1}$$

Here, α and β are coefficients that are determined through the least squares fitting

of the collected RSSI data under both line-of-sight (LOS) and non-line-of-sight (NLOS)

conditions. These coefficients are calculated as follows:

$$\beta = \frac{n \sum_{i=1}^{n} (\mathrm{RSSI}_i \ln \lambda_i) - \sum_{i=1}^{n} \mathrm{RSSI}_i \sum_{i=1}^{n} \ln \lambda_i}{n \sum_{i=1}^{n} (\ln \lambda_i)^2 - \left( \sum_{i=1}^{n} \ln \lambda_i \right)^2} \tag{6.2}$$

$$\alpha = \frac{\sum_{i=1}^{n} \mathrm{RSSI}_i - \beta \sum_{i=1}^{n} \ln \lambda_i}{n} \tag{6.3}$$
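The logarithmic least squares fit can be sketched in a few lines of Python, with `alpha` as the intercept and `beta` as the slope of RSSI against the natural logarithm of distance. The function names and the synthetic calibration values are illustrative assumptions; a real deployment would fit the model to measured RSSI and distance pairs.

```python
import math


def fit_log_model(distances_m, rssi_dbm):
    """Fit RSSI = alpha + beta * ln(distance) by ordinary least squares."""
    n = len(distances_m)
    if n < 2 or n != len(rssi_dbm):
        raise ValueError("need at least two (distance, RSSI) pairs")
    logs = [math.log(d) for d in distances_m]
    sum_l = sum(logs)
    sum_r = sum(rssi_dbm)
    sum_lr = sum(l * r for l, r in zip(logs, rssi_dbm))
    sum_ll = sum(l * l for l in logs)
    denom = n * sum_ll - sum_l ** 2
    if denom == 0:
        raise ValueError("distances must not all be identical")
    beta = (n * sum_lr - sum_r * sum_l) / denom
    alpha = (sum_r - beta * sum_l) / n
    return alpha, beta


def estimate_distance(rssi, alpha, beta):
    """Invert the fitted model to obtain a distance estimate in metres."""
    return math.exp((rssi - alpha) / beta)


if __name__ == "__main__":
    # Synthetic calibration data (assumed values, for illustration only).
    distances = [1.0, 2.0, 4.0, 8.0]
    readings = [-59.0, -65.0, -71.2, -77.1]
    alpha, beta = fit_log_model(distances, readings)
    print(alpha, beta, estimate_distance(-68.0, alpha, beta))
```

Inverting the fitted model is what turns a live RSSI reading into the approximate distance used by the subsequent position estimation steps.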

In the initial phase of position estimation, once the approximate distances are deter-

mined using the regression model, the system proceeds with a calibration step. During

this process, the user is asked to remain stationary for a few seconds. While the user

stays still, the system collects BLE signals from the three closest beacons. These sig-

nals are then used to estimate the user’s initial location by applying a combination of

Line Intersection-Based Trilateration and a median ﬁltering technique. The trilateration

approach involves solving a set of equations based on the calculated distances from the

nearby beacons:

$$(x - x_i)^2 + (y - y_i)^2 = \lambda_i^2 \tag{6.4}$$

where $(x_i, y_i)$ are the coordinates of the beacons and $\lambda_i$ are the estimated distances from

the regression model. Simplifying these equations leads to:

$$x^2 + y^2 + a_i x + b_i y + c_i = 0 \tag{6.5}$$

where $a_i = -2x_i$, $b_i = -2y_i$ and $c_i = x_i^2 + y_i^2 - \lambda_i^2$. By solving these equations, the initial

position $(x, y)$ is determined.
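One common way to solve the three circle equations is to subtract the first from the other two. The quadratic terms cancel, leaving two straight lines whose intersection is the position estimate. The sketch below follows that approach; it is an illustration under this assumption and may differ in detail from the procedure in [151]. The function name is hypothetical.

```python
def trilaterate(beacons, distances):
    """Estimate (x, y) from three beacon positions and three distances.

    Uses the coefficients a_i = -2*x_i, b_i = -2*y_i and
    c_i = x_i^2 + y_i^2 - d_i^2 of the expanded circle equations.
    """
    if len(beacons) != 3 or len(distances) != 3:
        raise ValueError("exactly three beacons and distances are required")
    coeffs = [(-2 * bx, -2 * by, bx * bx + by * by - d * d)
              for (bx, by), d in zip(beacons, distances)]
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = coeffs
    # Subtracting circle 1 from circles 2 and 3 cancels x^2 + y^2,
    # leaving two lines of the form A*x + B*y = C.
    A1, B1, C1 = a2 - a1, b2 - b1, c1 - c2
    A2, B2, C2 = a3 - a1, b3 - b1, c1 - c3
    det = A1 * B2 - A2 * B1
    if abs(det) < 1e-12:
        raise ValueError("beacons are collinear; the lines do not intersect")
    # Cramer's rule for the 2x2 linear system.
    x = (C1 * B2 - C2 * B1) / det
    y = (A1 * C2 - A2 * C1) / det
    return x, y


if __name__ == "__main__":
    # A device at roughly (1, 2) seen by beacons at three room corners.
    beacons = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
    distances = [5 ** 0.5, 13 ** 0.5, 5 ** 0.5]
    print(trilaterate(beacons, distances))
```

With noisy distance estimates the three circles rarely meet at a single point, so the intersection of the two lines serves as a compromise estimate rather than an exact solution.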

Finally, to improve the accuracy of the initial position estimate, a median ﬁlter is ap-

plied to minimize the impact of outlier values. Directional information is obtained from

the mobile device’s built-in magnetometer, which is accessed via the device’s operating

system Software Development Kit (SDK), such as those provided by iOS [153] and Android

[154]. This sensor data is used to determine the direction of the user’s movement. Since

precise distance estimation is essential for indoor positioning, the method used in this

study enabled accurate identification of the starting point. Additionally, Dinh et al.

introduced an effective and dependable radio map that uses fingerprint-based positioning

to correct deviations caused by Pedestrian Dead Reckoning. Instead of relying on

high-resolution grids with a large number of reference points, their approach collects data

from a smaller set of points, thereby reducing the overall deployment time. Although it

presents a sophisticated solution to challenges such as eﬃciency and hardware require-

ments seen in earlier systems, one limitation of the method is its reliance on external

beacon devices.
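The median filtering step mentioned at the start of this paragraph can be sketched as follows, assuming that several position estimates are collected while the user stands still and that the filter is applied to each coordinate separately. The function name and this component-wise choice are illustrative assumptions.

```python
from statistics import median


def median_position(estimates):
    """Return the component-wise median of repeated (x, y) estimates."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    xs = [p[0] for p in estimates]
    ys = [p[1] for p in estimates]
    return (median(xs), median(ys))


if __name__ == "__main__":
    samples = [(1.0, 2.0), (1.1, 2.1), (9.0, -5.0), (0.9, 1.9), (1.0, 2.0)]
    # The outlier (9.0, -5.0) has no influence on the result.
    print(median_position(samples))
```

Unlike an average, the median ignores a small number of extreme estimates, which suits short calibration windows with occasional multipath-induced errors.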

Another signiﬁcant contribution was made by Tuan et al., 2021 [155], who proposed

a hybrid tracking framework that integrates PDR with Wi-Fi and iBeacon signal data.

Their approach is organized into three main components. First, the authors designed

conversion functions that translate Wi-Fi and iBeacon signals into estimated proximity

distances. To support indoor tracking, they also suggested a strategic deployment method

for placing iBeacons within designated areas. Second, an enhanced version of the PDR

method was introduced for use on smartphones, which leveraged signal data to estimate

an initial position and reﬁne the user’s path during movement. Finally, the proposed

algorithm was implemented in an iOS application to showcase its functionality. Despite

its promising design, the system relies on external beacon infrastructure, while also not

addressing the detection of multiple users within the environment.

Eﬀorts to further improve the precision of such positioning systems have also been

explored. For instance, Elgui et al., 2020 [156] focused on enhancing the reliability and

accuracy of BLE RSSI signal measurements. Additionally, Duong et al., 2021 [157] ex-

amined the performance of indoor localization systems based on BLE-RSS ﬁngerprinting

under static conditions. Their study was conducted in two stages: an ’offline’ phase,

during which RSSI data were collected, and an ’online’ phase, which used real-time

measurements combined with the previously gathered data to estimate user positions. The

effectiveness of their system was assessed across multiple real-world settings, with

particular attention to parameters such as the ideal value of k in the k-Nearest Neighbors

(kNN) algorithm and the optimal number of beacons required for accurate positioning.
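The two-phase fingerprinting scheme can be illustrated with a small k-Nearest Neighbors sketch. The radio map stands for the data gathered in the offline phase, and the observed readings for an online measurement. Euclidean distance in signal space and unweighted averaging of the k nearest reference points are assumptions made for illustration, not necessarily the configuration evaluated in [157]; all names and values are hypothetical.

```python
import math


def knn_locate(radio_map, observed, k=3):
    """Estimate a position from an RSSI fingerprint.

    radio_map: list of ((x, y), {beacon_id: rssi}) reference points.
    observed:  {beacon_id: rssi} measured in the online phase.
    """
    if not radio_map:
        raise ValueError("radio map is empty")

    def signal_distance(fingerprint):
        shared = fingerprint.keys() & observed.keys()
        if not shared:
            return math.inf  # no beacon in common: treat as infinitely far
        return math.sqrt(sum((fingerprint[b] - observed[b]) ** 2 for b in shared))

    ranked = sorted(radio_map, key=lambda entry: signal_distance(entry[1]))
    nearest = ranked[:k]
    x = sum(point[0] for point, _ in nearest) / len(nearest)
    y = sum(point[1] for point, _ in nearest) / len(nearest)
    return (x, y)


if __name__ == "__main__":
    radio_map = [
        ((0, 0), {"b1": -50, "b2": -70}),
        ((4, 0), {"b1": -70, "b2": -50}),
        ((0, 4), {"b1": -60, "b2": -80}),
        ((4, 4), {"b1": -80, "b2": -60}),
    ]
    print(knn_locate(radio_map, {"b1": -52, "b2": -71}, k=2))
```

The choice of `k` trades robustness against resolution: a small `k` follows the closest reference point, while a larger `k` averages over a wider area.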

6.2.2 Machine Learning and Ultra Wideband

An indoor localization system that employed Machine Learning and Deep Learning

techniques was presented by Abbas et al., 2019 [158]. The researchers developed

a system aimed at delivering high accuracy and reliability, even in environments with

signiﬁcant signal noise. Their approach combined a deep learning model with denoising

autoencoders, integrated within a probabilistic framework designed to manage noise in

received Wi-Fi signals. The objective was to capture the complex relationship between

signals from Wi-Fi access points, and the device’s actual location.

Further investigation into enhancing indoor localization through Machine Learning

was carried out by Njima et al., 2022 [159]. In their study, the researchers addressed

the challenge of limited datasets by applying Generative Adversarial Networks (GANs) for

data augmentation. Their work focused on scenarios involving missing indoor localization

data, distinguishing between cases with available unlabeled data and cases where no un-

labeled data were present. In the ﬁrst scenario, they proposed a weighted semi-supervised

approach based on deep neural networks, incorporating pseudo-labeling techniques that

combined a small number of real labeled samples, with a larger set of low-cost pseudo-

labeled data. This method improved accuracy, while minimizing the need for additional

labeled data. In the second, more constrained scenario, where no unlabeled data were

available, they introduced a solution that generated synthetic ﬁngerprints using GANs. A

deep neural network was then trained on a dataset composed of both real and generated

samples, applying a similar weighting strategy as in the ﬁrst case, aiming to enhance

prediction performance and reduce overﬁtting.

Several survey studies have also been carried out to compile and analyze the majority

of indoor localization methods and use cases. One such example is the work by Yang

et al., 2021 [160], which explored key aspects such as accuracy, scalability, stability,

reliability, and algorithmic complexity across a wide range of existing and emerging lo-

calization techniques. The survey provides an in-depth comparison of these methods,

oﬀering valuable insights into practical implementation models, and aiming to improve

localization accuracy while reducing system complexity. Other notable surveys address-

ing similar themes have been published by Zafari et al. [161], Jang et al. [162], Subedi et

al. [163], and Mallik et al. [164].

Ultra-Wideband (UWB) technology has also been explored in various indoor localiza-

tion applications. Notably, Apple has adopted UWB to deliver accurate distance estimation

between its devices. Speciﬁcally, the U1 chip developed by Apple enables high-precision

distance measurements between compatible devices [165]. However, a key limitation is

that only devices equipped with a ’U’ series chip can support UWB-based localization.

Future advancements by Apple and other manufacturers may aim to overcome this

restriction. A notable contribution to the field was made by Prorok and Martinoli in 2013

[166], who proposed a UWB localization technique designed to address error patterns

through spatial and multimodal modeling. Their method employed tessellated maps and

incorporated relative positioning within a particle ﬁlter framework, in order to enhance

accuracy. Tests involving mobile robots equipped with UWB transmitters demonstrated

the system’s ability to achieve precise indoor localization.

A notable and frequently cited work in the Ultra-Wideband domain is the survey by

Alariﬁ et al., 2016 [167]. This study presented a comprehensive overview of the in-

door positioning technologies available at the time, with a particular focus on UWB. It

oﬀered a detailed comparative analysis, including a ’SWOT’ (Strengths, Weaknesses, Op-

portunities, Threats) assessment of UWB, emphasizing its advantages and potential for

addressing indoor localization challenges. The survey also introduced taxonomies and

summarized recent advancements, while suggesting directions for future research.

More recent developments include a proposal by Mayer et al., 2023 [168], which in-

troduced a self-sustaining indoor localization system based on UWB technology for smart

mobile sensor nodes. These nodes were designed for compactness and long-term op-

eration, reducing the need for frequent battery replacement. The system employed an

event-driven sensing mechanism to manage energy consumption eﬃciently, and inte-

grated heterogeneous design elements with onboard processing. As a result, it achieved

high localization accuracy with minimal infrastructure requirements.

Additionally, Hapsari et al., 2024 [169], conducted a review focused on the evolution

of UWB-based indoor tag localization technologies from 2017 to 2022, covering appli-

cations in positioning, tracking, and navigation. Their study employed both Systematic

Mapping Study (SMS) and Systematic Literature Review (SLR) methodologies to explore

trends, challenges, and research opportunities. Key topics included the use of machine

learning, ﬁltering techniques, and sensor fusion, along with performance metrics such as

accuracy, scalability, and energy eﬃciency. The paper also proposed a taxonomy of op-

timized metrics and outlined future research directions, positioning itself as an updated

continuation of Alariﬁ’s foundational survey of 2016.

The wide variety of indoor positioning solutions discussed in the outlined literature

offers several benefits to adopters. However, challenges such as hardware dependencies,

infrastructure demands, environmental limitations, and potential concerns regarding

energy consumption emphasize the ongoing need for further research in this domain.

A Transactional Area Network (TAN) could provide a standardized and uniﬁed framework,

capable of integrating diverse methods and technologies. TAN may serve as a founda-

tional policy for developing new solutions, experimenting with cutting-edge algorithms,

and advancing the broader objective of decentralized indoor positioning. Additionally, it

could promote increased data exchange, while also facilitating greater data generation

within indoor environments.

6.3 TAN Conceptualization

6.3.1 Scope and Characteristics

In order to support the development of future indoor localization systems within a

uniﬁed architectural framework, the Transactional Area Network should be built upon

a core concept and a deﬁned set of key characteristics that shape its functionality and

purpose. Any new indoor positioning solution could align with the principles of TAN and,

as a result, be classiﬁed under this model. The rationale behind TAN has been brieﬂy

introduced in the opening section of this chapter. First, there is currently no widely

adopted indoor localization standard. Second, many existing systems lack support for

peer-to-peer interactions, limiting users from sharing data, or exchanging relative location

information. This absence of interconnectivity contributes to lower user familiarity with

indoor localization tools, when compared to widely used GPS-based services. Third, these

factors result in a general scarcity of indoor positioning data. Future developers should

consider these challenges when designing new applications. The introduction of TAN

directly addresses these gaps, aiming to promote broader adoption and innovation in the

ﬁeld.

Figure 6.1. High-level overview of the Transactional Area Network’s implementation within

an indoor environment [5].

As discussed in the Introduction section of this chapter, the term ‘Transactional Area

Network’ has not yet been formally deﬁned within the global research community. The

word ‘Transactional’ is used in this context to reflect the nature of indoor localization,

which facilitates interactions, mutual influence, and the exchange of data between

entities. Indoor positioning systems inherently depend on information exchange among

multiple devices, rather than functioning as isolated units. This continuous interaction

among entities can be understood as a form of transaction, thereby justifying the use of

the word ‘Transactional’. Furthermore, indoor localization can be classiﬁed as an ‘Area

Network’ because it connects devices operating within a conﬁned spatial region—namely,

indoor environments. Based on these considerations, this work adopts the term ‘Trans-

actional Area Network’ to describe the proposed framework. Accordingly, TAN can be

deﬁned as follows: “A Transactional Area Network is a computer network that inter-

connects devices within indoor locations, aiming to enhance entity interaction, mutual

inﬂuence and data exchange”.

Figure 6.2. A focused overview of the user communication in a TAN, showcasing the data-

exchange pipelines between two users, as well as the existence of external hardware for

optional assistance [5].

Based on the proposed deﬁnition, a Transactional Area Network should support

streamlined and decentralized indoor positioning processes, allowing end-users to beneﬁt

directly from its functionalities. Within this network, each instance of location, or other

data exchange between two users is referred to as a ‘transaction’. Accordingly, every suc-

cessful transmission and reception of information constitutes a transaction, with each

new data exchange representing a distinct event. To promote widespread adoption, it is

important to limit the reliance on additional hardware. Although TAN may incorporate

external devices, such as Bluetooth beacons (refer to Figures 6.1 and 6.2), the network

should primarily operate through users’ personal devices. As TAN becomes more widely

adopted, the volume of generated data will naturally grow. Consequently, the network

must be capable of forwarding this data to an external database located beyond the local

environment.

Figure 6.1 illustrates a conceptual representation of a TAN’s operational structure,

while Figure 6.2 focuses on the communication ﬂows among users within the network.

Drawing from the preceding discussion, the core characteristics of a Transactional Area

Network can be summarized as follows:

•Peer-to-Peer Design: The network should emphasize direct data exchange between

users, supporting a decentralized architecture.

•Minimal Hardware: To maintain a decentralized structure, the network should pri-

marily rely on users’ personal devices, such as smartphones and smartwatches.

External components, like Bluetooth beacons, should only be employed when addi-

tional positioning information is deemed necessary.

•Data Collection: Positioning data generated by each user should be collected and

forwarded to an external storage system, located outside the local network environ-

ment.

•Data Comprehensibility: As indoor positioning data holds value primarily within its

respective context, the gathered information should be interpretable and capable of

oﬀering useful insights about the environment.
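The notion of a ‘transaction’ and the Data Collection characteristic can be made concrete with a small record type. The sketch below shows one possible shape for a single transaction, serialized to JSON so that it could be forwarded to an external store. All field names are hypothetical and chosen for illustration; the source does not prescribe a schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """One successful data exchange between two users (hypothetical schema)."""
    sender_id: str
    receiver_id: str
    timestamp: str                      # ISO 8601, UTC
    distance_m: Optional[float] = None  # estimated distance, if available
    payload: str = ""                   # optional message or metadata


def new_transaction(sender_id, receiver_id, distance_m=None, payload=""):
    """Create a transaction stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return Transaction(sender_id, receiver_id, now, distance_m, payload)


def to_json(transaction):
    """Serialize a transaction for forwarding to external storage."""
    return json.dumps(asdict(transaction), sort_keys=True)


if __name__ == "__main__":
    print(to_json(new_transaction("user-a", "user-b", distance_m=2.4, payload="hello")))
```

Keeping each exchange as a self-describing record supports the Data Comprehensibility characteristic, since every stored entry carries the identifiers and timing needed to interpret it later.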

Before delving into the technical aspects of a user’s device operating within the Transactional

Area Network, it was essential to first establish the foundational criteria that define TAN’s

design. These core characteristics provide the groundwork for specifying how a user’s

device should operate in this context. Key functionalities include acting as a virtual

beacon for estimating distances, initiating peer-to-peer sessions for data exchange, and

pairing signal information with user identiﬁers to enable accurate and integrated data

handling.

6.3.2 Technical Details

Given the outlined criteria, the implementation of a TAN should primarily utilize the

personal devices of individual users, such as smartphones and smartwatches. The main

objective is to provide users with real-time indoor positioning information about them-

selves and others nearby. Therefore, TAN must support user-to-user localization without

relying on external hardware, like Bluetooth beacons or dedicated trackers. In this con-

text, each user’s device functions as their personal tracker. As previously discussed, the

use of additional hardware should be minimized. However, the TAN architecture may

still support optional external components (such as beacons) that oﬀer supplementary

information, for example, identifying a speciﬁc building ﬂoor or indoor area. These ele-

ments, while potentially useful, must remain auxiliary and not essential to the network’s

operation. Maintaining TAN’s peer-to-peer design requires that the network primarily

depend on users’ own devices. Accordingly, participation in a TAN should only require

the installation of a lightweight software application. Ideally, such functionality would

be natively integrated into the device’s operating system, similar to existing GPS-based

services. Since the personal device is the main hardware component within a TAN, its

technical capabilities warrant closer examination.

The core indoor positioning functions of a TAN will rely on Bluetooth Low Energy and

Wi-Fi signals. Both iOS and Android — the dominant operating systems in today’s mobile

ecosystem — provide BLE-based solutions for enabling proximity-aware services. Ap-

ple’s iBeacon framework [170] enables interaction between devices using BLE, a wireless

protocol introduced in Bluetooth 4.0 that supports energy-eﬃcient, short-range commu-

nication. iBeacon, launched as part of iOS 7, makes use of the ’Core Location’ and ’Core

Bluetooth’ frameworks to identify and interact with BLE signals. On the Android side,

Eddystone [171], developed by Google, provides a similar open-source BLE beacon format.

Eddystone supports broadcasting multiple frame types, allowing transmission of URLs,

telemetry data, and other contextual information to nearby devices. Although Google

oﬃcially discontinued support for Eddystone in 2018, the format remains available for

continued development. Notably, iOS devices can operate as beacons themselves [172],

mirroring Android’s ability to leverage Eddystone for creating location-based experiences,

without requiring dedicated beacon hardware.

Regarding the utilization of Wi-Fi signals, frameworks for both iOS and Android plat-

forms are available. ’Multipeer Connectivity’ [173] is a framework introduced by Apple

for iOS and macOS with the release of iOS 7. It enables devices to communicate and

exchange data directly over Wi-Fi or Bluetooth, without requiring an internet connection.

The framework supports peer-to-peer communication, allowing nearby devices to form

ad-hoc or mesh networks dynamically, based on available network infrastructure. Mul-

tipeer Connectivity allows for seamless discovery, connection, and interaction between

devices, making it suitable for proximity-based applications. On the Android platform,

’Nearby Connections API’ [174], developed by Google, provides similar functionality. It

allows Android devices to discover each other and establish connections over Wi-Fi and

Bluetooth, enabling real-time data exchange and collaboration. Nearby Connections API

abstracts much of the underlying complexity involved in device discovery and connection

management, oﬀering a streamlined solution for developers. Additionally, it is worth not-

ing that, until December 2023, Google maintained support for the ’Nearby Messages API’

[175], which also allowed Android devices to discover and communicate using a combi-

nation of BLE and Wi-Fi. However, the Nearby Messages API has since been deprecated,

and developers are encouraged to migrate to the Nearby Connections API for continued

functionality and support.

A Transactional Area Network should integrate these operating system-speciﬁc frame-

works, in order to broadcast and receive BLE and Wi-Fi signals eﬀectively. The architec-

ture must combine their core functionalities to facilitate the transmission of both signal

types, along with accompanying user data, from one device to another. Consequently,

the indoor localization system within a TAN must encompass, at minimum, the following

key features:

•Virtual Beacon Conﬁguration: Each user’s device should operate as a virtual bea-

con, broadcasting key beacon parameters (such as UUID, Major/Minor values, or

Instance ID) via Bluetooth Low Energy (BLE) signals. Simultaneously, it should

scan for signals from nearby devices, and estimate their distance using the received

signal strength indicator (RSSI).

•Peer-to-Peer Session Establishment: Devices should initiate ad-hoc, peer-to-peer

communication sessions, using frameworks like Multipeer Connectivity or Nearby

Connections, to enable direct data exchange among nearby devices. This commu-

nication should rely on Wi-Fi signals, with Bluetooth support, and function without

requiring internet access.

•Signal Pairing and User Identification: Users within the TAN should be able

to access proximity (distance) data and additional information about other users, such

as names and proﬁle images. Since BLE broadcasting and peer-to-peer communi-

cation represent two separate information channels, a robust method is required to

accurately associate signals with users. This ensures that each device can correctly

match (and present) the relevant information for each detected user.

•External Database Connectivity: Devices should also support interaction with an

external database, allowing them to transmit all received data for storage or further

processing.
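The signal pairing requirement in the list above can be sketched as a simple join between the two channels. The example assumes that each peer announces its own Major/Minor pair over the peer-to-peer session, so that a BLE sighting can be matched to a user profile; this pairing mechanism, the function name, and the dictionary keys are assumptions made for illustration.

```python
def pair_signals(ble_sightings, peer_profiles):
    """Attach user profiles to BLE distance estimates.

    ble_sightings: list of {"major": int, "minor": int, "distance_m": float}
                   obtained from beacon scanning.
    peer_profiles: {(major, minor): {"name": str}} announced by peers over
                   the peer-to-peer session (assumed mechanism).
    """
    paired = []
    for sighting in ble_sightings:
        key = (sighting["major"], sighting["minor"])
        profile = peer_profiles.get(key)
        if profile is None:
            # Beacon heard, but no peer session has identified it yet.
            continue
        paired.append({"name": profile["name"],
                       "distance_m": sighting["distance_m"]})
    # Nearest users first, as a user interface would likely present them.
    return sorted(paired, key=lambda entry: entry["distance_m"])


if __name__ == "__main__":
    sightings = [
        {"major": 1, "minor": 7, "distance_m": 3.2},
        {"major": 2, "minor": 9, "distance_m": 1.1},
        {"major": 5, "minor": 5, "distance_m": 0.4},  # unknown device
    ]
    profiles = {(1, 7): {"name": "Alice"}, (2, 9): {"name": "Bob"}}
    print(pair_signals(sightings, profiles))
```

Sightings without a matching profile are skipped rather than shown anonymously, which is one possible policy; an implementation could equally keep them as unidentified devices.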

If software deployed on users’ devices were to meet all the speciﬁed criteria, TAN

could be eﬀectively implemented in indoor settings, supporting decentralized distance

measurement and data exchange between the entities (see Figure 6.3). Each device would

function as the host of its own peer-to-peer communication session, with other detected

devices acting as guest peers. At the same time, these guest devices would also serve

as hosts for their own sessions. As a result, a Transactional Area Network could allow

individuals to identify others nearby, and potentially foster new social connections. If

implemented, this network could be tested across various indoor applications. Moreover,

it may prove especially valuable in human rescue operations, particularly in life-saving

situations during disasters.

Figure 6.3. A Transactional Area Network operates across multiple devices in indoor

environments. Each device transmits and receives BLE beacon signals and initiates

peer-to-peer sessions by also broadcasting and receiving Wi-Fi signals [5].

6.3.3 Use Cases

One practical use case of a Transactional Area Network is its ability to help users

determine their location relative to other entities within a conﬁned space, such as indoor

environments or nearby facilities. In this scenario, users would be able to see other in-

dividuals in their vicinity, and track their real-time distance. Additionally, users could

communicate with nearby entities through messages, and identify their positions within

the area using TAN’s distance (and potentially direction) data. This network could en-

hance interpersonal interactions in indoor settings like restaurants, social gatherings,

cafeterias, and more. Thus, TAN would provide real-time proximity information and com-

munication features, fostering social engagement in indoor spaces.

Another important use case for a Transactional Area Network is its potential to as-

sist professionals, such as ﬁreﬁghters and rescue teams, in life-saving operations during

emergencies like earthquakes or ﬂoods. In such critical situations, TAN could help rescue

teams navigate debris from collapsed structures and locate trapped individuals. Using

their devices, rescuers could receive real-time data on the survivors’ relative distances

(and ideally directions), which would signiﬁcantly improve the eﬃciency and speed of the

rescue process. Furthermore, the network could potentially enable communication be-

tween rescuers and survivors, allowing for the exchange of vital information regarding the

condition and needs of those trapped. This functionality would be especially useful when

traditional communication systems are unavailable, providing a decentralized method

for connecting rescuers and survivors, thereby enhancing the eﬀectiveness of life-saving

eﬀorts.

An additional use case for TAN is monitoring traﬃc within a company’s premises (such

as buildings or oﬃces), in order to track the number of employees present on a daily

basis. Data engineers can derive valuable insights from this data, including identifying

trends in employee behavior, such as peak lunch and break times. For companies with

multiple locations, the indoor positioning data collected by each device within TAN can be

combined with GPS coordinates from the device. This would enable data engineers (with

access to the TAN database) to associate speciﬁc data points with particular premises.

However, additional considerations must be taken into account, such as determining

which ﬂoor each user is on at any given time. This could be validated through the use of

external hardware, such as beacons positioned on each ﬂoor. While TAN’s primary design

is decentralized, relying mainly on end-user devices, the limited use of external beacons

is acceptable for improving accuracy.
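The kind of insight described in this use case, such as identifying peak presence times, can be illustrated with a short aggregation over collected records. The sketch assumes that each record is a (user identifier, ISO 8601 timestamp) pair; this record layout and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def occupancy_by_hour(records):
    """Count distinct users observed in each hour of the day.

    records: iterable of (user_id, iso_timestamp) pairs (assumed layout).
    """
    users_per_hour = defaultdict(set)
    for user_id, timestamp in records:
        hour = datetime.fromisoformat(timestamp).hour
        users_per_hour[hour].add(user_id)
    return {hour: len(ids) for hour, ids in sorted(users_per_hour.items())}


if __name__ == "__main__":
    records = [
        ("u1", "2025-03-10T12:05:00"),
        ("u2", "2025-03-10T12:40:00"),
        ("u1", "2025-03-10T12:50:00"),
        ("u3", "2025-03-10T09:00:00"),
    ]
    print(occupancy_by_hour(records))
```

Counting distinct users rather than raw records avoids inflating the figures for devices that report several times within the same hour.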

These three use case scenarios highlight the practical applications and value of the

Transactional Area Network. Whether enhancing social interactions in indoor spaces,

supporting emergency rescue operations, or optimizing workplace analytics, TAN presents

a viable solution for real-time, decentralized data exchange. By primarily depending

on users’ personal devices and minimizing reliance on external hardware, TAN has the

potential to enhance user experience and improve operational eﬃciency across various

settings.

6.4 Proof-of-Concept Implementation

6.4.1 Analysis

To demonstrate the core functionality of the Transactional Area Network, the research

team developed a basic prototype using the Swift programming language. Swift [176] is a

modern language mainly used for building applications on Apple platforms, including iOS,


macOS, and watchOS. It was introduced as a more user-friendly and secure alternative

to Objective-C, which had been the primary language for Apple development. The imple-

mented software makes use of Apple’s iBeacon and Multipeer Connectivity frameworks,

enabling Bluetooth Low Energy signal broadcasting and peer-to-peer communication over

Wi-Fi.

The initial step in the system involves setting up a virtual beacon. To achieve this, the

device begins by broadcasting a UUID along with speciﬁc ‘Major’ and ‘Minor’ values. These

values remain unchanged while the device is active in the network. A UUID is ﬁrst deﬁned

and shared among all participating devices in the session. To ensure uniqueness, each

device assigns itself random Major and Minor values within the range of 1 to 500,000.

Before broadcasting, the device scans for nearby devices and compares their Major/Minor

combinations. If a match is detected, the device generates a new pair of values and repeats

the check until it ﬁnds a unique combination. Once uniqueness is conﬁrmed, the device

starts broadcasting and becomes part of the network (see Figure 6.4). While in typical

beacon implementations, the Major value is used to group devices under the same UUID,

this proof-of-concept assigns both Major and Minor values uniquely to each device, in

order to ensure clear identiﬁcation within the network.

// Loop over the beacons detected so far; each item is called 'beacon'.
for beacon in beacons {
    let major = Int(beacon.major.uint16Value)
    let minor = Int(beacon.minor.uint16Value)
    // Regenerate until neither value collides with this beacon
    // and the two values differ from each other.
    while number1 == major || number2 == minor || number1 == number2 {
        if number1 == major {
            print("Common Major value found. Changing the value.")
            number1 = Int.random(in: 1...500000)
        }
        if number2 == minor {
            print("Common Minor value found. Changing the value.")
            number2 = Int.random(in: 1...500000)
        }
        if number1 == number2 {
            // Without this branch the loop would never end when only this condition fails.
            number2 = Int.random(in: 1...500000)
        }
    }
}

Figure 6.4. Swift code snippet that veriﬁes and ensures the uniqueness of Major/Minor

values [5].
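The uniqueness check can also be expressed against the full set of observed Major/Minor pairs at once, so that a regenerated pair is never accepted if it collides with a beacon seen earlier in the scan. The following is a minimal sketch under stated assumptions, not the prototype's code: `BeaconIdentity` and `pickUniqueIdentity` are hypothetical names, and values are drawn from 1...65535 because Core Location exposes Major and Minor as 16-bit unsigned integers, which differs from the 1 to 500,000 range quoted in the text.

```swift
import Foundation

/// A Major/Minor pair as carried in an iBeacon advertisement (hypothetical type).
struct BeaconIdentity: Hashable {
    let major: UInt16
    let minor: UInt16
}

/// Draws random Major/Minor values until the pair is absent from `taken`.
/// Returns nil if no free pair is found within `maxAttempts` draws.
func pickUniqueIdentity(
    avoiding taken: Set<BeaconIdentity>,
    maxAttempts: Int = 1_000
) -> BeaconIdentity? {
    for _ in 0..<maxAttempts {
        let candidate = BeaconIdentity(
            major: UInt16.random(in: 1...UInt16.max),
            minor: UInt16.random(in: 1...UInt16.max)
        )
        if !taken.contains(candidate) {
            return candidate
        }
    }
    return nil
}

// Pairs already observed during the scan (illustrative values).
let observed: Set<BeaconIdentity> = [
    BeaconIdentity(major: 172, minor: 8376),
    BeaconIdentity(major: 453, minor: 3221)
]
let mine = pickUniqueIdentity(avoiding: observed)
```

Comparing whole pairs rather than each value separately is a design choice of this sketch; it keeps far more combinations available, at the cost of allowing two devices to share a Major value.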

To join the session, the application ﬁrst checks whether Bluetooth is enabled on

the device, and whether it is capable of detecting nearby signals (see Figure 6.5). If

these conditions are satisﬁed, the application deﬁnes a beacon region which is known as

‘CLBeaconRegion’. Within this region, it begins scanning for other devices that share the

same UUID. This initiates the ‘ranging’ phase, during which the software identiﬁes and

logs all nearby devices broadcasting as beacons with the shared UUID. At the same time,

it estimates the distance between devices. iBeacon performs this estimation, referred to

as ‘accuracy’, by using the received signal strength indicator (RSSI) from each beacon,


along with the beacon’s known transmit power (TxPower) [177]. The estimated distance

is calculated using a logarithmic path loss model, represented by the following equation:

$$\text{Distance} = 10^{\frac{\text{TxPower} - \text{RSSI}}{10 \times n}} \tag{6.6}$$

where:

• RSSI is the measured strength of the signal received from the iBeacon. It is typically reported in decibels relative to one milliwatt (dBm). The RSSI value decreases as the distance between the iBeacon and the receiver increases.

• TxPower (Transmit Power) is a calibrated constant that represents the expected RSSI at a 1-meter distance from the beacon (in dBm). It is specific to each beacon and is typically provided by the manufacturer or determined during the calibration process.

• Path Loss Exponent (n) is a factor that represents the rate at which the signal attenuates as it propagates through the environment. The value of n varies with the physical characteristics of the environment. For example, in free space, n is typically 2, whereas in more obstructed environments, n can be 3 or 4.

For example, if an iBeacon has a TxPower of −59 dBm and the measured RSSI at the receiving device is −70 dBm in an environment where the path loss exponent n is 2, the estimated distance would be calculated as follows:

$$\text{Distance} = 10^{\frac{-59 - (-70)}{10 \times 2}} = 10^{\frac{11}{20}} \approx 3.55\ \text{m} \tag{6.7}$$

The logarithmic path loss model oﬀers a reasonably accurate estimate of distance in

most indoor settings. The collected distance data can be further processed to improve

result accuracy and, if needed, to support techniques such as triangulation for deriving

directional information.
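The path loss model of Equation 6.6 translates directly into a small function. The sketch below is illustrative only: the function name `estimatedDistance` and its parameter labels are assumptions made here, and the prototype itself relies on the ‘accuracy’ value reported by the framework rather than on code like this.

```swift
import Foundation

/// Estimates the distance in meters with the logarithmic path loss model:
/// distance = 10 ^ ((txPower - rssi) / (10 * n)).
/// `txPower` and `rssi` are in dBm; `n` is the path loss exponent.
func estimatedDistance(txPower: Double, rssi: Double, pathLossExponent n: Double) -> Double? {
    guard n > 0 else { return nil }  // a non-positive exponent has no physical meaning here
    return pow(10.0, (txPower - rssi) / (10.0 * n))
}

// Values from the worked example: TxPower = -59 dBm, RSSI = -70 dBm, n = 2.
if let distance = estimatedDistance(txPower: -59, rssi: -70, pathLossExponent: 2) {
    print(String(format: "Estimated distance: %.2f m", distance))
}
```

When the RSSI equals TxPower the function returns exactly 1 m, which matches the definition of TxPower as the expected RSSI at a 1-meter distance.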

if CLLocationManager.isMonitoringAvailable(for: CLBeaconRegion.self) {
    // Match all beacons with the specified UUID.
    // Create the region and begin monitoring it.
    let region = CLBeaconRegion(proximityUUID: uuid, identifier: localBeaconId)
    self.locManager.startMonitoring(for: region)
    // Start ranging only if the feature is available.
    if CLLocationManager.isRangingAvailable() {
        locManager.startRangingBeacons(in: region)
        // Store the beacon region so that ranging can be stopped on demand.
        beaconsToRange.append(region)
    }
    locationManager(locManager, didRangeBeacons: beacons, in: region)
}

Figure 6.5. Swift code snippet that checks the device’s ability to transmit as a beacon [5].

The next phase involves the device initiating a peer-to-peer communication session using the Multipeer Connectivity (MC) framework. After the device successfully begins operating as a virtual beacon, the software compiles three key components required to establish an MC session: i) the device’s Major/Minor identifier pair used for beaconing, ii) the user’s name, and iii) a user-selected image. In addition to these elements, certain other properties are needed to enable the MC session. A critical requirement is a unique identifier for each device, referred to as ‘myPeerId’. This identifier ensures each device is distinct within the session, helping prevent data transfer issues or conflicts. Two more properties are also initialized: one to broadcast the device’s presence (‘serviceAdvertiser’, see Figure 6.6) and one to search for other devices (‘serviceBrowser’, see Figure 6.7).

Using the serviceAdvertiser, the device continuously sends out its availability to nearby

devices (peers) and accepts connection invitations. Meanwhile, the serviceBrowser allows

the device to discover new peers, initiate connection requests, and receive notiﬁcations

when peers disconnect from the session. Together, these components handle the joining

and leaving of devices in the session. Ultimately, the Multipeer Connectivity framework

enables smooth and direct data sharing between all connected devices, functioning as a

peer-to-peer, decentralized communication channel.

// Invitation from another peer
func advertiser(_ advertiser: MCNearbyServiceAdvertiser,
                didReceiveInvitationFromPeer peerID: MCPeerID,
                withContext context: Data?,
                invitationHandler: @escaping (Bool, MCSession?) -> Void) {
    print("Received an invitation from Peer with ID: \(peerID)")
    // Accept the invitation and provide the session
    invitationHandler(true, self.session)
}

Figure 6.6. ServiceAdvertiser receives an invitation to join another peer’s session [5].

// Finding a new MC peer
func browser(_ browser: MCNearbyServiceBrowser, foundPeer peerID: MCPeerID,
             withDiscoveryInfo info: [String: String]?) {
    print("Peer Found: \(peerID)")
    var peerExists = false
    // Check if the peer is in the peer list
    for pp in peers {
        if pp.id == peerID.displayName {
            peerExists = true
            break
        }
    }
    // If not, invite the peer to the session
    if !peerExists {
        if peerID.displayName != ID {
            print("Peer does not exist in other sessions. Peer invited: \(peerID)")
            self.delegate?.savePeer(peerID: peerID)
            browser.invitePeer(peerID, to: self.session, withContext: nil, timeout: 10)
        }
    }
}

Figure 6.7. ServiceBrowser locates a peer and checks if they already exist in the device’s

list with known peers [5].


During a Multipeer Connectivity session, each device can send data to all connected peers. In this proof-of-concept system, the exchanged data includes the user’s name and image. Every time a device sends data, it also includes its unique peer ID. As a result, all shared data within the session is labeled with the sender’s peer ID. To store and manage data received from other devices, a list called ‘peers’ is used. This list holds objects of a custom type named ‘aPeer’, which was created specifically for this project. Each ‘aPeer’ object contains four pieces of information: i) the peer ID, ii) the user’s name, iii) their selected image, and iv) the distance from the current device (based on the ‘accuracy’ value). In this way, the peers list keeps one object per nearby device.

When the system receives data from other devices, it ﬁrst checks if the device is already

listed. If it is not, a new ‘aPeer’ object is created using the received peer ID, name, and

picture, and added to the list. If the peer is already known, the existing object is updated

with any new data. It is also possible to create a peer object with missing information,

allowing the rest to be added later during the session. For example, if a device starts

sharing its data, the name might arrive before the picture, due to network delays. In such

a case, the system creates a peer object with the peer ID and name. When the image

arrives later, the system ﬁnds the existing object and ﬁlls in the missing picture.
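The create-or-update behavior described above, including records that start out incomplete, can be sketched as follows. `PeerRecord`, `PeerUpdate`, and `upsert` are hypothetical names chosen for this illustration; they stand in for the ‘aPeer’ type and are not taken from the prototype.

```swift
import Foundation

/// Hypothetical stand-in for the 'aPeer' type; every field except the ID may arrive later.
struct PeerRecord: Equatable {
    let id: String
    var name: String?
    var imageData: Data?
    var distance: Double?
}

/// Fields carried by one incoming message; nil means "not part of this message".
struct PeerUpdate {
    let id: String
    var name: String? = nil
    var imageData: Data? = nil
}

/// Adds a new record for an unknown peer, or fills in the fields that the
/// message carries on the record that already exists.
func upsert(_ update: PeerUpdate, into peers: inout [PeerRecord]) {
    if let index = peers.firstIndex(where: { $0.id == update.id }) {
        if let name = update.name { peers[index].name = name }
        if let image = update.imageData { peers[index].imageData = image }
    } else {
        peers.append(PeerRecord(id: update.id, name: update.name,
                                imageData: update.imageData, distance: nil))
    }
}

// The name arrives first; the image follows in a later message.
var knownPeers: [PeerRecord] = []
upsert(PeerUpdate(id: "172/8376", name: "Daniel Adams"), into: &knownPeers)
upsert(PeerUpdate(id: "172/8376", imageData: Data([0x89, 0x50, 0x4E, 0x47])), into: &knownPeers)
```

Modeling the late-arriving fields as optionals makes the "partially known peer" state explicit, so the user interface can show a placeholder until the image arrives.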

However, a key challenge in this system is linking each device’s beacon signal, which

provides relative distance to other peers, with the peer-to-peer session data, speciﬁcally

the user’s name and image. Since the Bluetooth Low Energy and Multipeer Connectivity

signals operate on diﬀerent protocols, the system must ﬁnd a way to associate them

correctly. To achieve this, the software must identify which BLE and MC signals come

from the same device. The pairing process works as follows (also depicted in Figure 6.8):

Each BLE signal provides two unique identiﬁers, Major and Minor values. These values

are combined into a single string, using a slash (‘/’) as a separator. For instance, if a

device has a Major value of ‘172’ and a Minor value of ‘8376’, the combined identiﬁer

becomes ‘172/8376’. This string is then stored in the device’s ‘myPeerId’ attribute.

As a result, the device will advertise itself as a beacon using the Major/Minor values

‘172/8376’, and also participate in the peer-to-peer MC session using the same identiﬁer.

When the software detects a BLE signal from a virtual beacon device and simultaneously

receives data from an MC peer, it combines the Major/Minor values of the beacon signal

into a string variable, separated by a slash punctuation mark (‘/’). Thus, the BLE Ma-

jor/Minor values ‘172’ and ‘8376’ will again form ‘172/8376’ on the receiving end. The

software then compares this new string variable from the BLE signal with the received

ID of the MC peer. If they match, it conﬁrms that both signals originate from the same

device. If not, the software continues to compare the signals until all are correctly paired.
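The pairing rule reduces to building the same ‘Major/Minor’ string on both sides and comparing it with the MC peer IDs. The sketch below assumes simplified, hypothetical types (`RangedBeacon`, `pairSignals`) in place of the Core Location and Multipeer Connectivity objects, so that the matching logic can be read in isolation.

```swift
import Foundation

/// Builds the shared identifier from a beacon's Major/Minor values, e.g. "172/8376".
func beaconIdentifier(major: Int, minor: Int) -> String {
    return "\(major)/\(minor)"
}

/// A ranged beacon reduced to the fields needed for pairing (hypothetical type).
struct RangedBeacon {
    let major: Int
    let minor: Int
    let distance: Double  // estimated distance in meters
}

/// Returns the estimated distance for every MC peer ID that has a matching beacon signal.
func pairSignals(beacons: [RangedBeacon], peerIDs: [String]) -> [String: Double] {
    let knownIDs = Set(peerIDs)
    var distanceByPeer: [String: Double] = [:]
    for beacon in beacons {
        let id = beaconIdentifier(major: beacon.major, minor: beacon.minor)
        if knownIDs.contains(id) {
            distanceByPeer[id] = beacon.distance
        }
    }
    return distanceByPeer
}

let paired = pairSignals(
    beacons: [RangedBeacon(major: 172, minor: 8376, distance: 0.61),
              RangedBeacon(major: 453, minor: 3221, distance: 5.23)],
    peerIDs: ["172/8376", "7753/934"]
)
```

Here only ‘172/8376’ appears in both inputs, so it is the only entry in `paired`; the unmatched beacon and the unmatched peer stay unpaired until the other signal arrives.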

This straightforward solution links beacon signals with peer-to-peer session data. Once the software has identified which BLE and MC signals belong to the same device, it uses the BLE signal to calculate that device’s relative distance and updates the distance attribute of the corresponding peer object in the device’s list. Each signal is therefore associated with the correct user, which demonstrates the core functionality of the Transactional Area Network and highlights its decentralized nature.

// Loop through the existing beacon signals, with each iteration item as 'bb'
for bb in beacons {
    let amajor = bb.major.intValue
    let aminor = bb.minor.intValue
    // Create the string value of the Major/Minor pair, as an ID
    let theid = String(amajor) + "/" + String(aminor)
    print("Found a new Major/Minor pair as an ID:", theid)
    var idExists = false
    // Check if the ID created matches any known IDs, then save the distance of a
    // known Peer if the ID matches one from the list, or create a new Peer
    for pp in peers {
        if pp.id == theid {
            pp.distance = String(format: "%.2f", ceil(bb.accuracy * 100) / 100)
            print("Found ID matches existing Peer in the session. Distance from peer: \(bb.accuracy)")
            idExists = true
            table.reloadData()
            break
        }
    }
    if !idExists {
        let checkDistance = String(format: "%.2f", ceil(bb.accuracy * 100) / 100)
        // A negative accuracy means that the distance could not be determined
        if !checkDistance.hasPrefix("-") {
            print("Found ID does not match any known Peer. Creating new one...")
            let peer = Peer()
            peer.id = theid
            peer.name = "New peer"
            peer.image = #imageLiteral(resourceName: "profilepic.png")
            peer.distance = checkDistance
            peers.append(peer)
            table.reloadData()
        }
    }
}

Figure 6.8. The Major/Minor pairing process, and examining for matches in the known

Peers list [5].


Figure 6.9. TAN prototype software’s internal components’ architecture [5].

As illustrated in Figure 6.9, the proof-of-concept software for TAN is not merely a

fusion of two technologies. The proposed framework eﬀectively combines iBeacon and

Multipeer Connectivity, utilizing their respective signals to create a unique software so-

lution. In an Android context, these technologies would correspond to Eddystone and

Nearby Connections, respectively. TAN’s software modiﬁes the data transmitted through

each signal (BLE RSSI and MC), utilizing the CLBeaconRegion and MCServiceAdvertiser

modules for transmission, while simultaneously waiting for incoming signals through

the CLBeaconRegion (again) and MCServiceBrowser modules. The received data is then

processed in the BLE and MC Signal Pairing module to identify common senders, and

generate corresponding iBeacon/MC signal pairs. Once a match is found, the software

stores the data in a peer list, establishing a successful connection between the sender

and receiver. This approach not only enhances the accuracy of signal pairing, but also

strengthens the overall connection. Ultimately, this integration supports seamless com-

munication and data exchange within the decentralized structure of the Transactional

Area Network.

In terms of data transmission, it is essential to ensure that the data collected by

each device is properly formatted before being transmitted to an external database. The

resulting dataset, based on the peer list, is organized into JSON objects for eﬃcient storage

and analysis. Each JSON object contains key information, including a timestamp, the

BLE UUID, the device’s Major and Minor values, and the device’s ‘peerID’, which is derived

by combining its Major and Minor values (as previously mentioned). Since all relevant

details about each peer are stored in the device’s peer list, this list is converted into a

JSON array, where each entry represents an individual peer. Thus, each object includes


the peer’s ID (formed by the combination of its BLE Major/Minor values), name, a base64-

encoded image, and the calculated distance based on the BLE signal (as shown in Figure

6.10). Given the frequency of distance updates (every second), it is crucial to deﬁne an

optimal transmission interval to the database. This interval should balance the need for

real-time updates with eﬃcient data management. Typically, the transmission interval is

set between a few seconds and a minute, depending on the scale of each TAN.

// JSON Object
{
  "timestamp": "2024-05-22T12:34:56Z",
  "uuid": "123e4567-e89b-12d3-a456-426614174000",
  "major": 172,
  "minor": 8376,
  "id": "172/8376",
  "data": {
    "name": "Peer0",
    "image": "...AAANSUhETgAAA...",
    "extra_data_1": "example_value_1",
    "extra_data_2": "example_value_2"
  },
  "peers": [
    {
      "id": "453/3221",
      "data": {
        "name": "Peer1",
        "image": "...TG...",
        "distance": "5.23",
        "extra_data_1": "example_value_1",
        "extra_data_2": "example_value_2"
      }
    },
    {
      "id": "7753/934",
      "data": {
        "name": "Peer2",
        "image": "...",
        "distance": "3.45",
        "extra_data_1": "example_value_3",
        "extra_data_2": "example_value_4"
      }
    }
  ]
}

Figure 6.10. A data sample from the TAN’s proof-of-concept software [5].
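One way to produce objects of this shape in Swift is through `Codable` types and `JSONEncoder`. The sketch below is an assumption about how the conversion could be done, not the prototype's code: the type names (`PeerData`, `PeerEntry`, `TANReport`) are hypothetical, the ‘peers’ array is placed at the top level of the object, and the ‘extra_data’ placeholder fields are left out for brevity.

```swift
import Foundation

struct PeerData: Codable, Equatable {
    let name: String
    let image: String      // base64-encoded image
    let distance: String?  // set only for entries of the `peers` array
}

struct PeerEntry: Codable, Equatable {
    let id: String         // "Major/Minor" identifier of the peer
    let data: PeerData
}

struct TANReport: Codable, Equatable {
    let timestamp: String
    let uuid: String
    let major: Int
    let minor: Int
    let id: String
    let data: PeerData
    let peers: [PeerEntry]
}

/// Serializes one report; keys are sorted so that the output is stable.
func encodeReport(_ report: TANReport) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return try encoder.encode(report)
}

/// Parses a report back, e.g. on the receiving side of the database pipeline.
func decodeReport(_ data: Data) throws -> TANReport {
    return try JSONDecoder().decode(TANReport.self, from: data)
}

let sampleReport = TANReport(
    timestamp: "2024-05-22T12:34:56Z",
    uuid: "123e4567-e89b-12d3-a456-426614174000",
    major: 172, minor: 8376, id: "172/8376",
    data: PeerData(name: "Peer0", image: "QUJD", distance: nil),
    peers: [PeerEntry(id: "453/3221",
                      data: PeerData(name: "Peer1", image: "REVG", distance: "5.23"))]
)
```

Because `distance` is optional, the synthesized encoder simply omits it for the sending device's own entry, which mirrors the structure of the sample above.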

6.4.2 Testing and Results

Testing Scenario 1: Living Room

TAN’s proof-of-concept software has been tested on four iOS devices: an iPhone X, an

iPhone 8, an iPhone SE (2016) and an iPad Pro 10.5 (2017). The ﬁrst testing scenario

aimed to determine the distance between devices forming a Transactional Area Network

within an apartment’s living room. When each user enters their information into their

device, the software initiates the previously outlined procedure. It starts by broadcasting


a Bluetooth Low Energy (BLE) signal, functioning as a virtual beacon, and establishes a

new peer-to-peer session using Multipeer Connectivity (MC). On the BLE side, the signal

searches for beacons in the area with the same UUID. Simultaneously, in the MC phase,

the device creates a serviceAdvertiser, a serviceBrowser, and a session that continuously

searches for new peers and any data they may send.

Figure 6.11. Establishment of TAN between two peers, as seen from Daniel Adams’ device

(iPhone X) [5].

In Figure 6.11, the moment when two users enter TAN for the ﬁrst time is depicted. The

distance between user ‘Daniel Adams’ (whose device’s screen is captured in Figure 6.11)

and user ‘David Jones’ is less than one meter, indicating close proximity. The BLE

signal calculation in this instance is quite accurate. Measurements taken with a physical

measuring instrument indicated a distance of approximately 95 cm between the two

devices, resulting in a deviation of about ﬁve centimeters. Subsequently, David Jones

moved closer to Daniel Adams. The application running on David Jones’s device recorded

this change, as shown in Figure 6.12.


Figure 6.12. David Jones’s device (iPhone 8) showing the distance between him and Daniel

Adams [5].

In Figure 6.12, the user David Jones perceives the user Daniel Adams at a distance of

61 cm. According to the measurements taken, the deviation was three centimeters, with

the actual distance being 64 cm. At this point, another user enters the system. Sitting

close to Daniel Adams and David Jones is Alex Lopez, who is using an iPad tablet. He

wishes to join the decentralized TAN to detect other nearby users. The result is shown in

Figure 6.13.

Figure 6.13. Alex Lopez with his device (iPad Pro 10.5 2017) entering TAN [5].

In this ﬁgure, the information displayed on Alex Lopez’s device is shown. Alex appears

to be approximately one meter away from both Daniel Adams and David Jones. It should

be noted that Daniel and David are not seated next to each other. Enabling direction

information for each user in a Transactional Area Network is important for enhancing the

user experience and improving data generation. According to actual measurements, the

distance of Alex Lopez from Daniel Adams and David Jones had a deviation of 8 to 10 cm

from the system’s calculated value (the software calculated a shorter distance than the


actual one), indicating that Daniel and David were actually seated slightly further away

from Alex.

At this stage, David Jones decides to move further away from the other two users

within the system. The outcome is displayed on his device’s screen, in Figure 6.14:

Figure 6.14. David Jones’s new distance from the users, as shown in his device’s screen

[5].

From David’s device, Daniel appears to be 2.1 m away, while Alex is measured at 2.66

m. However, upon actual measurement, the BLE signal’s distance calculation shows a

deviation of approximately 15 cm. Consequently, the real distance between David and

Daniel is 2.25 m, while David and Alex are approximately 2.8 m apart. This initial testing

scenario, conducted in an apartment’s living room as previously mentioned, revealed that

the proof-of-concept software calculates the beacon signal’s distance quite accurately. As

expected, as the distance between devices increases, so does the deviation. Nonetheless,

these deviations remain relatively low, indicating a good level of precision in BLE signal

measurements, especially at close distances. Furthermore, data (users’ names and im-

ages) are seamlessly transferred via the peer-to-peer communication session with minimal

to no delays.

Testing Scenario 2: Cafeteria

The second phase of testing the proof-of-concept software for TAN took place in a

public cafeteria. Similar to the initial test case, user Daniel Adams initiates the process

on his iPhone X to detect nearby users. Upon entering the system, Daniel identiﬁes

two nearby devices/users already transmitting signals as virtual iBeacons and MC peers,

therefore having a TAN established. Speciﬁcally, one user named John Doe is using an

iPad, while another user named Catherine Smith is in the network with her iPhone SE

(2016). Both users’ information (names and images), along with the distance between


them and Daniel, is displayed on Daniel’s screen, as depicted in Figure 6.15.

Figure 6.15. Daniel Adams enters TAN through his iPhone X, viewing two peers nearby,

John Doe and Catherine Smith [5].

Daniel observes that Catherine is only 11 cm away from him, while John Doe is

approximately 41 cm away. Actual measurements showed a deviation of 2 cm to Catherine

(actual distance of 13 cm) and 6 cm to John Doe (actual distance approximately 47 cm)

from Daniel’s device. Shortly after, Daniel decides to make a phone call and exits the

cafeteria. Simultaneously, Catherine moves towards the bar, located further inside the

café, to order a beverage. Daniel’s device now displays updated distance data, as shown

in Figure 6.16.

Figure 6.16. Daniel Adams’s new distance from users John Doe and Catherine Smith [5].


In this instance (Figure 6.16), Daniel registers a distance of 4.48 m from John Doe and 10.23 m from Catherine. Actual measurements reveal larger deviations from the values calculated by the software: Daniel is approximately 3.9 m away from John Doe (a deviation of approximately 55 cm) and 9.1 m away from Catherine (a deviation of 1.1 m). Despite these relatively large deviations, the software’s calculation of BLE signal distance could still be considered reliable for longer distances. Although it cannot be considered an objective scientific acceptance criterion, it should be noted that the human eye has difficulty distinguishing a one-meter difference at distances of 9–10 m. Minimizing the distance error nevertheless warrants further analysis. Moreover, when a user approaches another user in the session whose distance was initially large, the deviation decreases. In summary, the observed deviations are not significant enough to deem the beacon’s distance calculation unreliable.

As the second testing scenario nears completion, Daniel and Catherine return to their seats and sit in the same positions as before. John checks his device’s screen to verify whether the distances match the earlier readings. Daniel should be approximately 40 cm away, with Catherine seated between them at a distance ranging from 10 to 20 cm. The measurements displayed on John’s screen are depicted in Figure 6.17.


Figure 6.17. John Doe takes one ﬁnal look at his iPad’s screen, observing the distance

between himself and users Daniel Adams and Catherine Smith [5].

The distance between John Doe and Daniel Adams appears to be 0.46 m, while John’s distance from Catherine Smith measures 0.17 m. Actual measurements show a similar deviation as before: John is actually 0.51 m away from Daniel (a deviation of about 5 cm) and 0.19 m away from Catherine (a deviation of two centimeters). As previously mentioned, BLE signal distance calculation proves relatively accurate at short distances. However, its accuracy diminishes with increasing distance, affecting the software’s calculations accordingly. Nonetheless, the Multipeer Connectivity session operates smoothly, with each device taking approximately 6 to 7 s (at worst) to receive both names and images of the other peers/users. It is worth noting that both testing scenarios have been repeated three times, yielding consistent results each time, with a deviation of ±3 cm for smaller distances and ±10 to ±20 cm for greater ones.

6.5 Findings

The concept of a ‘Transactional Area Network’ (TAN) presents an innovative model in

computer science, particularly within the realm of area networks. This network connects

personal devices in a decentralized fashion, aiming to enhance interaction, mutual in-

ﬂuence, and data exchange between entities. Such a network has the potential to serve

as a foundation for future indoor localization technologies. As outlined in its concept

analysis, a TAN-based indoor positioning system can be developed without requiring ex-

ternal hardware (such as beacon trackers), digital mapping, or specialized studies of the

indoor environment. The proof-of-concept software successfully demonstrated the net-

work’s ability to locate smart devices in an indoor setting, calculate distances between

them based on their BLE beacon signals, and facilitate data exchange through peer-to-

peer communication sessions. The key features and principles of TAN could promote

the widespread adoption of indoor localization applications, oﬀering practical beneﬁts to


users by enabling seamless location tracking and data sharing.

However, strengthening the robustness of TAN requires addressing several important

factors. While there have been various proposals to enhance the accuracy of indoor

positioning systems, their integration into a TAN must be carefully considered. Solutions

such as employing machine learning algorithms, utilizing more internal device sensors,

or applying ﬁltering techniques could improve the accuracy of distance measurements

across the network. Nonetheless, these strategies may raise concerns regarding energy consumption, which could deter end-users from adopting the technology. Therefore, software and frameworks built on the TAN model must be designed to be energy-efficient, so that users can engage with the network for extended periods without limitations.

Yet, the accuracy of distance calculations within TAN requires further investigation

and reﬁnement, particularly over longer distances, where deviations from actual mea-

surements tend to become more pronounced. The existing research discussed in this

chapter provides valuable insights that could help address this challenge. In light of

the impact on energy consumption, an Extended Kalman Filtering model could be ap-

plied to the BLE RSSI signals, supplemented by an Outlier Detection layer. Additionally,

providing users with directional information about each peer within the network is cru-

cial. Knowing both the direction and distance to nearby peers signiﬁcantly improves a

user’s spatial awareness and understanding of their relative position within the network.

These advancements could be integrated into future iterations of TAN’s proof-of-concept

software, as depicted in Figure 6.18.

Figure 6.18. Proposed improvement in TAN software’s internal components Architecture,

highlighting the addition of Kalman ﬁltering and outlier detection methods, along with a

signal direction calculation process [5].
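To make the filtering idea concrete, the sketch below smooths a stream of RSSI readings and discards implausible jumps before they reach the distance calculation. It is deliberately simpler than the proposal in the text: it uses a plain scalar (linear) Kalman filter rather than an Extended Kalman Filter, and a fixed-threshold gate as the outlier detection layer. The type name `RSSIKalmanFilter` and all noise and threshold values are illustrative assumptions, not tuned parameters.

```swift
import Foundation

/// Scalar Kalman filter for a slowly varying RSSI level, with a simple outlier gate.
struct RSSIKalmanFilter {
    private(set) var estimate: Double         // current smoothed RSSI (dBm)
    private(set) var errorCovariance: Double  // uncertainty of the estimate
    let processNoise: Double                  // Q: how quickly the true RSSI may drift
    let measurementNoise: Double              // R: variance of a single reading
    let outlierThreshold: Double              // readings further away than this (dB) are dropped

    init(initialRSSI: Double, processNoise: Double = 0.05,
         measurementNoise: Double = 4.0, outlierThreshold: Double = 15.0) {
        self.estimate = initialRSSI
        self.errorCovariance = 1.0
        self.processNoise = processNoise
        self.measurementNoise = measurementNoise
        self.outlierThreshold = outlierThreshold
    }

    /// Feeds one reading and returns the updated estimate.
    /// Readings rejected by the outlier gate leave the estimate unchanged.
    @discardableResult
    mutating func update(with rssi: Double) -> Double {
        // Predict: the state model is "RSSI stays the same", so only the uncertainty grows.
        errorCovariance += processNoise
        // Outlier gate: skip readings that jump too far from the current estimate.
        if abs(rssi - estimate) > outlierThreshold {
            return estimate
        }
        // Correct: blend the reading into the estimate according to the Kalman gain.
        let gain = errorCovariance / (errorCovariance + measurementNoise)
        estimate += gain * (rssi - estimate)
        errorCovariance *= (1 - gain)
        return estimate
    }
}

var rssiFilter = RSSIKalmanFilter(initialRSSI: -70)
for reading in [-70.0, -71.0, -69.0, -30.0, -70.0] {
    rssiFilter.update(with: reading)  // the -30 dBm spike is rejected by the gate
}
```

A fixed gate is cheap, which matters for the energy concerns raised above, but it would also reject a genuine sudden change such as a user walking quickly toward a peer; a production design would need a way to recover from that case.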


Future research on the concept of a Transactional Area Network should primarily focus on a thorough examination of algorithms, accuracy, and efficiency. It should investigate the potential limitations of TAN, such as the effects of physical barriers like walls on signal propagation, and how these might influence localization accuracy by introducing geometric distortions. Additionally, the study of distance measurement errors, standard deviation, and root mean square error should be considered, drawing on the work of Dalkılıç et al. [177] on BLE signal accuracy. The proposed enhancements in Figure 6.18 should integrate algorithmic analysis based on the work of Dalkılıç et al. [177] and Dinh et al. [151]. These efforts would help validate the theoretical developments discussed in this chapter, ensuring that TANs can be applied effectively in practical scenarios.
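As a starting point for such an error study, the mean absolute error and root mean square error can be computed from pairs of estimated and measured distances. The sketch below uses the readings reported in Section 6.4.2 for which both values are stated, with centimeter values converted to meters and approximate figures taken at face value; the type and function names are hypothetical.

```swift
import Foundation

/// One comparison between the software's estimate and a manually measured distance (meters).
struct DistanceSample {
    let estimated: Double
    let actual: Double
    var error: Double { estimated - actual }
}

/// Mean of the absolute errors, or nil for an empty input.
func meanAbsoluteError(_ samples: [DistanceSample]) -> Double? {
    guard !samples.isEmpty else { return nil }
    return samples.map { abs($0.error) }.reduce(0, +) / Double(samples.count)
}

/// Square root of the mean squared error, or nil for an empty input.
func rootMeanSquareError(_ samples: [DistanceSample]) -> Double? {
    guard !samples.isEmpty else { return nil }
    let meanSquare = samples.map { $0.error * $0.error }.reduce(0, +) / Double(samples.count)
    return meanSquare.squareRoot()
}

// Estimated/actual pairs from the living room and cafeteria scenarios.
let reportedSamples = [
    DistanceSample(estimated: 0.61, actual: 0.64),
    DistanceSample(estimated: 2.10, actual: 2.25),
    DistanceSample(estimated: 2.66, actual: 2.80),
    DistanceSample(estimated: 0.11, actual: 0.13),
    DistanceSample(estimated: 0.41, actual: 0.47),
    DistanceSample(estimated: 4.48, actual: 3.90),
    DistanceSample(estimated: 10.23, actual: 9.10),
    DistanceSample(estimated: 0.46, actual: 0.51),
    DistanceSample(estimated: 0.17, actual: 0.19)
]

if let mae = meanAbsoluteError(reportedSamples),
   let rmse = rootMeanSquareError(reportedSamples) {
    print(String(format: "MAE: %.3f m, RMSE: %.3f m", mae, rmse))
}
```

Because RMSE weights large errors more heavily than MAE, the gap between the two values gives a quick indication of how strongly the long-distance readings dominate the overall error.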

Regarding the implementation of the software in Swift, one minor issue arises from the limit on the number of peers that each Multipeer Connectivity (MC) session can accommodate (a maximum of 8 peers, as specified by Apple in the predefined value ‘maximumNumberOfPeers’) [178]. However, this limitation can be addressed by creating multiple MC sessions, each supporting up to 5 peers, thus keeping energy consumption and resource requirements manageable. In this setup, certain peers would function as intermediaries, or bridges, between different sessions. These intermediary peers would receive data from one session and relay it to the others, effectively linking the sessions and extending network coverage.
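The bridging idea can be illustrated independently of Apple's framework. The following is a minimal sketch in Python (not Swift, and not using the Multipeer Connectivity API) of one possible way to group peers into sessions of at most 5 members, where consecutive sessions share a single bridge peer. The function `plan_sessions`, the peer labels, and the rule that the last member of a session doubles as the bridge are all assumptions made for this example.

```python
def plan_sessions(peers, max_peers_per_session=5):
    """Split peers into sessions; consecutive sessions share one bridge peer.

    Returns (sessions, bridges), where each bridge belongs to two sessions.
    """
    if max_peers_per_session < 2:
        raise ValueError("a session needs at least 2 peers to allow bridging")
    if len(peers) <= max_peers_per_session:
        return [list(peers)], []  # a single session is enough, no bridge needed

    # Each new session starts at the last member of the previous one,
    # so that member is present in both sessions and can relay data.
    step = max_peers_per_session - 1
    sessions = []
    start = 0
    while start < len(peers) - 1:
        sessions.append(list(peers[start:start + max_peers_per_session]))
        start += step
    bridges = [session[-1] for session in sessions[:-1]]
    return sessions, bridges


# Hypothetical peer identifiers
peers = [f"p{i}" for i in range(12)]
sessions, bridges = plan_sessions(peers)
for index, members in enumerate(sessions, start=1):
    print(f"session {index}: {members}")
print("bridge peers:", bridges)
```

A chain of sessions is only one possible topology; a star layout with a single hub would reduce relay hops but concentrate the energy cost on one device, which is a trade-off the text leaves open.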

Lastly, it is essential to propose an efficient data management system for effectively retrieving and processing the indoor positioning data transmitted by each device within a TAN. To enhance the dataset, each device should also transmit its GPS coordinates, offering valuable context for data engineers analyzing the information. The scalable data management layer, a main part of this PhD thesis, is designed to manage large data streams in near real-time, making it well-suited for integration with a Transactional Area Network. For analyzing the data generated by TANs, this thesis’s data analysis asset, which relies on offline LLMs, can prove to be a vital assistant for end-users. Thus, the complete framework of this PhD thesis is a robust solution for managing, storing, and deeply analyzing TAN data.

In conclusion, the implementation of a Transactional Area Network holds significant promise for advancing indoor positioning systems, offering new services to end-users and simplifying the technology to facilitate widespread adoption. The peer-to-peer design promotes direct communication between devices, aiming for a decentralized network that is both efficient and scalable. By minimizing the need for external devices, TAN also enhances accessibility and affordability, while ensuring that consistent data collection and clarity allow for the effective reuse of the data. As TANs continue to develop and gain adoption, they have the potential to improve indoor positioning for consumers and expand their applications across a variety of industries, transforming how people interact with indoor spaces. In emergency situations, such as disasters, TANs could provide life-saving capabilities by helping rescuers locate survivors. Ultimately, the Transactional Area Network could redefine the foundation of future indoor positioning applications, harnessing the power of indoor localization data.


Part III

Epilogue


Chapter 7

Conclusions: Towards Scalable Data Management Empowered By Offline Large Language Models

7.1 Managing Scalable Volumes of Data

The specific demands of critical infrastructures and business organizations underline the importance of a middleware system that is able to structure data in a way that supports reliable and effective analysis. To enable smooth interaction between connected components, services must also be adapted in advance, so as to ensure interoperability. This thesis’s scalable data management layer follows a data-centric design, providing a platform where small and medium-sized enterprises (SMEs), telecom companies, data providers, content creators, transportation authorities, and other stakeholders across various domains can work together within a well-organized and collaborative data-sharing environment. Creating such a space is a key goal for any asset aiming to build and maintain an ecosystem that attracts businesses, startups, and independent developers.

As part of this PhD study, the scalable data management layer is specifically designed to tackle key challenges faced by organizations today, including issues of data quality and operational efficiency. To improve data reliability, the software layer applies advanced cleansing methods that ensure data from various sources remains accurate, complete, and consistent. The performance of the asset has been evaluated as a standalone component, using real-world datasets: customs data from the port of Valencia, as well as anonymized mobility data from the Thessaloniki metropolitan area. These results confirm that the scalable data management layer can be effectively applied in other real-life scenarios. The layer plays an integral role in the final phase of this study, where it is interconnected with an offline LLM, leading to the establishment of an AI-powered data management and analysis framework that aims to ease data analytics tasks for end-users.

7.2 Leveraging Offline LLMs for Data Analysis

The final, and main, part of this thesis investigates how offline large language models (LLMs) can be used to generate code for data analysis tasks, also incorporating the scalable data management layer previously described. While well-known online models (such as GPT or Gemini) offer advanced capabilities, their use may pose security and privacy risks, especially when dealing with sensitive data. Additionally, current LLMs may face difficulties in processing large-scale datasets or in fully understanding the specific context of a query. These difficulties can lead to inaccurate results. Offline LLMs offer a promising alternative, allowing organizations to run these models locally and maintain full control over both the model and their data. In this configuration, users submit natural language queries, which the model interprets to produce suitable data analysis code. This code is then passed through a communication pipeline to the data management layer, where it is executed to extract the final results.
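The flow described above (natural language query, generated code, execution next to the data) can be sketched in a few lines. The following is a minimal, hypothetical illustration in plain Python and is not the thesis's Spark-based pipeline: `fake_offline_llm` stands in for a locally hosted model, `run_generated_code` stands in for the data management layer, and the convention that generated code stores its answer in a variable named `result` is an assumption made for this example.

```python
def fake_offline_llm(query, dataset_summary):
    """Stand-in for a local LLM: returns analysis code for one known query type."""
    # A real deployment would prompt an offline model with the query and the summary.
    if "average" in query.lower():
        column = dataset_summary["columns"][0]
        return f"result = sum(row['{column}'] for row in data) / len(data)"
    raise ValueError("query not supported by this stub")


def run_generated_code(code, data):
    """Stand-in for the data layer: execute the code where the data resides."""
    # Emptying the builtins is only a light guard, not a security sandbox.
    namespace = {"__builtins__": {}, "data": data, "sum": sum, "len": len}
    exec(code, namespace)
    return namespace["result"]


# Hypothetical dataset summary and rows
summary = {"columns": ["total_cars"]}
rows = [{"total_cars": 2}, {"total_cars": 4}, {"total_cars": 6}]

code = fake_offline_llm("What is the average number of cars per location?", summary)
answer = run_generated_code(code, rows)
print(code)
print(answer)
```

Only the query and a dataset summary reach the model in this sketch; the rows themselves stay with the execution side, which mirrors the privacy argument made in the text.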

For this study, Mistral AI’s Codestral and Alibaba’s Qwen 2.5 Coder were selected as the offline LLMs, chosen for their strong performance and efficiency compared to other code-focused models. The communication pipeline and the data management layer were both built using Apache Spark, ensuring robust and scalable data handling. Five different datasets were used in the testing phase. Each dataset was paired with five queries of different complexity levels. To ensure thorough evaluation, each query was tested ten times. This process resulted in 250 individual tests per model (5 datasets × 5 queries × 10 repetitions), allowing for a detailed assessment of how accurately Codestral and Qwen could generate appropriate code in response to natural language inputs.
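The size of the test campaign follows directly from the experimental design. As a small sanity check, the sketch below enumerates the test matrix in Python; the dataset and query labels are placeholders and not the actual names used in the study.

```python
from itertools import product

DATASETS = [f"dataset_{i}" for i in range(1, 6)]  # five datasets (placeholder labels)
QUERIES = [f"query_{i}" for i in range(1, 6)]     # five queries per dataset
REPETITIONS = range(1, 11)                        # each query is run ten times
MODELS = ["Codestral", "Qwen 2.5 Coder"]

# One entry per (dataset, query, repetition) combination, for each model
test_plan = {
    model: list(product(DATASETS, QUERIES, REPETITIONS))
    for model in MODELS
}

for model, runs in test_plan.items():
    print(f"{model}: {len(runs)} tests")
```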

The findings of this PhD research emphasize the strong potential of offline large language models (LLMs) in generating accurate and understandable code for data analysis operations, all while maintaining strict data privacy standards. Both Codestral and Qwen performed well, producing reliable and readable scripts with a high degree of automation. Nonetheless, a major limitation identified is computational performance, particularly the need for more advanced GPU resources to deliver faster, near real-time responses. To overcome this challenge, future work could focus on fine-tuning the models, designing a user-friendly interface, and exploring hybrid system architectures that balance scalability with privacy. These improvements would help address current technical limitations and support the transition from experimental setups to practical, enterprise-level solutions.

This PhD study may serve as a foundation for further research into the use of offline LLMs for data analysis through natural language-driven code generation. Rather than transferring data to an external model, this approach emphasizes moving the code generation process directly to where the data resides. Continued research in this area is expected, as the potential of LLMs within the field of Data Science is far from fully realized. Over time, this method could become a widely adopted approach for performing data analysis. One clear trend is the growing influence of LLMs across scientific disciplines, with Data Science likely to be one of the fields most deeply shaped by their integration. The extent to which these models will redefine standard data analysis practices remains an open and evolving question.


Appendices


Appendix

A.1 The Python code readability calculation function

Listing A.1: A custom function to provide a readability score, from 1 to 3 (with 3 meaning

higher readability), to the LLM’s generated code

import ast
import re


def evaluate_readability(code):
    """
    The function analyzes the code to determine the readability score by:
    - Checking the length of each command.
    - Counting the number of method call chains.
    - Analyzing the depth of nested structures.
    """
    try:
        # Split the code into individual commands based on newlines and semicolons
        commands = [cmd.strip() for cmd in code.replace('\n', ';').split(';') if cmd.strip()]
        readability_score = 0  # Initialize the readability score
        total_commands = len(commands)  # Get the total number of commands
        if total_commands == 0:
            return readability_score

        # Initialize counters for long lines and maximum chain length
        long_lines = 0
        max_chain_length = 0

        for cmd in commands:
            # Split the command into lines to check for long lines
            lines = re.split(r';\s*|\n', cmd)
            long_lines += sum(1 for line in lines if len(line) > 80)  # Count lines longer than 80 characters

            # Parse the command into an abstract syntax tree (AST)
            try:
                tree = ast.parse(cmd)
            except SyntaxError:
                continue  # Skip this command if there's a syntax error

            # Count method calls in chained operations
            for node in ast.walk(tree):
                if isinstance(node, ast.Call):
                    chain_length = 0
                    current_node = node.func
                    # Traverse the chain of method calls to count its length
                    # (only consecutive attribute accesses are followed; the walk stops at an intermediate call)
                    while isinstance(current_node, ast.Attribute):
                        chain_length += 1
                        current_node = current_node.value
                    max_chain_length = max(max_chain_length, chain_length)

        # Adjust readability score for long lines
        if long_lines / total_commands < 0.1:  # Allow some tolerance for long lines
            readability_score += 1

        # Adjust readability score based on the length of method call chains
        if max_chain_length <= 3:  # Full score for chains up to 3 method calls
            readability_score += 1

        # Parse the entire code to evaluate nested structures
        try:
            tree = ast.parse(code)
        except SyntaxError:
            return readability_score  # Return the score if there's a syntax error in the entire code

        def count_nested_structures(node, depth=0):
            # Increase depth for specific nodes (if, for, while, function definitions)
            if isinstance(node, (ast.If, ast.For, ast.While, ast.FunctionDef)):
                depth += 1
            # Recursively check the depth of nested structures
            return max([count_nested_structures(child, depth) for child in ast.iter_child_nodes(node)], default=depth)

        nested_structure_depth = count_nested_structures(tree)

        # Full score if the maximum depth of nested structures is 2 or less
        if nested_structure_depth <= 2:
            readability_score += 1

        return readability_score  # Return the final readability score

    except Exception as e:
        print(f"Error evaluating readability: {e}")
        return 0  # Return a score of 0 if there's an error
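The method-chain criterion of Listing A.1 can also be explored in isolation. The sketch below is a hypothetical helper, not part of the listing, that counts chained method calls such as `df.filter(...).groupBy(...).count()` by stepping through both attribute and call nodes of the syntax tree. It is a variant written for illustration and may report different values from the listing's own traversal; the name `max_method_chain` is an assumption.

```python
import ast


def max_method_chain(source):
    """Return the largest number of attribute links leading to a call in the source."""
    longest = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        length = 0
        current = node.func
        while True:
            if isinstance(current, ast.Attribute):
                length += 1             # one link in the chain, e.g. ".count"
                current = current.value
            elif isinstance(current, ast.Call):
                current = current.func  # step through an intermediate call
            else:
                break                   # reached the base name or another expression
        longest = max(longest, length)
    return longest


chain_a = max_method_chain('df.filter(x).groupBy("a").count()')
chain_b = max_method_chain('spark.read.csv("file.csv")')
chain_c = max_method_chain('print(1)')
print(chain_a, chain_b, chain_c)
```

Counting links this way treats `a.b.c()` and `a.b().c()` alike, which may or may not match what a given readability guideline intends, so the threshold applied to the result should be chosen with that in mind.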

A.2 Summary of the ‘Shared Cars Locations’ Dataset

Listing A.2: A dataset summary of the "Shared Cars Locations" dataset, carefully crafted

by a research team member.

{
  "intro": "This dataset contains information about shared car locations, including various attributes that describe each entry.",
  "columns": [
    {
      "name": "latitude",
      "type": "float64",
      "simplified_type": "float",
      "description": "Latitude coordinate of the car's location.",
      "sample_values": [
        32.083,
        32.064615,
        32.113535
      ]
    },
    {
      "name": "longitude",
      "type": "float64",
      "simplified_type": "float",
      "description": "Longitude coordinate of the car's location.",
      "sample_values": [
        34.8043,
        34.77577,
        34.7911
      ]
    },
    {
      "name": "total_cars",
      "type": "int64",
      "simplified_type": "int",
      "description": "Total number of cars available at the location.",
      "sample_values": []
    },
    {
      "name": "cars_list",
      "type": "object",
      "simplified_type": "string",
      "description": "List of car identifiers available at the location.",
      "sample_values": [
        "[]",
        "[213]",
        "[137,180,193]"
      ]
    },
    {
      "name": "timestamp",
      "type": "object",
      "simplified_type": "datetime",
      "description": "Timestamp of the data record in UTC.",
      "sample_values": [
        "2019-12-06 21:51:02 UTC",
        "2019-11-29 14:00:02 UTC",
        "2019-09-20 13:21:03 UTC"
      ],
      "datetime_format": "%Y-%m-%d %H:%M:%S %Z"
    }
  ],
  "note": "The sample data provided here is synthetic and intended to illustrate the data structure."
}
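The `datetime_format` field of the summary is what lets downstream code interpret the timestamp strings. As a minimal sketch, assuming the sample timestamps are written as `2019-12-06 21:51:02 UTC` with single spaces between date, time, and zone, the snippet below parses them with Python's standard library. The dictionary `summary_column` and the helper `parse_samples` are illustrative names and not part of the thesis's software.

```python
from datetime import datetime, timezone

# Excerpt of the timestamp column from the dataset summary (spacing assumed)
summary_column = {
    "name": "timestamp",
    "sample_values": ["2019-12-06 21:51:02 UTC", "2019-11-29 14:00:02 UTC"],
    "datetime_format": "%Y-%m-%d %H:%M:%S %Z",
}


def parse_samples(column):
    """Parse each sample value with the column's declared format, as UTC."""
    fmt = column["datetime_format"]
    # %Z only matches the zone name; the parsed value carries no tzinfo,
    # so UTC is attached explicitly.
    return [
        datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        for value in column["sample_values"]
    ]


parsed = parse_samples(summary_column)
for moment in parsed:
    print(moment.isoformat())
```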


Bibliography

[1] Anastasios Nikolakopoulos, Matilde Julian Segui, Andreu Belsa Pellicer, Michalis Kefalogiannis, Christos Antonios Gizelis, Achilleas Marinakis, Konstantinos Nestorakis and Theodora Varvarigou. BigDaM: Efficient Big Data Management and Interoperability Middleware for Seaports as Critical Infrastructures. Computers, 12(11):218, 2023.

[2] Anastasios Nikolakopoulos, Efthymios Chondrogiannis, Efstathios Karanastasis, María José López Osa, Jordi Arjona Aroca, Michalis Kefalogiannis, Vasiliki Apostolopoulou, Efstathia Deligeorgi, Vasileios Siopidis and Theodora Varvarigou. Scalable Data Profiling for Quality Analytics Extraction. IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 177–189. Springer, 2024.

[3] Achilleas Marinakis, Matilde Julian Segui, Andreu Belsa Pellicer, Carlos E Palau, Christos Antonios Gizelis, Anastasios Nikolakopoulos, Antonios Misargopoulos, Filippos Nikolopoulos-Gkamatsis, Michalis Kefalogiannis, Theodora Varvarigou and others. Efficient Data Management and Interoperability Middleware in Business-Oriented Smart Port Use Cases. IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 108–119. Springer, 2022.

[4] Anastasios Nikolakopoulos, Antonios Litke, Alexandros Psychas, Eleni Veroni and Theodora Varvarigou. Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis. IEEE Access, 13:64087–64114, 2025.

[5] Anastasios Nikolakopoulos, Alexandros Psychas, Antonios Litke and Theodora Varvarigou. Leveraging Indoor Localization Data: The Transactional Area Network (TAN). Electronics, 13(13):2454, 2024.

[6] Economist. The world’s most valuable resource is no longer oil, but data - Web Article, 2017.

[7] Yash Agrawal. The Accelerating Pace of Technological Trends - Adapting to Market Dynamics as an IT Professionals - Web Article, 2023.

[8] Fabio Duarte. Amount of Data Created Daily - Web Article, 2024.

[9] Steven M Rinaldi. Modeling and simulating critical infrastructures and their interdependencies. Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004, 8 pp. IEEE, 2004.

[10] John D Moteff, Claudia Copeland, John W Fischer, Science Resources and Industry Division. Critical infrastructures: What makes an infrastructure critical? Congressional Research Service, Library of Congress Washington, DC, 2003.

[11] OpenAI. GPT-4o Large Language Model, 2024. accessed on 25 April 2025.

[12] Google. Gemini Large Language Model, 2024. accessed on 25 April 2025.

[13] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.

[14] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong and others. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

[15] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier and others. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023.

[16] Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci and Claudia Nerdel. Assessing student errors in experimentation using artificial intelligence and large language models: A comparative study with human raters. Computers and Education: Artificial Intelligence, 5:100177, 2023.

[17] Christof Ebert and Panos Louridas. Generative AI for software practitioners. IEEE Software, 40(4):30–38, 2023.

[18] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.

[19] Joonsang Baek, Quang Hieu Vu, Joseph K Liu, Xinyi Huang and Yang Xiang. A secure cloud computing based framework for big data information management of smart grid. IEEE transactions on cloud computing, 3(2):233–244, 2014.

[20] Andre Luckow, Ken Kennedy, Fabian Manhardt, Emil Djerekarov, Bennie Vorster and Amy Apon. Automotive big data: Applications, workloads and infrastructures. 2015 IEEE International Conference on Big Data (Big Data), pages 1201–1210. IEEE, 2015.

[21] Apache Hadoop - Framework. https://hadoop.apache.org.

[22] Ivo D Dinov. Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data. Gigascience, 5(1):s13742–016, 2016.

[23] Kuljeet Kaur, Sahil Garg, Georges Kaddoum, Elias Bou-Harb and Kim Kwang Raymond Choo. A big data-enabled consolidated framework for energy efficient software defined data centers in IoT setups. IEEE Transactions on Industrial Informatics, 16(4):2687–2697, 2019.

[24] Praveen Kumar Donta, Boris Sedlak, Victor Casamayor Pujol and Schahram Dustdar. Governance and sustainability of distributed continuum systems: a big data approach. Journal of Big Data, 10(1):1–31, 2023.

[25] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu and Haihua Chen. Investigating code generation performance of ChatGPT with crowdsourcing social data. 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), pages 876–885. IEEE, 2023.

[26] Qiuhan Gu. Llm-based code generation method for golang compiler testing. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2201–2203. ACM, 2023.

[27] Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller and Justin D Weisz. The programmer’s assistant: Conversational interaction with a large language model for software development. Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 491–514. ACM, 2023.

[28] Ahmed Soliman, Samir Shaheen and Mayada Hadhoud. Leveraging pre-trained language models for code generation. Complex & Intelligent Systems, pages 1–26, 2024.

[29] CoNaLa. CoNaLa: The Code / Natural Language Challenge, 2024. accessed on 25 April 2025.

[30] Yusuke Oda. Django Dataset for Code Translation Tasks, 2024. accessed on 25 April 2025.

[31] Giovanni Pinna, Damiano Ravalico, Luigi Rovito, Luca Manzoni and Andrea De Lorenzo. Enhancing Large Language Models-Based Code Generation by Leveraging Genetic Improvement. European Conference on Genetic Programming (Part of EvoStar), pages 108–124. Springer, 2024.

[32] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang and Tao Xie. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–12. ACM, 2024.

[33] Safwan Omari, Kshitiz Basnet and Mohammad Wardat. Investigating large language models capabilities for automatic code repair in Python. Cluster Computing, 2024.

[34] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo and Jie M Zhang. Large language models for software engineering: Survey and open problems. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53. IEEE, 2023.

[35] Man Fai Wong, Shangxin Guo, Ching Nam Hang, Siu Wai Ho and Chee Wei Tan. Natural language generation and understanding of big code for AI-assisted programming: A review. Entropy, 25(6):888, 2023.

[36] GitHub. GitHub Copilot AI Developer Tool, 2024. accessed on 25 April 2025.

[37] Google. DeepMind AlphaCode AI Developer Tool, 2024. accessed on 25 April 2025.

[38] Jianxun Wang and Yixiang Chen. A Review on Code Generation with LLMs: Application and Evaluation. 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pages 284–289. IEEE, 2023.

[39] Hsiao Chuan Liu, Chia Tung Tsai and Min Yuh Day. A Pilot Study on AI-Assisted Code Generation with Large Language Models for Software Engineering. International Conference on Technologies and Applications of Artificial Intelligence, pages 162–175. Springer, 2024.

[40] IBM. What is Prompt Engineering?, 2024. accessed on 25 April 2025.

[41] Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo and Joyce Nakatumba-Nabende. Prompt engineering in large language models. International conference on data intelligence and cognitive informatics, pages 387–402. Springer, 2023.

[42] Juan David Velásquez-Henao, Carlos Jaime Franco-Cardona and Lorena Cadavid-Higuita. Prompt Engineering: a methodology for optimizing interactions with AI-Language Models in the field of engineering. Dyna, 90(230):9–17, 2023.

[43] William Cain. Prompting change: exploring prompt engineering in large language model AI and its potential to transform education. TechTrends, 68(1):47–57, 2024.

[44] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.

[45] Luca Beurer-Kellner, Marc Fischer and Martin Vechev. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, 2023.

[46] Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg and Elena L Glassman. ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing. Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–18. ACM, 2024.

[47] Apache. Apache Spark - Framework.

[48] Diego García-Gil, Sergio Ramírez-Gallego, Salvador García and Francisco Herrera. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Analytics, 2(1):1–11, 2017.

[49] MongoDB - Framework. https://www.mongodb.com.

[50] Kubernetes - Framework. https://kubernetes.io.

[51] Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

[52] MistralAI. Codestral Large Language Model, 2024. accessed on 25 April 2025.

[53] Alibaba Qwen. Qwen 2.5 Coder Large Language Model, 2024. accessed on 25 April 2025.

[54] IBM. What is data profiling? - Web Article.

[55] IEEE Spectrum Bill Sweet. The Smart Meter Avalanche - Web Article. https://spectrum.ieee.org/the-smart-meter-avalanche#toggle-gdpr.

[56] IT News Australia Nate Cochrane. US smart grid to generate 1000 petabytes of data a year - Web Article. https://www.itnews.com.au/news/us-smart-grid-to-generate-1000-petabytes-of-data-a-year-170290.

[57] Data Dynamics. Analyzing Energy Consumption: Unleashing the Power of Data in the Energy Industry - Web Article. https://www.datadynamicsinc.com/blog-analyzing-energy-consumption-unleashing-the-power-of-data-in-the-energy-industry/.

[58] Big Data and Transport Understanding and assessing options - International Transport Forum. https://www.itf-oecd.org/sites/default/files/docs/15cpb_bigdata_0.pdf.

[59] Weiwei Jiang and Jiayun Luo. Big data for traffic estimation and prediction: a survey of data and tools. Applied System Innovation, 5(1):23, 2022.

[60] Hira Zahid, Tariq Mahmood, Ahsan Morshed and Timos Sellis. Big data analytics in telecommunications: literature review and architecture recommendations. IEEE/CAA Journal of Automatica Sinica, 7(1):18–38, 2019.

[61] DataPorts Horizon 2020 EU Research Project. https://dataports-project.eu/.

[62] Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using gpt-4. arXiv preprint arXiv:2311.07361, 2023.

[63] Anastasios Nikolakopoulos, Spyridon Evangelatos, Eleni Veroni, Konstantinos Chasapas, Nikolaos Gousetis, Apostolos Apostolaras, Christos D. Nikolopoulos and Thanasis Korakis. Large Language Models in Modern Forensic Investigations: Harnessing the Power of Generative Artificial Intelligence in Crime Resolution and Suspect Identification. 2024 5th International Conference in Electronic Engineering, Information Technology and Education (EEITE), pages 1–5. IEEE, 2024.

[64] Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu and Robert McHardy. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169, 2023.

[65] Reihaneh H Hariri, Erik M Fredericks and Kate M Bowers. Uncertainty in big data analytics: survey, opportunities, and challenges. Journal of Big data, 6(1):1–16, 2019.

[66] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[67] Fabian Suchanek and Anh Tuan Luu. Knowledge bases and language models: Complementing forces. International Joint Conference on Rules and Reasoning, pages 3–15. Springer, 2023.

[68] Kai Kang, Yuqi Yang, Yijun Wu and Ren Luo. Integrating Large Language Models in Bioinformatics Education for Medical Students: Opportunities and Challenges. Annals of Biomedical Engineering, pages 1–5, 2024.

[69] European Union Law. General Data Protection Regulation, 2016. accessed on 25 April 2025.

[70] Qian Wang, Minxin Du, Xiuying Chen, Yanjiao Chen, Pan Zhou, Xiaofeng Chen and Xinyi Huang. Privacy-preserving collaborative model learning: The case of word vector training. IEEE Transactions on Knowledge and Data Engineering, 30(12):2381–2393, 2018.

[71] Wei Dai, Isaac Wardlaw, Yu Cui, Kashif Mehdi, Yanyan Li and Jun Long. Data profiling technology of data governance regarding big data: review and rethinking. Information Technology: New Generations: 13th International Conference on Information Technology, pages 439–450. Springer, 2016.

[72] Ziawasch Abedjan, Lukasz Golab and Felix Naumann. Data profiling: A tutorial. Proceedings of the 2017 ACM International Conference on Management of Data, pages 1747–1751, 2017.

[73] Ikbal Taleb, Mohamed Adel Serhani and Rachida Dssouli. Big data quality: a data quality profiling model. World Congress on Services, pages 61–77. Springer, 2019.

[74] Zhicheng Liu and Aoqian Zhang. Sampling for big data profiling: A survey. IEEE Access, 8:72713–72726, 2020.

[75] Zhicheng Liu and Aoqian Zhang. A survey on sampling and profiling over big data (technical report). arXiv preprint arXiv:2005.05079, 2020.

[76] Bahaa Eddine Elbaghazaoui, Mohamed Amnai and Abdellatif Semmouri. Data profiling over big data area: a survey of big data profiling: state-of-the-art, use cases and challenges. Intelligent Systems in Big Data, Semantic Web and Machine Learning, pages 111–123. 2021.

[77] Júlia Colleoni Couto, Juliana Damasio, Rafael Bordini and Duncan Ruiz. New trends in big data profiling. Science and Information Conference, pages 808–825. Springer, 2022.

[78] Anastasija Nikiforova. Definition and Evaluation of Data Quality: User-Oriented Data Object-Driven Approach to Data Quality Assessment. Baltic Journal of Modern Computing, 8(3), 2020.

[79] Marcel Altendeitering, ISST Fraunhofer and Tobias Moritz Guggenberger. Data Quality Tools: Towards a Software Reference Architecture. 2024.

[80] European Sea Ports Organisation - Conference. https://www.espo.be/.

[81] International Association of Ports and Harbors. https://www.iaphworldports.org.

[82] The Worldwide Network of Port Cities. http://www.aivp.org/en/.

[83] Athanasios Drougkas, Anna Sarri, Pinelopi Kyranoudi and Antigone Zisi. Port Cybersecurity. Good practices for cybersecurity in the maritime sector. ENISA, 10:328515, 2019.

[84] JW Kim, JY Son and KK Yoon. An implementation of integrated interfaces for telecom systems and TMS in vessels. International Journal of Engineering and Technology, 10(2):195–199, 2018.

[85] The Marketplace of the European Innovation Partnership on Smart Cities and Communities. https://eu-smartcities.eu/.

[86] BigDataStack H2020 Project. https://www.bigdatastack.eu.

[87] SmartShip H2020 Project. https://www.smartship2020.eu/.

[88] Showkat Ahmad Bhat, Nen Fu Huang, Ishfaq Bashir Sofi and Muhammad Sultan. Agriculture-food supply chain management based on blockchain and IoT: a narrative on enterprise blockchain interoperability. Agriculture, 12(1):40, 2021.

[89] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri and Siddharth Garg. Verigen: A large language model for verilog code generation. ACM Transactions on Design Automation of Electronic Systems, 29(3):1–31, 2024.

[90] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu and Qing Wang. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proceedings of the ACM on Software Engineering, 1(FSE):2332–2354, 2024.

[91] David Rau and Jaap Kamps. Query Generation Using Large Language Models. Advances in Information Retrieval, pages 226–239. Springer, 2024.

[92] Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen tau Yih, Joelle Pineau and Luke Zettlemoyer. Improving Passage Retrieval with Zero-Shot Question Generation. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3781–3797, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.

[93] Beir Cellar. BEIR: A Heterogeneous Benchmark, 2024. accessed on 25 April 2025.

[94] Microsoft. TREC 2020 Deep Learning Track, 2020. accessed on 25 April 2025.

[95] Xuanhe Zhou, Zhaoyan Sun and Guoliang Li. Db-gpt: Large language model meets database. Data Science and Engineering, 9(1):102–111, 2024.

[96] OpenAI. OpenAI Codex, 2024. accessed on 25 April 2025.

[97] DataRobot. DataRobot AI Solutions, 2024. accessed on 25 April 2025.

[98] ThoughtSpot. ThoughtSpot AI-Powered Analytics, 2024. accessed on 25 April 2025.

[99] Tableau. Tableau Data Visualization, 2024. accessed on 25 April 2025.

[100] Microsoft. Microsoft Power BI, 2024. accessed on 25 April 2025.

[101] Apache. Apache Flink - Framework.

[102] Apache. Apache Storm - Framework.

[103] Hriday Kumar Gupta and Rafat Parveen. Comparative study of big data frameworks. 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), volume 1, pages 1–4. IEEE, 2019.

[104] Ovidiu Cristian Marcu, Alexandru Costan, Gabriel Antoniu and María S Pérez-Hernández. Spark versus flink: Understanding performance in big data analytics frameworks. 2016 IEEE International Conference on Cluster Computing (CLUSTER), pages 433–442. IEEE, 2016.

[105] Jorge Veiga, Roberto R Expósito, Xoán C Pardo, Guillermo L Taboada and Juan Touriño. Performance evaluation of big data frameworks for large-scale data analytics. 2016 IEEE International Conference on Big Data (Big Data), pages 424–431. IEEE, 2016.

[106] Guido van Rossum. Python - Programming Language.

174

BIBLIOGRAPHY

[107] Abhinav Nagpal και Goldie Gabrani. Python for data analytics, scientiﬁc and techni-

cal applications.2019 Amity international conference on artiﬁcial intelligence (AICAI),

σελίδες 140–145. IEEE, 2019.

[108] Apache. PySpark Overview - Introduction.

[109] Efstratios Karypiadis, Anastasios Nikolakopoulos, Achilleas Marinakis, Vrettos Moulos and Theodora Varvarigou. SCAL-E: An Auto Scaling Agent for Optimum Big Data Load Balancing in Kubernetes Environments. 2022 International Conference on Computer, Information and Telecommunication Systems (CITS), pages 1–5. IEEE, 2022.

[110] Apache NiFi - Framework. https://nifi.apache.org.

[111] Apache Flume. https://flume.apache.org/.

[112] Apache Kafka - Framework. https://kafka.apache.org.

[113] Kaggle. Netflix: Movies and TV Shows Dataset, 2024. accessed on 25 April 2025.

[114] Kaggle. Covid-19 Twitter Dataset, 2024. accessed on 25 April 2025.

[115] Kaggle. Shared Cars Locations Dataset: Location History of Shared Cars, 2024. accessed on 25 April 2025.

[116] AutoTel. AutoTel Project in Tel Aviv, 2024. accessed on 25 April 2025.

[117] MavenAnalytics. Madrid Daily Weather Dataset: Daily weather conditions in Madrid from 1997-2015, 2024. accessed on 25 April 2025.

[118] Kaggle. Supermarket Sales Dataset: Historical Record of Sales Data in 3 Different Supermarkets, 2024. accessed on 25 April 2025.

[119] Varun Dogra, Sahil Verma, Marcin Woźniak, Jana Shafi, Muhammad Fazal Ijaz and others. Shortcut Learning Explanations for Deep Natural Language Processing: A Survey on Dataset Biases. IEEE Access, 2024.

[120] OpenAI. HumanEval Dataset, 2024. accessed on 25 April 2025.

[121] Tianyang Liu, Canwen Xu and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023.

[122] Meta. CRUXEval: Code Reasoning, Understanding, and Execution Evaluation, 2024. accessed on 25 April 2025.

[123] Meta. CodeLlama: A State-of-the-Art Large Language Model for Coding, 2023. accessed on 25 April 2025.

[124] DeepSeekAI. DeepSeek Coder 33B Large Language Model, 2024. accessed on 25 April 2025.


[125] Meta. Llama 3: Openly Available Large Language Model, 2024. accessed on 25 April 2025.

[126] DataCamp. What is Mistral’s Codestral? Key Features, Use Cases, and Limitations, 2024. accessed on 25 April 2025.

[127] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang and others. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186, 2024.

[128] Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan and Fang Zhang. Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study. JMIR Medical Informatics, 13:e63731, 2025.

[129] Qiang Yin, Jianhua Wang, Sheng Du, Jianquan Leng, Jintao Li, Yinhao Hong, Feng Zhang, Yunpeng Chai, Xiao Zhang, Xiaonan Zhao and others. An adaptive elastic multi-model big data analysis and information extraction system. Data Science and Engineering, 7(4):328–338, 2022.

[130] Elad Yom-Tov, Shai Fine, David Carmel and Adam Darlow. Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 512–519, 2005.

[131] Jiafeng Guo and Yanyan Lan. Query classification. Query Understanding for Search Engines, pages 15–41, 2020.

[132] Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019.

[133] Emily M Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th annual meeting of the association for computational linguistics, pages 5185–5198, 2020.

[134] OTE. OTE Group of Companies.

[135] HuggingFace. Hugging Face: Codestral v01 Large Language Model, 2024. accessed on 25 April 2025.

[136] LMStudio. LM Studio Server, 2025. accessed on 25 April 2025.

[137] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[138] All You Need to Know about Location Data. https://www.quadrant.io/resources/location-data. accessed on 25 April 2025.


[139] J.H. Huh and K. Seo. An Indoor Location-Based Control System Using Bluetooth Beacons for IoT Systems. Sensors, 17:2917, 2017.

[140] Y Zhuang, J Yang, Y Li, L Qi and N El-Sheimy. Smartphone-based indoor localization with Bluetooth low energy beacons. Sensors, 16:596, 2016.

[141] R Faragher and R Harle. Location Fingerprinting With Bluetooth Low Energy Beacons. IEEE Journal on Selected Areas in Communications, 33:2418–2428, 2015.

[142] P. H. Legay and G. Roullet. LION and MAX, the experiences of two ESPRIT Projects on High-Speed MANs. High-Capacity Local and Metropolitan Area Networks: Architecture and Performance Issues. Berlin/Heidelberg, Germany, 1991.

[143] Robert Jordan and Chaouki T. Abdallah. Wireless communications and networking: An overview. IEEE Antennas and Propagation Magazine, 44:185–193, 2002.

[144] J. Crisp and B. Elliott. LANs and Topology. Oxford, UK, 2005.

[145] R. M. Heiberger and E. Neuwirth. Polynomial regression. R Through Excel: A Spreadsheet Interface for Statistics, Data Analysis, and Graphics, pages 269–284. New York, NY, USA, 2009.

[146] F Lemic, V Handziski, M Aernouts, T Janssen, R Berkvens, A Wolisz and J Famaey. Regression-based estimation of individual errors in fingerprinting localization. IEEE Access, 7:33652–33664, 2019.

[147] M.I. Ribeiro. Kalman and Extended Kalman Filters: Concept, Derivation and Properties. Institute for Systems and Robotics, 43:3736–3741, 2004.

[148] Z Chen, Q Zhu and Y. C. Soh. Smartphone inertial sensor-based indoor localization and tracking with iBeacon corrections. IEEE Transactions on Industrial Informatics, 12:1540–1549, 2016.

[149] H Zou, Z Chen, H Jiang, L Xie and C Spanos. Accurate indoor localization and tracking using mobile phone inertial sensors, WiFi and iBeacon. Proceedings of the IEEE International Symposium on Inertial Sensors and Systems (INERTIAL), pages 1–4, Kauai, HI, USA, 2017.

[150] R.K. Yadav, B. Bhattarai, H.S. Gang and J.Y. Pyun. Trusted K Nearest Bayesian Estimation for Indoor Positioning System. IEEE Access, 7:51484–51498, 2019.

[151] T.M.T. Dinh, N.S. Duong and K. Sandrasegaran. Smartphone-Based Indoor Positioning Using BLE iBeacon and Reliable Lightweight Fingerprint Map. IEEE Sensors Journal, 20:10283–10294, 2020.

[152] A.R. Pratama and R. Hidayat. Smartphone-based pedestrian dead reckoning as an indoor positioning system. Proceedings of the 2012 International Conference on System Engineering and Technology (ICSET), Bandung, West Java, Indonesia, 2012.


[153] Apple Inc. Software Development Kit (SDK) for iOS. Available online: https://developer.apple.com/ios/ (accessed on 25 April 2025).

[154] Google Inc. Software Development Kit (SDK) for Android. Available online: https://developer.android.com/tools (accessed on 25 April 2025).

[155] T.D. Vy, T. L. Nguyen and Y Shin. A precise tracking algorithm using PDR and Wi-Fi/iBeacon corrections for smartphones. IEEE Access, 9:49522–49536, 2021.

[156] K Elgui, P Bianchi, F Portier and O Isson. Learning methods for RSSI-based geolocation: A comparative study. Pervasive and Mobile Computing, 67:101199, 2020.

[157] N Duong and T Dinh. On the accuracy of iBeacon-based Indoor Positioning System in the iOS platform. Proceedings of the 2021 18th International Multi-Conference on Systems, Signals & Devices (SSD), pages 58–62, Monastir, Tunisia, 2021.

[158] M Abbas, M Elhamshary, H Rizk, M Torki and M Youssef. WiDeep: WiFi-based Accurate and Robust Indoor Localization System using Deep Learning. Proceedings of the 2019 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 1–10, Kyoto, Japan, 2019.

[159] W Njima, A Bazzi and M Chafii. DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning. IEEE Access, 10:69896–69909, 2022.

[160] T Yang, A Cabani and H Chafouk. A Survey of Recent Indoor Localization Scenarios and Methodologies. Sensors, 21:8086, 2021.

[161] F Zafari, A Gkelias and K. K. Leung. A Survey of Indoor Localization Systems and Technologies. IEEE Communications Surveys & Tutorials, 21:2568–2599, 2019.

[162] B Jang and H Kim. Indoor Positioning Technologies Without Offline Fingerprinting Map: A Survey. IEEE Communications Surveys & Tutorials, 21:508–525, 2019.

[163] S Subedi and J Pyun. A Survey of Smartphone-Based Indoor Positioning System Using RF-Based Wireless Technologies. Sensors, 20:7230, 2020.

[164] M Mallik, A Panja and C Chowdhury. Paving the way with machine learning for seamless indoor–outdoor positioning: A survey. Information Fusion, 94:126–151, 2023.

[165] Tech Insights. Apple U1 Ultra Wideband (UWB) Chip Analysis. Available online: https://www.techinsights.com/blog/apple-u1-tmka75-ultra-wideband-uwb-chip-analysis (accessed on 25 April 2025).

[166] A. Prorok and A. Martinoli. Accurate indoor localization with ultra-wideband using spatial models and collaboration. International Journal of Robotics Research, 33:547–568, 2014.


[167] A. Alarifi, A. Al-Salman, M. Alsaleh, A. Alnafessah, S. Al-Hadhrami, M.A. Al-Ammar and H.S. Al-Khalifa. Ultra Wideband Indoor Positioning Technologies: Analysis and Recent Advances. Sensors, 16:707, 2016.

[168] P. Mayer, M. Magno and L. Benini. Self-Sustaining Ultrawideband Positioning System for Event-Driven Indoor Localization. IEEE Internet of Things Journal, 11:1272–1284, 2024.

[169] G.I. Hapsari, R. Munadi, B. Erfianto and I.D. Irawati. Future Research and Trends in Ultra-Wideband Indoor Tag Localization. IEEE Access, 2024.

[170] Apple Inc. Apple iBeacon. Available online: https://developer.apple.com/ibeacon/ (accessed on 25 April 2025).

[171] Google. Eddystone Beacon API. Available online: https://github.com/google/eddystone (accessed on 25 April 2025).

[172] Apple Inc. Turning an iOS Device into an iBeacon Device. Available online: https://developer.apple.com/documentation/corelocation/turning_an_ios_device_into_an_ibeacon_device (accessed on 25 April 2025).

[173] Apple Inc. Multipeer Connectivity: Support Peer-to-Peer Connectivity and the Discovery of Nearby Devices. Available online: https://developer.apple.com/documentation/multipeerconnectivity (accessed on 25 April 2025).

[174] Google. Nearby Connections API. Available online: https://developers.google.com/nearby/connections/overview (accessed on 25 April 2025).

[175] Google. Nearby Messages API. Available online: https://developers.google.com/nearby/messages/overview (accessed on 25 April 2025).

[176] Apple Inc. Swift Programming Language. Available online: https://developer.apple.com/swift/ (accessed on 25 April 2025).

[177] F. Dalkılıç, U.C. Çabuk, E. Arıkan and A. Gürkan. An analysis of the positioning accuracy of iBeacon technology in indoor environments. Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), pages 549–553, Antalya, Turkey, 2017.

[178] Apple Inc. Multipeer Connectivity: Maximum Number of Peers. Available online: https://developer.apple.com/documentation/multipeerconnectivity/mcbrowserviewcontroller/1406954-maximumnumberofpeers (accessed on 25 April 2025).


List of Abbreviations

AI Artificial Intelligence

API Application Programming Interface

BLE Bluetooth Low Energy

CI Critical Infrastructures

DB Database

GAN Generative Adversarial Network(s)

GB/GiB Gigabyte(s)

EKF Extended Kalman Filtering

EU European Union

FP Fingerprinting

IMU Inertial Measurement Unit(s)

IPS Indoor Positioning System

JSON JavaScript Object Notation

kNN k-Nearest Neighbors (Algorithm)

LLM Large Language Model(s)

MB Megabyte(s)

ML Machine Learning

MC Multipeer Connectivity

PDR Pedestrian Dead Reckoning

PRM Polynomial Regression Model

P2P Peer To Peer

RSSI Received Signal Strength Indicator

TAN Transactional Area Network

TB Terabyte(s)

UUID Universally Unique Identifier

UWB Ultra-Wideband

VDC Virtual Data Container

VDR Virtual Data Repository

SQL Structured Query Language

