PROCEEDING

IDEAS’2025

SARAH BENZIANE

KARIMA BELMABROUK

DEKHICI LATIFA

M’HAMED BILAL ABIDINE

Oran Algeria

National Conference on

Innovation of Data Engineering

and Artificial Intelligence

11-12 June 2025

University of Science and

Technology of Oran-

Mohamed Boudiaf

IDEAS 2025 Conference Committees

CHAIRS AND EXECUTIVE COMMITTEES

HONORARY CHAIRS

•Prof. Ahmed Hamou, Rector of the University

•Prof. Bachir Djebbar, Dean of the Faculty

CONFERENCE CHAIR

•Sarah Benziane, USTO MB University

CO-CHAIR

•Karima Belmabrouk, USTO MB University

—

STEERING COMMITTEE

•Asmaa Boughrara, USTO MB

•Asmaa Ourdighi, USTO MB

•Karima Belmabrouk, USTO MB

•Latifa Dekhici, USTO MB

•Redouane Tlemsani, USTO MB

•Sarah Benziane, USTO MB

•Souad Ougouti, USTO MB

—

PUBLICATION COMMITTEE

•Redouane Tlemsani, USTO MB

•Sarah Benziane, USTO MB

—

POSTER COMMITTEE

•Asmaa Boughrara

•Asmaa Ourdighi

•Souad Ougouti

—

ORGANIZATION COMMITTEE

•Amina Medjahed, USTO MB

•Asmaa Boughrara, USTO MB

•Asmaa Ourdighi, USTO MB

•Bouchiba Guellta, USTO MB

•Chemseddine Choucha, USTO MB

•Imad Eddine Khiloun, USTO MB

•Latifa Dekhici, USTO MB

•Mohamed Khaldi, USTO MB

•Redouane Tlemsani, USTO MB

•Salim Alachaher, USTO MB

•Sarah Benziane, USTO MB

•Souad Ougouti, USTO MB

—

PROGRAM COMMITTEE

•Abdelatif Hassini, University of Oran 2

•Abdelatif Rahmoun, ESI SBA

•Abdelfettah ZEGHOUDI, University of Laghouat

•Abdelghani Djebbari, University of Tlemcen

•Abdelkader Benyettou, University of Relizane

•Abdelkrim Souahlia, University of Djelfa

•Abdelmadjid Allali, University of Chlef

•Abderrahmane Bendahmane, USTO MB

•Ahmed Roumanie, ENSTTIC Oran

•Amel Djebbar, ENSE Oran

•Amine Dahane, University of Oran 1

•Asmaa Boughrara, USTO MB

•Asmaa Ourdighi, USTO MB

•Badra Khellat Kihel, University of Oran 2

•Bouabdellah Kechar, University of Oran 1

•Boudjelal Meftah, University of Mascara

•Dounia Yedjour, USTO MB

•Fatiha Guerroudji, USTO MB

•Hachem Slimani, University of Bejaia

•Hadria Fizazi, USTO MB

•Hafid Haffaf, University of Oran 1

•Hafidha Bouziane, USTO MB

•Hamza Bousbaa, ENPO

•Hayat Bendoukha, USTO MB

•Hayat Yedjour, USTO MB

•Hicham Reguieg, USTO MB

•Karima Belmabrouk, USTO MB

•Karima Kies, USTO MB

•Khadidja Belbachir, USTO MB

•Khaled Belkadi, USTO MB

•Latifa Dekhici, USTO MB

•Lila Medebber, USTO MB

•Mahmoud Zennaki, USTO MB

•Malika ALLALI, URERMS ADRAR

•M’hamed Abidine, USTHB

•Mohamed Dahmani, USTO MB

•Mohammed Amine Chikh, University of Tlemcen

•Mostefa Benhaliliba, USTO MB

•Mourtada Benazouz, University of Tlemcen

•Naciima Mellal, University of Oum Bouaghi

•Rachida Ghoul Hadiby, USTO MB

•Reda Adjoudj, University of Sidi Bel Abbes

•Redouane Tlemsani, USTO MB

•Saber Harzallah, University of Batna

•Samir Benbekriti, ENSTTIC Oran

•Sarah Benziane, USTO MB

•Sarah Maroc, USTO MB

•Sarah Nait Bahloul, USTO MB

•Selma Khouri, ESI Alger

•Souad Ougouti, USTO MB

•Souad Khellat, USTO MB

•Soufiane Boukli Hacene, University of Sidi Bel Abbes

•Yasmina Hernane, USTO MB

FOREWORD

We are thrilled and proud to present the proceedings of the First National Conference on Innovation

in Data Engineering and AI Science (IDEAS 2025), held in the vibrant city of Oran, Algeria, and

hosted by the Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf (USTO-

MB).

IDEAS 2025 was designed as a premier forum to bring together researchers, academics, students,

and industry professionals from across the country to explore innovations and address the latest

challenges in data engineering and AI science. The fast pace of technological advancement in these

domains calls for collaboration, critical discussion, and robust idea exchange — we believe this

conference achieved all these aims.

We were impressed by both the number and the quality of submissions we received. After a

thorough review by the Program Committee, the papers accepted for inclusion in these proceedings

represent the very best original research in these innovative fields. The accepted papers span

contemporary themes such as Big Data Analytics, Machine Learning and Deep Learning

Applications, Data Quality Management, Intelligent Systems Architectures, Ethical AI, and more.

Together, they highlight ongoing research, future directions, and the significant impact these fields

are poised to make.

The successful organization of IDEAS 2025 would not have been possible without the support and

engagement of many individuals and institutions. We extend our deepest gratitude to our

Honorary Chairs, Prof. Ahmed Hamou (Rector) and Prof. Bachir Djebbar (Dean), for their

continuous support and institution-wide sponsorship. A special thank-you goes to the dedicated

members of the Program Committee and the outstanding Organizing Committee, whose

commitment ensured a smooth and engaging event. I would also like to personally thank Co-Chair

Karima Belmabrouk and all members of the Steering and Publication Committees for their

collective effort and dedication in upholding a high standard of academic rigor for these

proceedings.

We are confident that these proceedings will serve as a valuable resource for researchers and

practitioners alike, spur new research questions, and foster collaborative partnerships. We look

forward to seeing the impact of this work and are hopeful that IDEAS 2025 marks the beginning

of a continuing series of enriching scientific gatherings.

We thank all the authors for their excellent contributions and all participants for their enthusiastic

attendance.

Sarah Benziane,

Karima Belmabrouk

Conference Chair, IDEAS 2025

USTO MB University, Oran, Algeria

Table of Contents

Accurate Detection System using Eigen-CAM with YOLO Architecture and Drone Images for Rice Panicle Detection
Meroua Belmir, Wafa Difallah

A Concise Overview Of Vehicle Detection Techniques
Zentar Mohamed Dhia El Hak, Bencheriet Chemesse ennahar

An MFU-based Approach for Data Quality Management in NoSQL Document-oriented Databases
Aicha Aggoune

Biometric identification through iris image processing
DAOUDI Hadjer, HADJ SLIMANE Zine-Eddine

Demand-aware drug assignment in manipulator arm automated dispensing systems via graph convolutional network ranking
Yassine Bouhelassa, Khalid Hachemi

Brain Tumor Detection of MRI Images Using CNN Features Extraction and SVM Classification
Zouhir Iourzikene, Fawzi Gougam, Djamel Benazzouz

Using Moments Invariants for Multi Mobile Robot
BOUDRA SOUMIA

Feature Extraction and Machine Learning for Classification Date Fruit
Ikram Kourtiche, Mostefa Bendjima, Mohammed El Amin Kourtiche

Collaborative business process: A formal verification and validation
Hanane Ouaar

An Empirical Study on the Effectiveness and Efficiency of Machine Learning Classifiers for Liver Disease Prediction
Mohamed Amine NEMMICH, Asmaa BOUDALI, Noureddine BOUKHARI, Fatima DEBBAT

Large-Scale Customer Feedback Analysis via a Kafka Pipeline and Pre-Trained Transformers
Gasbaoui Mohammed el Amin, Benkrama Soumia, Bendjima Mostefa, Abden Sofiane

Comparative Analysis of Mortality Prediction Models at the University Hospital Center of Oran, Algeria
Mohammed Nadjib Osmani, Djamila Benhaddouche, Nawal Sad Houari

Data Visualization Tools in Mental Health Informatics
Imene DAHANE, Abdelkrim Mebarki, S.S BENHARRATS

Greedy-based approach to Reduce Congestion Areas in IoV
BELHADJ Aissa, KIES Ali, MOSTEFA Fatima Zahra, MEKKAKIA MAAZA Zoulikha

Heterogeneous Graph Neural Networks for Product Recommendation on Transactional Retail Data
Imad Eddine Khiloun, Karima Belmabrouk, Latifa Dekhici, Christoph Bergmeir

Using Chatbot in E-commerce to Improve Profit: Artificial Intelligence in practice
Houda EL BOUHISSI, Essaid FERHAT, Naima BOUAGAL

Predicting Fire Forest in Algeria: A new Approach
Houda EL BOUHISSI, Naima ILLOUL

Deep Learning-Based Classification of Knee Osteoarthritis Using Gaussian Noise Augmentation and Knowledge Distillation
Khadidja Messaoudene, Khaled Harrar

Bio-Driven Facial Mark Detection: Robust Celebrity Identification
Souad Khellat-Kihel

efficiency and challenges
Nassiba Wafa ABDERRAHIM

Comparative Analysis of Deep Learning-Improved Traditional PET Image Reconstruction Methods
Benyelles Asma, Korti Amel

Drone Presence Detection in Wireless Network
Asmâa Ouessai, Marouane Mekri, Mohamed Farahi, Sofiane Abdelkrim Khalladi

Student Performance Prediction Using Artificial Intelligence In Education: A Temporal Modeling Approach With OULA Dataset
Asma Bouchekouf, Toufik Bazemlal, Messaoud Mosbah

Open Circuit Fault Detection and Diagnosis in DTC-SVM of Induction Motor Drives Using Neural Network and Inverter Reconfiguration
Younes Tamissa, Farid Kadri

State of the Art on Graph-based Nutritional Recommendation Systems
DIDOUNE Nadia, SAD-HOUARI Nawal, REGUIEG Hicham

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Accurate Detection System using Eigen-CAM with

YOLO Architecture and Drone Images for Rice

Panicle Detection

Meroua Belmir#1, Wafa Difallah#*2

#Department of mathematics and computer science, Tahri Mohammed University

of Bechar. Innovation in Informatics and Engineering Laboratory (ENIE LAB),

Bechar, Algeria.

1belmir.meroua@univ-bechar.dz

*Laboratory of Energetic in arid zones (ENERGARID)

Bechar, Algeria

2difallah.wafa@univ-bechar.dz

Abstract— In rice field phenotyping, accurate identification

of rice panicles is an essential process. Nevertheless, the

traditional manual method of characterizing rice panicles

is labor-intensive and time-consuming. In this study, we

explore the use of the YOLOv9 deep learning model for the

detection of rice panicle images without human

intervention using aerial footage of rice panicles taken by

drones in Bangladesh. To enhance the interpretability of

the model's predictions, we integrated Eigen-CAM, a

gradient-free visualization technique that highlights the

regions influencing the model’s decision-making process.

The activation maps showed that YOLOv9 successfully

concentrated on the important features of rice panicles,

even under dense vegetation, occlusions, and

cluttered backgrounds. Quantitative analyses demonstrate

that YOLOv9, combined with interpretability tools, offers

a promising solution for high-throughput phenotyping and

yield estimation applications. On the test set, the proposed

model achieved a recall of 74.3%, an mAP50 of 76.8%, and an

mAP50-95 of 52.5%.

Keywords—Automated yield estimation, Rice panicle detection,

YOLOv9, drone images, Eigen-CAM.

I. INTRODUCTION

One of the biggest problems facing the world's population

today is food security [1]. The world's third-largest food crop is

rice [2], [3]. Rice is an essential food source for almost half of

the world's population. More than 65 percent of people in China

eat rice as their main food. This demonstrates the importance

of rice cultivation in feeding the world's population [4], [5]. The

number of panicles per unit area is a critical factor in grain yield

production, as it strongly correlates with overall yield.

Conventional methods for measuring this trait are labor-

intensive and time-consuming, highlighting the necessity for

high-throughput phenotyping techniques in grain

production [6]. Accurately and efficiently assessing traits of

intact rice panicles remains a significant challenge [7].

Precision agriculture (PA) has emerged as a vital approach for

achieving sustainability in modern farming systems [8].

Although there are various definitions of PA, the core idea

revolves around using advanced technologies and data-driven

strategies to support decision-making in agricultural practices.

This includes optimizing the use of inputs such as water,

fertilizers, pesticides, seeds, energy, and labor to improve crop

yields while minimizing resource waste and environmental

harm [9]. Beyond traditional crop cultivation, PA techniques

are also applied in areas like viticulture, horticulture, pasture

management, and livestock farming [10].

The emergence of drone technology, also known as

unmanned aerial vehicles (UAVs), has significantly

transformed data acquisition in agriculture by enhancing

coverage and operational efficiency [11]. Equipped with

advanced sensors such as high-resolution RGB cameras,

multispectral imagers, and radar systems, UAVs can swiftly

gather a wide range of high-quality data. This rapid data

collection enables the timely extraction of valuable insights,

which has contributed to the growing use of UAVs in various

agricultural practices [12].

In recent years, the integration of artificial intelligence with

advancements in computing power has enabled the application

of machine learning and deep learning techniques in agriculture

[13]. Traditional machine learning methods are particularly

effective for tasks involving structured data, such as

classification and prediction. Techniques like support vector

machines, artificial neural networks, and k-Nearest Neighbors

[14] have been commonly employed for purposes such as

disease detection, classification, yield forecasting, and

monitoring plant growth stages [15], [16]. Nonetheless, these

conventional approaches often fall short when tackling

complex, image-based problems. For example, in the context

of rice panicle detection, they typically depend on manually

crafted features, which limits their accuracy and

generalizability under challenging field conditions [17].

Deep learning offers the advantage of automatically learning

essential features from large datasets without the need for

manual feature engineering. Its use in agriculture has grown

rapidly in recent years, especially for object detection tasks

including crop classification [18], pest and weed recognition,

fruit detection [19] and counting, plant identification, and

monitoring of animal behavior [20]. Depending on specific

application requirements, deep learning-based object detection

methods are generally classified into two types: one-stage and

two-stage detectors. Two-stage models, including Region-

based CNN (R-CNN) [21] and Faster R-CNN [22], first

generate region proposals before classifying objects in the

second stage, prioritizing detection accuracy. In contrast, one-

stage models like YOLO (You Only Look Once) [23] and SSD

(Single Shot MultiBox Detector) [24] directly predict both

object locations and categories in a single step, achieving a

better trade-off between speed, model size, and accuracy. These

approaches allow flexibility in selecting the most suitable

model based on specific agricultural monitoring requirements

[17].

In [25], the researchers presented a generalized detection

approach for identifying curved rice panicles using an enhanced

YOLOv4 model. This method showcased the versatility of

UAV-acquired imagery across various rice cultivars and

environmental conditions, employing MobileNetv2 as the

backbone for feature extraction. Hong et al. [26] extended this

work by integrating segmentation with detection through a

modified Mask R-CNN architecture. Their approach

demonstrated strong performance in expansive field settings,

making it effective for continuous monitoring of rice

development and predicting yields. Another study [27]

proposed FS-SSD, a feature fusion and scaling-based variation

of the SSD model, tailored for detecting small objects in UAV

imagery. The FS-SSD method outperformed other techniques

in terms of detection accuracy while maintaining competitive

speed. Collectively, these studies illustrate the latest progress

in UAV-driven rice panicle detection, emphasizing the synergy

between deep learning-based object detection and advanced

imaging technologies.

In this context, the primary objective of this study is to

develop a rice panicle detection system utilizing the YOLOv9

object detection model and a dataset composed of drone-captured

images. The proposed approach leverages Eigen-CAM, a

visualization technique, to enhance the interpretability

of the model's predictions. YOLOv9 has shown strong

detection performance, making it well-suited for complex, real-

world agricultural environments. This detection system is

aimed at improving automated yield estimation and rice growth

monitoring. This paper's subsequent sections offer an in-depth

description of the experimental setup, tools, and methods

before analyzing the findings and their significance for

precision agriculture. A summary of the main conclusions,

along with potential directions for future research, is provided

at the end of the paper.

II. METHODS AND TOOLS

This study presents a comprehensive approach for rice

panicle detection using YOLOv9, enhanced with Eigen-CAM

for model interpretability. The methodology covers dataset

preparation, model architecture, training, evaluation, and

explainability analysis to ensure both high performance and

transparency in detection results, as shown in Fig. 1.

First, a labeled dataset of rice images was

collected and organized into training, validation, and test sets.

The annotations were converted to YOLO format, where each

image has a corresponding .txt file containing normalized

bounding box coordinates and class labels. A YAML

configuration file was created to define dataset paths and class

names.
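As a minimal sketch of the annotation conversion described above, the following Python helpers turn a pixel-space bounding box into a YOLO-format label line and build a small dataset YAML. The function names, the directory layout (`images/train` and so on), the class name `panicle`, and the 3840x2160 frame size are illustrative assumptions and are not taken from the authors' pipeline.

```python
from typing import List, Tuple


def to_yolo_line(class_id: int,
                 box: Tuple[float, float, float, float],
                 img_w: int,
                 img_h: int) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.

    YOLO labels store the class index followed by the box centre and size,
    each divided by the image width or height so values lie in [0, 1].
    """
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def dataset_yaml(root: str, class_names: List[str]) -> str:
    """Build the text of a dataset YAML (hypothetical folder layout)."""
    lines = [
        f"path: {root}",
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A box covering the central half of an assumed 3840x2160 frame.
    print(to_yolo_line(0, (960, 540, 2880, 1620), 3840, 2160))
    print(dataset_yaml("rice_panicles", ["panicle"]))
```

For the box in the example, the label line is `0 0.500000 0.500000 0.500000 0.500000`, since the box is centred in the frame and spans half of each dimension.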

The YOLOv9 architecture was selected for this study due to

its superior performance characteristics. The model features an

optimized CSPDarknet-based backbone for efficient feature

extraction, a PANet neck for multi-scale feature fusion, and an

anchor-free detection head for precise bounding box prediction.

These advancements contribute to the model's higher accuracy

and efficiency while maintaining real-time processing

capabilities, making it particularly suitable for agricultural

applications. The YOLOv9 model was trained and tested on the

prepared dataset, and the results were then assessed using the

evaluation metrics defined in Section II-C.

To enhance model transparency, Eigen-CAM was integrated

for visual interpretability. This technique highlights regions of

interest (ROIs) in input images that most influence the model’s

predictions by generating heatmaps that overlay activation

zones on original images, showing where the model focuses

during panicle detection.
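The overlay step mentioned above can be sketched as a simple alpha blend of a blue-to-red heatmap onto the original image. This is an illustration only: the function name `overlay_heatmap`, the linear blue-to-red colour ramp, and the default blending weight are assumptions, and the activation map is assumed to be already resized to the image and scaled to [0, 1].

```python
import numpy as np


def overlay_heatmap(image: np.ndarray, cam: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend an activation map onto an RGB image.

    image: (H, W, 3) uint8 RGB image.
    cam:   (H, W) float map in [0, 1], same height and width as the image.
    alpha: weight of the heatmap in the blend (0 = image only, 1 = heatmap only).
    """
    cam = np.clip(cam, 0.0, 1.0)
    heat = np.zeros(image.shape, dtype=np.float32)
    heat[..., 0] = 255.0 * cam          # red channel grows with activation
    heat[..., 2] = 255.0 * (1.0 - cam)  # blue channel marks low activation
    blended = (1.0 - alpha) * image.astype(np.float32) + alpha * heat
    return np.clip(blended, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    img = np.zeros((2, 2, 3), dtype=np.uint8)
    cam = np.array([[0.0, 1.0], [0.5, 0.5]])
    print(overlay_heatmap(img, cam))
```

Pixels with high activation are tinted red and low-activation pixels blue, matching the colour convention used for the heatmaps in this paper.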

Fig. 1 Overall workflow for rice panicle detection.

A. Dataset

This study uses the "Dataset of Annotated Rice Panicle

Image from Bangladesh" dataset, which can be found at

"https://data.mendeley.com/datasets/ndb6t28xbk/4." This

dataset serves as a helpful resource for agricultural research as

it offers aerial footage of rice panicles taken by drones in

Gazipur, Bangladesh. The collection includes 2193 high-

quality 4K images of rice fields, expanded to 5701 images

through augmentation techniques. Each image has been

carefully labeled to facilitate accurate automated detection of

rice panicles. Designed to assist in algorithm development, the

dataset supports applications such as crop monitoring, yield

prediction, disease detection, and plant health assessment. The

images were derived from drone footage and annotated using a

combination of manual labeling and deep learning methods in

a semi-automated process [28].

B. YOLO object detection model

Object detection is a core challenge in computer vision,

with applications spanning autonomous vehicles, agriculture,

robotics, and security monitoring. The demand for high-speed,

precise, and resource-efficient detection systems has driven

ongoing improvements in neural network designs and training

techniques [29]. Introduced in 2015 by Redmon et al., the

YOLO (You Only Look Once) framework transformed the

field by treating object detection as a unified regression task,

delivering remarkable real-time performance [30].

In February 2024, Wang et al. [31] unveiled YOLOv9, the

latest advancement in the YOLO series of object detection

models. This new version introduces two major innovations:

the Programmable Gradient Information (PGI) framework and

the Generalized Efficient Layer Aggregation Network

(GELAN).

The PGI framework tackles the common issue of

information bottlenecks in deep neural networks while enabling

deep supervision techniques to work effectively with

lightweight models. By preserving reliable gradient flow during

training, PGI enhances learning efficiency, leading to improved

accuracy across both deep and compact architectures.

GELAN, on the other hand, refines gradient path

optimization by integrating concepts from CSPNet and ELAN.

This architecture is designed to optimize the trade-off between

model size, inference speed, and detection accuracy. Its flexible

structure ensures strong performance across different

computational settings, making it suitable for deployment on

various devices including edge devices with limited resources.

Combining PGI and GELAN, YOLOv9 marks a

significant leap in efficient object detection. Although still in

its early stages, it outperforms YOLOv8 by reducing

parameters and computational costs while achieving a 0.6%

higher AP on the MS COCO dataset [32].

C. Metrics of evaluation

This paper uses precision (P), recall (R), mAP50, and mAP50-95.

Precision evaluates the model's detection by calculating the

proportion of correctly identified rice panicle pieces out of all

detected instances. Recall, on the other hand, assesses the

model's ability to avoid missing rice panicle pieces, computed

as the ratio of correctly detected pieces to the total number of

actual pieces (including both detected and missed ones) [33].

Another key evaluation metric for object detection models

is mean Average Precision (mAP), which measures the model's

average detection accuracy across different classes. For a given

class, the Average Precision (AP) is computed as the area under

the precision-recall curve. The overall mAP is then derived by

averaging the AP values across all classes. In this

context, k denotes the specific class of rice pieces,

while N represents the total number of classes in the detection

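task [34]. The relevant formulas are as follows:

$$P = \frac{TP}{TP + FP} \qquad (1)$$

$$R = \frac{TP}{TP + FN} \qquad (2)$$

$$AP = \int_{0}^{1} P(R)\, dR \qquad (3)$$

$$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k \qquad (4)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.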

D. Eigen-CAM

To better understand our model, we used Eigen class activation

mapping (Eigen-CAM) to visualize the regions that matter most

to its decisions. The maps use a color gradient from blue to

red, where red marks the most influential regions [35].

The Eigen-CAM method analyzes activation maps from

convolutional layers by utilizing principal components,

eliminating the need for gradient backpropagation. The

process for the final convolutional layer involves the following

steps:

1. The combined activation map A for input X is

decomposed using singular value decomposition

(SVD), resulting in the factorization $A = U \Sigma V^{T}$.

2. The activation map is then projected onto the primary

eigenvector derived from matrix V.

3. This projection emphasizes the most significant

components within the activation map.

Unlike some other methods, Eigen-CAM does not apply a

ReLU activation function. Mathematically, Eigen-CAM can

be expressed as:

$$L_{\text{Eigen-CAM}} = A V_1 \qquad (5)$$

where $V_1$ denotes the first eigenvector, i.e., the first column

of the matrix V [36].
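The three steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `eigen_cam` is hypothetical, the input is assumed to be the (C, H, W) activations of a single convolutional layer, and the sign correction and min-max scaling are display conveniences added here because the sign of a singular vector is arbitrary.

```python
import numpy as np


def eigen_cam(activations: np.ndarray) -> np.ndarray:
    """Project conv-layer activations onto their first principal direction.

    activations: (C, H, W) feature maps from one convolutional layer.
    Returns an (H, W) saliency map scaled to [0, 1].
    """
    c, h, w = activations.shape
    # One row per spatial location, one column per channel: A has shape (H*W, C).
    a = activations.reshape(c, h * w).T
    # A = U S V^T; the rows of vt are the right singular vectors (columns of V).
    _, _, vt = np.linalg.svd(a, full_matrices=False)
    v1 = vt[0]
    cam = (a @ v1).reshape(h, w)
    # Assumption: flip the sign so the dominant response is positive.
    if cam.sum() < 0:
        cam = -cam
    # Assumption: min-max scaling for display only (no ReLU is applied).
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


if __name__ == "__main__":
    acts = np.zeros((4, 8, 8))
    acts[:, 2:4, 5:7] = np.array([1.0, 2.0, 3.0, 4.0])[:, None, None]
    print(np.round(eigen_cam(acts), 2))
```

No gradients are computed at any point, which is why the method only needs a forward pass through the network.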

An example of applying Eigen-CAM to a CNN [37] is shown

in Fig. 2.

Fig. 2 Visualization of strawberry and bighorn detection using CNN and Eigen-

CAM [37].

III. RESULTS AND DISCUSSION

This section outlines the key findings of the comprehensive

research conducted on rice panicle detection using deep

learning techniques. The study employed the advanced

YOLOv9m model, which exhibited strong performance in

accurately identifying and localizing rice panicles in complex

agricultural environments. To train the model, experiments

were carried out on a GPU-enabled platform via Google Colab,

with a total of 150 training epochs, a learning rate of 0.01, and

a batch size of 16 to ensure optimal convergence. The model's

effectiveness was rigorously assessed through standard object

detection metrics, including precision, recall, mAP50, and

mAP50-95, providing a robust evaluation of its detection

capabilities. Furthermore, to enhance interpretability and verify

the model's decision-making process, Eigen-CAM was applied,

revealing the critical regions of interest that influenced

detection outcomes. The experimental results, including visual

detection outputs, are illustrated in Fig 3, while a detailed

quantitative summary is presented in Table 1 and Table 2 for

comparative analysis.

TABLE I

Performances of YOLOv9 model on validation set.

Model     Precision   Recall   mAP50   mAP50-95
YOLOv9    67.3%       72.7%    76.5%   51.4%

Table 1 showcases the validation results of the YOLOv9

model for rice panicle detection, highlighting key performance

metrics such as precision, recall, and mean Average Precision

(mAP) at both 50% and the averaged 50-95% IoU thresholds.

During validation, the model attained a precision of 67.3% and

a recall of 72.7%, meaning that roughly two-thirds of its detections

were correct and about 27% of annotated panicles were missed. The

mAP50 and mAP50-95 scores reached 76.5% and 51.4%

respectively, reflecting the model's capability to consistently

detect objects with varying overlap criteria. These results

demonstrate a solid learning performance, suggesting that the

model has effectively captured the distinguishing features of

rice panicles during training without overfitting.
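The overlap criteria behind mAP50 and mAP50-95 rest on the intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below shows the IoU computation and how a single prediction fares across the ten thresholds from 0.50 to 0.95 in steps of 0.05, over which mAP50-95 averages AP. Function names and the example boxes are illustrative assumptions.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# IoU thresholds 0.50, 0.55, ..., 0.95 used when averaging AP for mAP50-95.
IOU_THRESHOLDS = [i / 100 for i in range(50, 100, 5)]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_thresholds(pred: Box, gt: Box) -> Dict[float, bool]:
    """Whether a prediction counts as a match at each IoU threshold."""
    value = iou(pred, gt)
    return {t: value >= t for t in IOU_THRESHOLDS}


if __name__ == "__main__":
    pred, gt = (0, 0, 10, 8), (0, 0, 10, 10)
    print(iou(pred, gt))
    print(matches_at_thresholds(pred, gt))
```

A prediction that passes the 0.50 threshold can still fail the stricter ones, which is why mAP50-95 is always at most mAP50 and is the more demanding measure of localization quality.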

TABLE II

Performances of YOLOv9 model on test set.

Model     Precision   Recall   mAP50   mAP50-95
YOLOv9    67.1%       74.3%    76.8%   52.5%

Table 2 presents the testing results on a separate unseen

dataset to evaluate the generalization capacity of the YOLOv9

model. The model achieved a testing precision of 67.1% and a

recall of 74.3%, which points to strong detection ability in

diverse field conditions, even slightly surpassing the recall

observed on the validation set (72.7%). The mAP50 and mAP50-95 scores were

76.8% and 52.5%, respectively, which are slightly higher

than the validation scores (76.5% and 51.4%) and confirm the robustness of the

model. This close alignment between validation and testing

performance highlights the effectiveness of the YOLOv9

model in handling complex visual scenes and makes it a

promising tool for applications in automated crop monitoring

and precision agriculture. These consistent results across

datasets confirm the model's effectiveness in identifying rice

panicles in complex scenes, including those with occlusions

and dense vegetation. The high detection reliability makes

YOLOv9 a practical and scalable solution for automated

phenotyping and precision agriculture applications.

Fig. 3 Results of rice panicle detection using YOLOv9 visualization.

The results of rice panicle detection using YOLOv9 are

depicted in Fig. 3. The model successfully detects multiple rice

panicles in various field conditions. The bounding boxes,

marked in blue, indicate that the model can localize and identify

the panicles even in complex backgrounds with overlapping

plant structures. However, certain challenges were observed,

such as occasional misdetections in regions with dense foliage

and partially occluded panicles. Despite these limitations, the

model demonstrates reliable detection capabilities, making it a

promising tool for automated rice yield estimation and

precision agriculture applications.

To better understand the model’s attention and decision-

making process, we applied EigenCAM visualization to the

YOLOv9 detection results. As shown in Fig. 4, the highlighted

regions in red and yellow indicate areas where the model

focused most strongly during inference. The visualization

clearly shows that YOLOv9 successfully localized the target

object despite the presence of complex background textures,

such as dense grass. The high-activation zones precisely

correspond to the object’s structural features, demonstrating

that the model is effectively capturing salient object

characteristics. This confirms that the integration of EigenCAM

not only aids in interpretability but also validates the robustness

and reliability of YOLOv9 in handling cluttered scenes.
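
The idea behind Eigen-CAM [37] is to project the activations of a convolutional layer onto their first principal direction and display the result as a heatmap. The sketch below illustrates this on a tiny hand-made activation tensor in pure Python. It uses power iteration in place of a full SVD, and the function name, the channels x height x width input layout, and the min-max normalization are illustrative assumptions rather than the pipeline used to produce Fig. 4.

```python
import math


def eigen_cam(activations, iterations=50):
    """Project C x H x W activations onto their first principal direction.

    Power iteration on A^T A stands in for the SVD, where A has one row
    per spatial position and one column per channel.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Flatten to rows = pixels, columns = channels.
    rows = [[activations[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

    v = [1.0 / math.sqrt(channels)] * channels  # deterministic start vector
    for _ in range(iterations):
        proj = [sum(r[c] * v[c] for c in range(channels)) for r in rows]  # A v
        w = [sum(rows[i][c] * proj[i] for i in range(len(rows)))
             for c in range(channels)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero activations: nothing to project
        v = [x / norm for x in w]

    cam = [sum(r[c] * v[c] for c in range(channels)) for r in rows]
    if sum(cam) < 0:  # principal directions are sign-ambiguous
        cam = [-x for x in cam]
    lo, hi = min(cam), max(cam)
    span = hi - lo
    cam = [(x - lo) / span if span > 0 else 0.0 for x in cam]
    return [cam[y * width:(y + 1) * width] for y in range(height)]


# Two 2x2 feature maps that both respond at the bottom-right position.
acts = [[[0.0, 0.0], [0.0, 4.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
heatmap = eigen_cam(acts)
```

In this toy input the only active location is the bottom-right position, so the normalized heatmap peaks there. On a real network the activations would come from a late convolutional layer and the map would be upsampled to the image size.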

Fig. 4 Eigen-CAM visualization of rice panicle detection.

IV. CONCLUSION

In this research, the YOLOv9 model was utilized to detect

rice panicles in complex field images. The model achieved

notable results, with a precision of 67.1%, a recall of 74.3%, a

mean Average Precision at IoU threshold 0.5 (mAP50) of

76.8%, and an mAP50-95 of 52.5% on the test set. These findings

demonstrate that YOLOv9 possesses a strong capability to

accurately identify rice panicles while maintaining a good

balance between detection accuracy and model generalization.

To gain deeper insight into the model’s decision-making

process, we employed EigenCAM visualization techniques,

which highlighted the specific regions within the images that

contributed most significantly to the model’s predictions. The

activation maps revealed that YOLOv9 effectively

concentrated on the relevant features of the panicles, even in

the presence of challenging background elements such as dense

vegetation and overlapping structures. Qualitative assessments

further confirmed the model’s ability to localize and detect

panicles under diverse and complex conditions, although

certain limitations, including occasional misdetections in areas

of dense foliage and partial occlusions, were observed.

Nonetheless, the overall performance underscores YOLOv9’s

potential for application in precision agriculture, particularly

for automated rice yield estimation tasks. Moving forward,

future work could explore the integration of more sophisticated

feature extraction methods, augmentation of the dataset with

additional annotated examples from varied environments, and

the incorporation of advanced attention mechanisms to enhance

detection robustness, especially in scenes characterized by

heavy occlusion and visual clutter. Additionally, continued use

of interpretability tools like EigenCAM can provide valuable

feedback for further model refinement and ensure more

transparent, reliable deployment in real-world agricultural

settings.

REFERENCES

[1] M. Belmir, W. Difallah, and A. Ghazli, “Plant Leaf Disease Prediction

and Classification Using Deep Learning,” in 2023 International

Conference on Decision Aid Sciences and Applications (DASA), Annaba,

Algeria: IEEE, Sep. 2023, pp. 536–540. doi:

10.1109/DASA59624.2023.10286672.

[2] Q. Zhou et al., “Analyzing Nitrogen Effects on Rice Panicle

Development by Panicle Detection and Time-Series Tracking,” Plant

Phenomics, vol. 5, p. 0048, 2023, doi: 10.34133/plantphenomics.0048.

[3] S. Tan et al., “In-field rice panicles detection and growth stages

recognition based on RiceRes2Net,” Computers and Electronics in

Agriculture, vol. 206, p. 107704, Mar. 2023, doi:

10.1016/j.compag.2023.107704.

[4] X. Lu et al., “Phenotyping of Panicle Number and Shape in Rice

Breeding Materials Based on Unmanned Aerial Vehicle Imagery,” Plant

Phenomics, vol. 6, p. 0265, 2024, doi: 10.34133/plantphenomics.0265.

[5] G. Sun et al., “AirMeasurer: open-source software to quantify static

and dynamic traits derived from multiseason aerial phenotyping to

empower genetic mapping studies in rice,” New Phytologist, vol. 236, no.

4, pp. 1584–1604, Nov. 2022, doi: 10.1111/nph.18314.

[6] Z. Li et al., “LKNet: Enhancing rice canopy panicle counting accuracy

with an optimized point-based framework,” Plant Phenomics, vol. 7, no.

1, p. 100003, Mar. 2025, doi: 10.1016/j.plaphe.2025.100003.

[7] J. Sun et al., “A High-Throughput Method for Accurate Extraction of

Intact Rice Panicle Traits,” Plant Phenomics, vol. 6, p. 0213, 2024, doi:

10.34133/plantphenomics.0213.

[8] J. A. Delgado, N. M. Short, D. P. Roberts, and B. Vandenberg, “Big Data

Analysis for Sustainable Agriculture on a Geospatial Cloud Framework,”

Front. Sustain. Food Syst., vol. 3, p. 54, Jul. 2019, doi:

10.3389/fsufs.2019.00054.

[9] E. Pierpaoli, G. Carli, E. Pignatti, and M. Canavari, “Drivers of Precision

Agriculture Technologies Adoption: A Literature Review,” Procedia

Technology, vol. 8, pp. 61–69, 2013, doi: 10.1016/j.protcy.2013.11.010.

[10] R. Gebbers and V. I. Adamchuk, “Precision Agriculture and Food

Security,” Science, vol. 327, no. 5967, pp. 828–831, Feb. 2010, doi:

10.1126/science.1183899.

[11] L. Ma, M. Li, L. Tong, Y. Wang, and L. Cheng, “Using unmanned aerial

vehicle for remote sensing application,” in 2013 21st International

Conference on Geoinformatics, Kaifeng, China: IEEE, Jun. 2013, pp. 1–

5. doi: 10.1109/Geoinformatics.2013.6626078.

[12] J. Wei et al., “A Precise Plot-Level Rice Yield Prediction Method Based

on Panicle Detection,” Agronomy, vol. 14, no. 8, p. 1618, Jul. 2024, doi:

10.3390/agronomy14081618.

[13] M. Belmir, W. Difallah, and A. Ghazli, “A Reliable Apple Leaf Disease

Identification Using a Deep Learning-Based MobileNetV2 to Safeguard

Apple Fruit Safety,” in 2024 4th International Conference on Embedded

& Distributed Systems (EDiS), BECHAR, Algeria: IEEE, Nov.

2024, pp. 279–284. doi: 10.1109/EDiS63605.2024.10783370.

[14] M. Belmir, W. Difallah, and A. Ghazli, “Applying Machine Learning

Approaches with Integrated Internet of Things for Water Management

System,” in Intelligent Systems and Pattern Recognition, vol. 2304, A.

Bennour, A. Bouridane, S. Almaadeed, B. Bouaziz, and E. Edirisinghe,

Eds., in Communications in Computer and Information Science, vol.

2304. Cham: Springer Nature Switzerland, 2025, pp. 157–168. doi:

10.1007/978-3-031-82153-0_12.

[15] P. K. Sethy, N. K. Barpanda, A. K. Rath, and S. K. Behera, “Deep feature

based rice leaf disease identification using support vector machine,”

Computers and Electronics in Agriculture, vol. 175, p. 105527, Aug.

2020, doi: 10.1016/j.compag.2020.105527.

[16] A. O. Conrad, W. Li, D.-Y. Lee, G.-L. Wang, L. Rodriguez-Saona, and

P. Bonello, “Machine Learning-Based Presymptomatic Detection of Rice

Sheath Blight Using Spectral Profiles,” Plant Phenomics, vol. 2020, p.

8954085, 2020, doi: 10.34133/2020/8954085.

[17] Z. Song et al., “A Lightweight YOLO Model for Rice Panicle Detection

in Fields Based on UAV Aerial Images,” Drones, vol. 9, no. 1, p. 1, Dec.

2024, doi: 10.3390/drones9010001.

[18] S. Sagar and J. Singh, “A Comprehensive Study of Crop Disease

Detection Using Machine Learning Classification Techniques,” in Key

Digital Trends Shaping the Future of Information and Management

Science, vol. 671, L. Garg, D. S. Sisodia, N. Kesswani, J. G. Vella, I.

Brigui, S. Misra, and D. Singh, Eds., in Lecture Notes in Networks and

Systems, vol. 671. Cham: Springer International Publishing, 2023, pp.

509–525. doi: 10.1007/978-3-031-31153-6_41.

[19] F. Xiao, H. Wang, Y. Xu, and R. Zhang, “Fruit Detection and

Recognition Based on Deep Learning for Automatic Harvesting: An

Overview and Review,” Agronomy, vol. 13, no. 6, p. 1625, Jun. 2023,

doi: 10.3390/agronomy13061625.

[20] J. Chappidi and D. M. Sundaram, “A comparative study of animal

detection and classification algorithms, applications and challenges,”

presented at THE INTERNATIONAL SCIENTIFIC AND

PRACTICAL CONFERENCE RAKHMATULIN READINGS,

Tashkent, Uzbekistan, 2024, p. 020020. doi: 10.1063/5.0211343.

[21] Sumit, Shrishti Bisht, Sunita Joshi, and Urvi Rana, “Comprehensive

Review of R-CNN and its Variant Architectures,” Int Res J Adv Engg

Hub, vol. 2, no. 04, pp. 959–966, Apr. 2024, doi:

10.47392/IRJAEH.2024.0134.

[22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-

Time Object Detection with Region Proposal Networks,” 2015, arXiv.

doi: 10.48550/ARXIV.1506.01497.

[23] J. Terven, D.-M. Córdova-Esparza, and J.-A. Romero-González, “A

Comprehensive Review of YOLO Architectures in Computer Vision:

From YOLOv1 to YOLOv8 and YOLO-NAS,” MAKE, vol. 5, no. 4, pp.

1680–1716, Nov. 2023, doi: 10.3390/make5040083.

[24] W. Liu et al., “SSD: Single Shot MultiBox Detector,” 2015, doi:

10.48550/ARXIV.1512.02325.

[25] B. Sun et al., “Universal detection of curved rice panicles in complex

environments using aerial images and improved YOLOv4 model,” Front.

Plant Sci., vol. 13, p. 1021398, Nov. 2022, doi:

10.3389/fpls.2022.1021398.

[26] S. Hong, Z. Jiang, L. Liu, J. Wang, L. Zhou, and J. Xu, “Improved Mask

R-CNN Combined with Otsu Preprocessing for Rice Panicle Detection

and Segmentation,” Applied Sciences, vol. 12, no. 22, p. 11701, Nov.

2022, doi: 10.3390/app122211701.

[27] X. Liang, J. Zhang, L. Zhuo, Y. Li, and Q. Tian, “Small Object Detection

in Unmanned Aerial Vehicle Images Using Feature Fusion and Scaling-

Based Single Shot Detector With Spatial Context Analysis,” IEEE Trans.

Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1758–1770, Jun. 2020,

doi: 10.1109/TCSVT.2019.2905881.

[28] M. R. A. Rashid et al., “Comprehensive dataset of annotated rice panicle

image from Bangladesh,” Data in Brief, vol. 51, p. 109772, Dec. 2023,

doi: 10.1016/j.dib.2023.109772.

[29] M. Hussain, “YOLOv1 to v8: Unveiling Each Variant–A Comprehensive

Review of YOLO,” IEEE Access, vol. 12, pp. 42816–42833, 2024, doi:

10.1109/ACCESS.2024.3378568.

[30] M. Yaseen, “What is YOLOv9: An In-Depth Exploration of the Internal

Features of the Next-Generation Object Detector,” 2024, arXiv. doi:

10.48550/ARXIV.2409.07813.

[31] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “YOLOv9: Learning What

You Want to Learn Using Programmable Gradient Information,” Feb. 29,

2024, arXiv: arXiv:2402.13616. doi: 10.48550/arXiv.2402.13616.

[32] M. Hussain and R. Khanam, “In-Depth Review of YOLOv1 to

YOLOv10 Variants for Enhanced Photovoltaic Defect Detection,” Solar,

vol. 4, no. 3, pp. 351–386, Jun. 2024, doi: 10.3390/solar4030016.

[33] N. Mamdouh and A. Khattab, “YOLO-Based Deep Learning Framework

for Olive Fruit Fly Detection and Counting,” IEEE Access, vol. 9, pp.

84252–84262, 2021, doi: 10.1109/ACCESS.2021.3088075.

[34] S. Tang and W. Yan, “Utilizing RT-DETR Model for Fruit Calorie

Estimation from Digital Images,” Information, vol. 15, no. 8, p. 469, Aug.

2024, doi: 10.3390/info15080469.

[35] E. Önler and N. D. Köycü, “Wheat Powdery Mildew Detection with

YOLOv8 Object Detection Model,” Applied Sciences, vol. 14, no. 16, p.

7073, Aug. 2024, doi: 10.3390/app14167073.

[36] M. Giavina-Bianchi, W. G. Vitor, V. Fornasiero De Paiva, A. L. Okita,

R. M. Sousa, and B. Machado, “Explainability agreement between

dermatologists and five visual explanations techniques in deep neural

networks for melanoma AI classification,” Front. Med., vol. 10, p.

1241484, Aug. 2023, doi: 10.3389/fmed.2023.1241484.

[37] M. Bany Muhammad and M. Yeasin, “Eigen-CAM: Visual Explanations

for Deep Convolutional Neural Networks,” SN COMPUT. SCI., vol. 2,

no. 1, p. 47, Feb. 2021, doi: 10.1007/s42979-021-00449-3.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

A Concise Overview Of Vehicle Detection

Techniques

Zentar Mohamed Dhia El Hak#1, Bencheriet Chemesse ennahar*2

#Laboratoire d’automatique et Informatique de Guelma, Université 8 Mai 1945 de

Guelma

Guelma, Algeria

1mohameddhiaelhak.zentar842@gmail.com

2 cbencheriet@yahoo.fr

Abstract—Vehicle detection, a specialized subset of object

detection, has gained significant importance in recent years,

particularly in the realms of autonomous and assisted driving

technologies. This field, while promising, grapples with several

challenges including occlusion, scalability issues, and the

complexity of real-world backgrounds. This paper sets out to

provide a summary of recent state-of-the-art advancements in

vehicle detection technology. First, it organizes vehicle detection

approaches into three primary categories: classical methods, deep

learning techniques, and hybrid approaches that combine

elements of both. Within the deep learning category, the paper

further distinguishes three subcategories: anchor-based methods,

anchor-free methods, and attention-based techniques. Each of

these approaches offers unique advantages and addresses

different aspects of the vehicle detection challenge. Secondly, it

provides a literature review of different papers on vehicle

detection.

Keywords— vehicle detection, deep neural networks, traffic

surveillance, object detection.

I. INTRODUCTION

Recent years have seen remarkable advancements in

artificial intelligence, with significant impacts on computer

vision and, notably, vehicle detection technologies. The ability

to accurately identify vehicles is crucial across various

domains, including intelligent transportation systems, self-

driving vehicles, and driver assistance platforms. For practical

applications like driver assistance, vehicle detection systems

must deliver both precision and speed to safeguard all road

users. One persistent challenge in this field is dealing with

occlusions, which are particularly prevalent in busy urban

environments.

This review aims to offer a comprehensive look at

cutting-edge methods currently employed in vehicle detection. By

examining these state-of-the-art techniques, we seek to provide

insight into the current landscape of detection technologies and

their capabilities in addressing real-world challenges.

Generally speaking, a vehicle detection system architecture

consists of a training phase and a testing phase (Fig. 1).

Usually, in the training phase, the inputs are a collection of

images of vehicles and their labels. These images undergo a

pre-processing stage which includes operations like resizing,

normalization, and noise reduction. This step creates uniform

input data, establishing a consistent format that enables

effective learning in subsequent stages.
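
As a minimal illustration of the pre-processing step, the following Python sketch resizes a grayscale image with nearest-neighbor sampling and scales its pixel values to [0, 1]. The function names, the 2-D list image representation, and the choice of nearest-neighbor interpolation are assumptions made for brevity; practical systems typically rely on an image-processing library.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D list of pixel values with nearest-neighbor sampling."""
    old_h, old_w = len(image), len(image[0])
    return [[image[y * old_h // new_h][x * old_w // new_w]
             for x in range(new_w)]
            for y in range(new_h)]


def normalize(image, max_value=255.0):
    """Scale pixel intensities from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def preprocess(image, size=(2, 2)):
    """Bring every input to the same shape and value range."""
    return normalize(resize_nearest(image, size[0], size[1]))


raw = [[0, 0, 255, 255],
       [0, 0, 255, 255],
       [255, 255, 0, 0],
       [255, 255, 0, 0]]
prepared = preprocess(raw)
```

Applying the same `preprocess` function in both the training and the testing phase is what keeps the inputs consistent between the two.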

After pre-processing comes feature extraction, which

identifies significant visual elements in the images. These

elements typically include edges, shapes, textures, and other

characteristics that distinguish vehicles from their

surroundings. The extracted features are then fed into a

classifier, which is trained using the labeled data. During this

phase, the classifier learns patterns and associations between

the features and the corresponding labels. As a result, a trained

classifier model is developed, capable of accurately identifying

vehicles in new, unseen data.

In the testing phase, the trained model is deployed to analyze

real-world inputs. Camera-based sensor systems capture

images or video containing vehicles, which are then passed

through the same pre-processing pipeline to ensure consistency

with the training data. The standardized images are subjected

to feature extraction, where relevant attributes are isolated.

These features are then passed to a predictor, which utilizes the

trained classifier model to analyze the data. The predictor

determines whether vehicles are present in the input and, if so,

localizes them by drawing bounding boxes around each

detected vehicle in the scene.

The primary goal of data collection in the system mentioned

is to enhance safety. However, some individuals may perceive

it as an invasion of privacy. A study by [1] revealed that a

significant number of people expressed discomfort with the

idea of data being collected. In modern systems, data

processing often involves sharing information with third

parties, such as traffic management systems, which raises

further privacy concerns.

Fortunately, techniques like Transport Layer Security (TLS)

[2], Differential Privacy (DP) [3], and K-anonymity [4] are

already being used to address this issue. Another concern is

data retention. According to [5], there have been instances

where users of car rental services were able to access and

control vehicle systems, highlighting the need for limited data

storage. Data should not be retained indefinitely, and [6]

suggests imposing a limited lifespan for stored information.

The European Data Protection Board (EDPB) also recommends

enabling users to delete their personal data [7]. Another

possible solution is adopting a privacy-centric design, which

minimizes the amount of data collected by enforcing the

minimization principle [7], particularly for tasks like pedestrian

detection.

From an ethical standpoint, companies should be transparent

with all parties regarding the data being collected, its purpose,

how it is processed, and who has access to it [6]. Providing clear

notifications, such as a sticker on vehicles to inform pedestrians

of being recorded, can also help [8]. However, despite these

measures, [9] notes that current laws are still inadequate and do

not fully address all possible scenarios.

II. CLASSIFICATION OF VEHICLE DETECTION METHODS

Vehicle detection techniques can be categorized into three

main branches (Fig. 2): classical methods, deep learning

methods, and hybrid methods.

E. Classical Methods

This category includes traditional computer vision methods

that use hand-crafted features and conventional algorithms,

which were common before deep learning became prominent.

Techniques like Histogram of Oriented Gradients (HOG),

Support Vector Machines (SVM), and other feature-based

approaches are part of this group. While these methods

established the foundation for vehicle detection, they often

struggle with complex scenarios compared to modern

techniques.
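
To illustrate what a hand-crafted feature looks like, the sketch below computes the core building block of HOG: a histogram of unsigned gradient orientations over one image cell, weighted by gradient magnitude. It is a simplified sketch under stated assumptions: the function name is hypothetical, votes are assigned to a single bin without interpolation, and the block normalization of the full HOG descriptor is omitted.

```python
import math


def hog_cell_histogram(cell, bins=9):
    """Unsigned gradient-orientation histogram for one cell (2-D list)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(h):
        for x in range(w):
            # Finite differences with edge replication at the borders.
            gx = cell[y][min(x + 1, w - 1)] - cell[y][max(x - 1, 0)]
            gy = cell[min(y + 1, h - 1)][x] - cell[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = int(angle // bin_width) % bins
            hist[index] += magnitude  # hard assignment, no interpolation
    return hist


# Intensity grows from left to right, so every gradient points along x.
ramp = [[10 * x for x in range(4)] for _ in range(4)]
histogram = hog_cell_histogram(ramp)
```

For the horizontal ramp, all gradient energy falls into the first orientation bin. In a classical pipeline, histograms like this one are concatenated over many cells and passed to a classifier such as an SVM.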

F. Deep Learning Methods

Deep learning techniques have transformed vehicle

detection by enabling more advanced and robust models. They can

be divided into three types:

1) Anchor-based methods: These methods use predefined

anchor boxes to predict bounding boxes around vehicles. They

can be divided into one-stage detectors, such as You Only Look

Once (YOLO) [10] and Single Shot Detector (SSD) [11] [31]

[32], which perform detection in a single step for faster but

occasionally less accurate results, and two-stage detectors, such

as Fast Region-based Convolutional Neural Network (Fast R-

CNN) [12] [33], which involve an initial region proposal stage

followed by classification, resulting in higher accuracy but

greater computational complexity.

2) Anchor-free methods: These methods remove the

necessity for predefined anchor boxes by directly predicting

vehicle locations from the image. Approaches like CenterNet

[13] [34] and Fully Convolutional One-Stage Object Detection

(FCOS) [14] [35] are often simpler and can potentially provide

faster inference times.

3) Attention-based methods: These methods use attention

mechanisms to boost object detection by concentrating on the

most relevant parts of the image. Attention mechanisms

enhance feature representation and improve accuracy in

complex scenes, as seen in Detection Transformer (DETR) [15]

[19] and Vision Transformers (ViT) [16].
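
The difference between the anchor-based and anchor-free formulations can be sketched in a few lines of Python. The first function generates predefined anchor boxes around one location; the second decodes a box in the FCOS style, from the distances between a location and the four box sides. Function names, scales, and ratios are hypothetical choices for illustration, and a real anchor-based detector would additionally predict offsets that refine each anchor.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Anchor boxes (x1, y1, x2, y2) centered on (cx, cy).

    Each anchor has area scale**2 and a width/height ratio of `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def decode_anchor_free(x, y, left, top, right, bottom):
    """FCOS-style decoding: distances from a location to the four box sides."""
    return (x - left, y - top, x + right, y + bottom)


anchors = make_anchors(50, 50, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
box = decode_anchor_free(50, 50, left=10, top=5, right=20, bottom=15)
```

The anchor-based variant has to evaluate every scale and ratio combination at every location (six boxes here), whereas the anchor-free variant produces one box per location directly, which is one reason such detectors can be simpler.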

G. Hybrid Methods

These methods aim to build robust systems by integrating

classical and deep learning techniques. They strive to balance

traditional feature extraction with modern deep learning

models, leveraging the strengths of both approaches while

addressing their limitations.

Fig. 2 Categorisation of different vehicle detection techniques.

Here is a summary of some state-of-the-art papers:

[17] This paper addresses low detection accuracy in vehicle

and pedestrian detection models by introducing a

Convolutional Block Attention Module (CBAM) into the cross-

stage partial Darknet-53 (CSPDarknet53)-tiny module to

enhance feature extraction capabilities and to make up for the

shortcomings of the single attention module. In addition, it

replaced the original simple convolutional module with the

Cross Stage Partial Dense Block Layer (CSP-DBL) module to

compensate for high-resolution characteristic information and

improve detection accuracy.

To evaluate the model, the public BDD100K [22]

dataset is adopted, with average precision (AP), mean

average precision (mAP), and recall as the metrics used to

evaluate the performance of the proposed architecture. The

proposed model scores the highest in terms of precision and

achieves an 88.74% mAP.
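
For reference, AP is commonly computed as the area under the interpolated precision-recall curve of one class, and mAP as the mean of AP over all classes. The Python sketch below follows the all-point interpolation convention. The function names and the toy detection list are illustrative assumptions and are not taken from [17].

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP as the area under the interpolated precision-recall curve.

    `is_true_positive` lists detections sorted by descending confidence:
    True for a correct match, False for a false positive.
    `num_ground_truth` must be greater than zero.
    """
    tp = fp = 0
    recalls, precisions = [0.0], [0.0]
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    recalls.append(1.0)
    precisions.append(0.0)
    # Make precision monotonically non-increasing (right-to-left envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall changes.
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Three detections for one class, two ground-truth objects.
ap_car = average_precision([True, False, True], num_ground_truth=2)
```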

The results also show that the model is better at capturing

relevant objects in the dataset, with recall increases of 0.73% and

0.01% for the car and people classes, respectively. In terms

of speed, however, it scored the lowest, at 63 FPS.

[18] This paper proposes an improved lightweight YOLOX

real-time vehicle detection algorithm that enhances detection

speed and accuracy with fewer parameters. It introduces a

lightweight design of the backbone extraction network and the

α-Complete-Intersection-over-Union (α-CIoU) loss function to

improve regression accuracy and convergence speed. Inspired

by GhostNet, two improved network modules were proposed:

• The Cross Stage Partial Ghost Module (CSPGhost

Module, or CSPGM) divides the feature maps of the

input layer into two parts, with the left part going

through the stacked GM to continue feature extraction

before the two parts are merged across stages.

• The CGM structure divides the input feature maps into

two branches: the left branch adjusts the output

channel, the right branch is derived from splitting

the input feature map, and the two branches are finally

spliced.

In addition, they introduced a new loss function called

α-CIoU. It aims to improve the model by retaining all the

properties of CIoU while paying more attention to high-IoU

targets and leaving more room for optimizing targets at all levels,

achieving different levels of bounding-box regression

accuracy. The loss function decreases as IoU increases

when α (the loss function’s parameter) is greater than 1, which

improves the model accuracy without increasing inference

time.
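
The effect of the power parameter can be illustrated with the simplest member of this loss family, the basic α-IoU loss 1 − IoU^α. The Python sketch below is a simplification under stated assumptions: α-CIoU as commonly defined also raises the center-distance and aspect-ratio penalty terms of CIoU to the power α, and those terms are omitted here. The function names and the default α = 3 are illustrative and are not the settings reported in [18].

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def alpha_iou_loss(pred_box, gt_box, alpha=3.0):
    """Basic alpha-IoU loss: 1 - IoU**alpha (CIoU penalty terms omitted)."""
    return 1.0 - iou(pred_box, gt_box) ** alpha


gt = (0, 0, 10, 10)
half_overlap = (0, 0, 10, 5)  # IoU with gt is 0.5
plain_loss = alpha_iou_loss(half_overlap, gt, alpha=1.0)
powered_loss = alpha_iou_loss(half_overlap, gt, alpha=3.0)
```

With α = 1 the loss reduces to the ordinary IoU loss. Since the loss is computed only during training, changing α adds nothing to the network itself, which is consistent with the statement that inference time is unaffected.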

To evaluate the modified model, they tested it

on the BIT-Vehicle [23] dataset, which contains 9850 images,

most of which show only one or two vehicles. These images are

divided into six label categories: Bus,

Microbus, Minivan, Sedan, SUV, and Truck. As metrics, they

used parameter count, model size (MB), FPS, and mAP@0.5.

The results show that compared to the original YOLOX-S,

the proposed method achieved a 0.99% higher mAP, a 41.2%

reduction in parameters, and a 12.7% higher FPS.

[19] DETR, a one-stage object detection model with a

ResNet-50 backbone, faces limitations in small feature

resolution and slow training convergence. To improve

performance, the authors of this article made some adjustments

by reducing layers to 40, using a shortcut layer instead of max-

pooling, and adding a Spatial Pyramid Pooling (SPP) block.

The architecture includes a modified ResNet-50 for feature

extraction followed by a transformer with a multi-head self-attention

encoder-decoder and a Feed Forward Network (FFN)

for end-to-end detection. The model is trained on the MS

COCO 2017 [24] dataset and evaluated using a custom video

dataset from KELTRON [25] for vehicle detection.

Performance metrics included precision, recall, mean Average

Precision (mAP), FLOPS, and Frames Per Second (FPS).

As for the results, the model achieved 51.31% mAP on MS

COCO 2017, outperforming SSD, YOLOv3-tiny, and the

baseline DETR. However, it scored the lowest FPS at 53,

indicating a need for speed optimization for real-time detection.

The Wilcoxon test on mAP also gave a significant value of

0.03.

[20] This paper addresses the challenge of detecting multi-

scale vehicle targets, especially small ones, in traffic

monitoring videos. It proposes a codec-based vehicle detection

algorithm based on YOLOv3, which incorporates a new multi-

level feature pyramid structure with a codec module to detect

vehicles of different scales. This involves stitching multi-level

features extracted by the backbone network into basic features,

sending them to the codec module, and combining them with

equivalent-scale features at the decoder layer for detection. The

aforementioned algorithm optimizes the original YOLOv3

network for vehicle detection in traffic surveillance videos.

TABLE III

STATE OF THE ART

Paper  Architecture   Dataset              Results
[17]   YOLOv4 tiny*   BDD100K [22]         Achieves an 88.74% mAP. A lower speed of
                                           63 FPS.
[18]   YOLOX-based    BIT-Vehicle [23]     0.99% increase in mAP, 41.2% reduction in
                                           parameters, and 12.7% increase in FPS
                                           compared to the original YOLOX-S.
[19]   DETR-SPP       MS COCO 2017 [24],   Achieved 51.31% mAP, outperforming SSD,
                      KELTRON [25]         YOLO V3 tiny and baseline DETR.
[20]   YOLOv3-based   KITTI [26],          On KITTI, the algorithm achieved average
                      UA-DETRAC [27]       precisions of 95.04%, 92.39%, and 87.51%
                                           for easy, moderate, and hard subsets.
                                           On UA-DETRAC, the algorithm significantly
                                           improved over YOLOv3 across all detection
                                           conditions.
[21]   CPAM Network   MS COCO [28],        Outperformed most of the other detectors
                      UA-DETRAC [29]       on both datasets.

Fig. 1 General Vehicle Detection Framework.

The key innovations of this article are as follows:

• YOLOv3 Integration: The algorithm applies

YOLOv3 to multi-scale vehicle detection in

traffic videos and improves upon it.

• Feature Encoding and Decoding Structure Module: A

module is proposed to generate high-order multi-scale

feature maps through a simple U-shaped structure.

• Attention Mechanism: A special diagnosis module

integrated with an attention mechanism enhances the

model’s expression ability.

The proposed method is evaluated on two main

datasets. The KITTI [26] dataset was captured while

driving in and around Karlsruhe, covering the city itself,

rural areas, and nearby highways. The UA-DETRAC [27]

dataset is a real-world multi-object detection and multi-object

tracking benchmark comprising 10 hours of video footage

filmed with a Canon EOS 550D camera across 24 distinct

locations in Beijing and Tianjin, China. The videos were

recorded at a frame rate of 25 frames per second (fps) and have

a resolution of 960 x 540 pixels. As for the evaluation metrics,

average precision (AP) is used in this article to evaluate the

performance of the proposed architecture. On the KITTI dataset,

the proposed algorithm achieved an

average precision of 95.04%, 92.39%, and 87.51% on the easy,

moderate, and hard subsets respectively, outperforming

YOLOv3 by 2.49%, 3.68%, and 9.73% while

showing competitive speed in comparison to other models.

Meanwhile, on the UA-DETRAC dataset, the proposed

algorithm demonstrated significant improvements over

YOLOv3 in all detection conditions, namely easy, medium,

hard, full, sunny, rainy, night, and cloudy.

[21] This paper tackles the challenge of multi-target vehicle

detection in intelligent transportation systems (ITS),

specifically detecting small vehicles in the distant view. To

this end, the authors proposed a network called Corner

Pooling with Attention Mechanism (CPAM), which allows

anchorless detection. The major contributions of the

aforementioned network are:

• An hourglass with coordinate attention (Hourglass-CA)

as a backbone. It is based on Hourglass-104 but is

shallower, with 54 layers, which aims to

decrease the number of parameters and make the

network lighter and faster. Moreover, they introduced

three collaborative attentions into the decoder in order

to gather pivotal information on three feature scales,

namely 384, 384, and 256.

• A multi-level attention (MLA) network, which is

designed to enhance the feature maps generated

by the backbone by applying attention mechanisms at

different scales to improve detection accuracy

across varying vehicle sizes, especially small ones.

• A multi-level attention loss function, which calculates

the discrepancy between the predicted attention maps

and the ground-truth feature maps so that the network learns to prioritize

important features and correct any deviations that

occur during the up-sampling process.

The model is evaluated on two main datasets: MS COCO

[28], which contains three types of vehicles (car, bus, and truck),

and UA-DETRAC [29], which focuses on occluded vehicles.

The metrics used for evaluating the model are

precision, recall, and average precision (AP).

On the UA-DETRAC dataset, the proposed architecture

achieves an mAP of 70.64%, outperforming other detectors like

Faster R-CNN, YOLOv3, CenterNet, and CornerNet. It reaches an

mAP of 90.72%, 74.12%, and 52.94% on the easy, medium, and

hard subsets. In addition, the model performs well under

different weather conditions, achieving an mAP of 76.16%,

78.62%, and 59.37% on the cloudy, sunny, and rainy subsets.

Meanwhile, on MS COCO dataset, the model reaches an AP of

43.3%, AP50 of 59.2%, AP75 of 46.9%, APs of 24.4%, APm

of 44.8%, and APl of 57.5%.

III. CRITICAL ANALYSIS

Current research demonstrates a clear shift toward

combining multi-scale fusion, attention mechanisms, and

lightweight backbones to enhance both accuracy and

efficiency. Traditional HOG variants continue to benefit from

modern improvements, for instance standard HOG reaches

93% accuracy but suffers from errors due to inaccurate

hypothesis generation [36], enhanced HOG achieves 97% with

near real-time performance but could be less computationally

demanding [37], while region-driven HOG (RDHOG)

increases accuracy to over 99% on traffic footage but deemed

not successful when detecting vehicles in complex scenes with

multiple occlusions [38]. Meanwhile, Haar-like cascades

operate in under 5ms per window but fails under occlusion or

in dense traffic scenarios [39]. Even deformable part models

(DPM) in [40] improve through PCA-based filter compression

and FFT-accelerated convolutions, reducing parameters by

30% and accelerating matching, though significant overlap

remains problematic [41]. Dense-ResNet hybrids combine

residual and dense connections to surpass YOLOv3 by more

than 5 AP on small/medium vehicles but need 2 to 3 times

more memory and approximately 50% longer training [42].

Two-stage frameworks such as the Improved Region-based Convolutional
Neural Network for Vehicle Detection (IRCNN-VD) in [43] eliminate
background pixels using the Scale-Invariant Feature Transform (SIFT),
utilize hard negative mining, and employ evolutionary hyperparameter
tuning to achieve 0.85 mAP on the BOXY dataset [44] in less than 1 ms,
which is twice Faster R-CNN's speed, but they also incur a high
computational cost and lack weather resilience. Retinex

preprocessing [45] with a Neural Architecture Search [46]

(NAS)-optimized ResNet101 backbone and Intersection over

Union (IoU)-guided anchors improves UA-DETRAC mAP

from 62.13% to 68.25% and small-vehicle AP from 14.16% to

43.64%, but operates below 2 FPS [47].

Attention-enhanced architectures further improve detection

performance, as shown earlier in [17], where CBAM in

CSPDarknet53-tiny with CSP-DBL attains 88.74% mAP on

BDD100K with better car and pedestrian recall at 63 FPS,

while a streamlined YOLOX-S in [18] decreases parameters by

41.2%, increases mAP by 0.99%, and enhances FPS by 12.7%

on BIT-Vehicle. Transformer hybrids also excel: DETR-SPP in

[19] incorporates spatial pyramid pooling with a reduced

ResNet-50 backbone to elevate MS COCO mAP to 51.31%, outperforming
SSD-YOLOv3 tiny, but is limited to 53 FPS. The Swin Transformer
adaptation in [48] for hazy conditions reaches 91% AP on the authors'
self-made Haze-Car dataset and 82.3% on the Real Haze-100 dataset with a
modest speed reduction. A codec-based, multi-level feature pyramid with

attention integrated into YOLOv3 delivers 95.04% AP on

KITTI and notable UA-DETRAC improvements across easy,

medium, and hard categories [20]. Lastly, the anchorless

CPAM network in [21] featuring an Hourglass-CA backbone,

multi-level attention modules, and specialized attention loss

reaches 70.64% mAP on UA-DETRAC with consistent AP

under various weather conditions, alongside strong AP scores

on MS COCO.

Based on this, we can say that traditional approaches offer

low-complexity, real-time detection but handle occlusion

scenarios and dense traffic poorly. While

deep‑learning approaches improve accuracy and small-object

detection capabilities, they do so at the expense of increased

computational costs and reduced adverse‑weather robustness.

In addition, attention-enhanced and transformer-based models

further advance detection under occlusion, multi‑scale, and

low‑visibility scenarios, but they require greater processing

time and more powerful hardware resources.

This further demonstrates the necessity for context-specific

solutions rather than universal models.

IV. FUTURE DIRECTIONS

As we mentioned earlier, occlusion is one of the key

challenges that vehicle detection models must overcome. Real-

world environments are constantly shifting, ranging from sunny

to rainy, foggy, and other adverse weather conditions that can

obscure the visibility of vehicles. Such environmental

variations directly impact the model's ability to detect and

recognize vehicles accurately. Currently, most algorithms are

tailored to perform well in specific settings, and there is still no

universal approach capable of adapting to diverse conditions.

This limitation highlights the importance of developing a

unified framework that combines multiple detection strategies

to ensure robustness across various weather scenarios [30].

In addition to these challenges, achieving a balance between

speed and accuracy remains a difficult task in model design.

Often, improving one leads to a compromise on the other,

which in practical applications can reduce the overall

robustness of the system [30]. Given that the backbone of a

deep learning model plays a central role in performance, much

of the research effort is focused on designing more advanced

and efficient backbones that can help achieve this crucial

balance [30].

V. CONCLUSION

Vehicle detection is a vital area in computer vision, playing

a key role in enhancing technologies like driver assistance

systems. Despite challenges such as occlusion, varying object

scales, and complex backgrounds, researchers continue to

develop more advanced and robust

detection methods. While significant progress has

been made, the field still offers many opportunities for further

research and innovation.

REFERENCES

[1] C. Bloom, J. Tan, J. Ramjohn, and L. Bauer, ‘Self-Driving Cars and Data

Collection: Privacy Perceptions of Networked Autonomous Vehicles’.

[2] C. Allen and T. Dierks, ‘The TLS Protocol Version 1.0’, Internet

Engineering Task Force, Request for Comments RFC 2246, Jan. 1999.

doi: 10.17487/RFC2246.

[3] C. Dwork and A. Roth, ‘The Algorithmic Foundations of Differential

Privacy’, FNT in Theoretical Computer Science, vol. 9, no. 3–4, pp.

211–407, 2013, doi: 10.1561/0400000042.

[4] J. Wang, Z. Cai, and J. Yu, ‘Achieving Personalized k-Anonymity-

Based Content Privacy for Autonomous Vehicles in CPS’, IEEE Trans.

Ind. Inf., vol. 16, no. 6, pp. 4242–4251, Jun. 2020, doi:

10.1109/TII.2019.2950057.

[5] European Data Protection Supervisor, TechDispatch: connected cars.

Issue 3, 2019. Publications Office of the European Union, 2019.

Accessed: Aug. 15, 2024. [Online]. Available:

https://data.europa.eu/doi/10.2804/70098

[6] European Data Protection Supervisor, “Resolution on Data Protection in

Automated and Connected Vehicles,” Accessed: Aug. 15, 2024.

[Online]. Available:

https://www.edps.europa.eu/sites/default/files/publication/resolution-on-data-protection-in-automated-and-connected-vehicles_en_1.pdf.

[7] European Data Protection Board, “EDPB Guidelines on Connected

Vehicles,” Accessed: Aug. 15, 2024. [Online]. Available:

https://www.edpb.europa.eu/system/files/2021-

03/edpb_guidelines_202001_connected_vehicles_v2.0_adopted_en.pdf

[8] I. Krontiris et al., ‘Autonomous Vehicles: Data Protection and Ethical

Considerations’, in Computer Science in Cars Symposium, Feldkirchen

Germany: ACM, Dec. 2020, pp. 1–10. doi: 10.1145/3385958.3430481.

[9] IEEE Spectrum, “Self-Driving Cars Will Be Ready Before Our Laws

Are,” Accessed: Aug. 14, 2024. [Online]. Available:

https://spectrum.ieee.org/selfdriving-cars-will-be-ready-before-our-

laws-are.

[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘You Only Look

Once: Unified, Real-Time Object Detection’, in 2016 IEEE Conference

on Computer Vision and Pattern Recognition (CVPR), Jun. 2016,

pp. 779–788. doi: 10.1109/CVPR.2016.91.

[11] W. Liu et al., “SSD: Single Shot MultiBox Detector,” in Computer

Vision– ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds.,

Cham: Springer International Publishing, 2016, pp. 21–37. doi:

10.1007/978-3-319-46448-0_2.

[12] R. Girshick, ‘Fast R-CNN’, in 2015 IEEE International Conference on

Computer Vision (ICCV), Dec. 2015, pp. 1440–1448. doi:

10.1109/ICCV.2015.169.

[13] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, ‘CenterNet:

Keypoint Triplets for Object Detection’, in 2019 IEEE/CVF

International Conference on Computer Vision (ICCV), Oct. 2019, pp.

6568–6577. doi: 10.1109/ICCV.2019.00667.

[14] Z. Tian, C. Shen, H. Chen, and T. He, ‘FCOS: Fully Convolutional One-

Stage Object Detection’, in 2019 IEEE/CVF International Conference

on Computer Vision (ICCV), Oct. 2019, pp. 9626–9635.

doi: 10.1109/ICCV.2019.00972.

[15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and

S. Zagoruyko, “End-to-End Object Detection with Transformers,” in

Computer Vision – ECCV 2020, vol. 12346, A. Vedaldi, H. Bischof, T.

Brox, and J.-M. Frahm, Eds., Cham: Springer International Publishing,

2020, pp. 213–229. doi: 10.1007/978-3-030-58452-8_13.

[16] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers

for Image Recognition at Scale,” 2020. doi:

10.48550/ARXIV.2010.11929.

[17] J. Li, Z. Xu, and L. Xu, “Vehicle and pedestrian detection method based

on improved YOLOv4-tiny,” Optoelectronics Letters, vol. 19, no. 10,

pp. 623–628, Oct. 2023. doi: 10.1007/s11801-023-3078-x.

[18] C. Xiong, A. Yu, S. Yuan, and X. Gao, ‘Vehicle detection algorithm

based on lightweight YOLOX’, SIViP, vol. 17, no. 5, pp. 1793–1800, Jul.

2023, doi: 10.1007/s11760-022-02390-1.

[19] K. S P and P. Mohandas, ‘DETR-SPP: a fine-tuned vehicle detection

with transformer’, Multimed Tools Appl, vol. 83, no. 9, pp. 25573–

25594, Aug. 2023, doi: 10.1007/s11042-023-16502-7.

[20] F. Hong, C.-H. Lu, C. Liu, R.-R. Liu, and J. Wei, ‘A Traffic Surveillance

Multi-Scale Vehicle Detection Object Method Base on Encoder-Decoder’,

IEEE Access, vol. 8, pp. 47664–47674, 2020,

doi: 10.1109/ACCESS.2020.2979260.

[21] L.-Y. Hao, J.-R. Yang, Y. Zhang, and J. Zhang, ‘Multi-target vehicle

detection based on corner pooling with attention mechanism’, Appl

Intell, vol. 53, no. 23, pp. 29128–29139, Dec. 2023, doi:

10.1007/s10489-023-05084-4.

[22] ‘Data Download — BDD100K documentation,’Accessed: Jul. 27, 2024.

[Online]. Available: https://doc.bdd100k.com/download.html.

[23] Z. Dong, Y. Wu, M. Pei, and Y. Jia, ‘Vehicle Type Classification Using

a Semisupervised Convolutional Neural Network’, IEEE Transactions

on Intelligent Transportation Systems, vol. 16, no. 4, pp. 2247–2256,

Aug. 2015, doi: 10.1109/TITS.2015.2402438.

[24] ‘Papers with Code - MS COCO Dataset’. Accessed: Jul. 27,

2024. [Online]. Available: https://paperswithcode.com/dataset/coco.

[25] KELTRON dataset, “KELTRON,” Accessed: Jun. 10, 2024.

[Online]. Available: https://www.keltron.org/index.php/intelligent-traffic-systems.

[26] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:

The KITTI dataset,” The International Journal of Robotics Research,

vol. 32, no. 11, pp. 1231–1237, Sep. 2013, doi:

10.1177/0278364913491297.

[27] “UA-DETRAC Dataset | Papers With Code,” Accessed: Jul. 27,

2024. [Online]. Available: https://paperswithcode.com/dataset/ua-detrac.

[28] T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,”

2014, arXiv. doi: 10.48550/ARXIV.1405.0312.

[29] L. Wen et al., “UA-DETRAC: A new benchmark and protocol for

multiobject detection and tracking,” Computer Vision and Image

Understanding, vol. 193, p. 102907, Apr. 2020, doi:

10.1016/j.cviu.2020.102907.

[30] M. A. Berwo et al., “Deep Learning Techniques for Vehicle Detection

and Classification from Images/Videos: A Survey,” Sensors, vol. 23,

no. 10, Art. no. 10, Jan. 2023, doi: 10.3390/s23104832.

[31] Z. Chen et al., ‘Fast vehicle detection algorithm in traffic scene based on

improved SSD’, Measurement, vol. 201, p. 111655, Sep. 2022, doi:

10.1016/j.measurement.2022.111655.

[32] J. Cao et al., ‘Front Vehicle Detection Algorithm for Smart Car Based

on Improved SSD Model’, Sensors, vol. 20, no. 16, Art. no. 16, Jan.

2020, doi: 10.3390/s20164646.

[33] N. Arora, Y. Kumar, R. Karkra, and M. Kumar, ‘Automatic vehicle

detection system in different environment conditions using fast R-CNN’,

Multimed Tools Appl, vol. 81, no. 13, pp. 18715–18735, May 2022, doi:

10.1007/s11042-022-12347-8.

[34] Y. Sun, Z. Li, L. Wang, J. Zuo, L. Xu, and M. Li, ‘Automatic Detection

of Vehicle Targets Based on CenterNet Model’, in 2021 IEEE

International Conference on Consumer Electronics and Computer

Engineering (ICCECE), Jan. 2021, pp. 375–378. doi:

10.1109/ICCECE51280.2021.9342498.

[35] ‘Double FCOS: A Two-Stage Model Utilizing FCOS for Vehicle

Detection in Various Remote Sensing Scenes | IEEE Journals &

Magazine | IEEE Xplore’. Accessed: Apr. 10, 2025. [Online]. Available:

https://ieeexplore-ieee-org.sndl1.arn.dz/document/9793845

[36] M. Cheon, W. Lee, C. Yoon, and M. Park, ‘Vision-Based Vehicle

Detection System With Consideration of the Detecting Location’, IEEE

Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp.

1243–1252, Sep. 2012, doi: 10.1109/TITS.2012.2188630.

[37] H. Tehrani Niknejad, A. Takeuchi, S. Mita, and D. McAllester, ‘On-

Road Multivehicle Tracking Using Deformable Object Model and

Particle Filter With Improved Likelihood Estimation’, IEEE

Transactions on Intelligent Transportation Systems, vol. 13, no. 2, pp.

748–758, Jun. 2012, doi: 10.1109/TITS.2012.2187894.

[38] B.-F. Wu, C.-C. Kao, C.-L. Jen, Y.-F. Li, Y.-H. Chen, and J.-H. Juang,

‘A Relative-Discriminative-Histogram-of-Oriented-Gradients-Based

Particle Filter Approach to Vehicle Occlusion Handling and Tracking’,

IEEE Transactions on Industrial Electronics, vol. 61, no. 8, pp. 4228–

4237, Aug. 2014, doi: 10.1109/TIE.2013.2284131.

[39] ‘Vision-based scale-adaptive vehicle detection and tracking for

intelligent traffic monitoring | IEEE Conference Publication | IEEE

Xplore’. Accessed: May 21, 2025. [Online]. Available:

https://ieeexplore-ieee-org.sndl1.arn.dz/document/7090470

[40] ‘Improvement and Application of Deformable Parts Model in Vehicle

Detection-【VIP Journal Official Website】- Chinese Journal Service

Platform’. Accessed: May 21, 2025. [Online]. Available:

https://cstj.cqvip.com/Qikan/Article/Detail?id=7000043723

[41] C. Ma and F. Xue, ‘A review of vehicle detection methods based on

computer vision’, Journal of Intelligent and Connected Vehicles, vol. 7,

no. 1, pp. 1–18, Mar. 2024, doi: 10.26599/JICV.2023.9210019.

[42] X. Sun, Q. Huang, Y. Li, and Y. Huang, ‘An Improved Vehicle

Detection Algorithm Based on YOLOV3’, in 2019 IEEE Intl Conf on

Parallel & Distributed Processing with Applications, Big Data & Cloud

Computing, Sustainable Computing & Communications, Social

Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom),

Dec. 2019, pp. 1445–1450. doi: 10.1109/ISPA-BDCloud-SustainCom-

SocialCom48970.2019.00208.

[43] Y. Djenouri, A. Belhadi, G. Srivastava, D. Djenouri, and J. Chun-Wei

Lin, ‘Vehicle detection using improved region convolution neural

network for accident prevention in smart roads’, Pattern Recognition

Letters, vol. 158, pp. 42–47, Jun. 2022, doi:

10.1016/j.patrec.2022.04.012.

[44] K. Behrendt, ‘Boxy Vehicle Detection in Large Images’, in 2019

IEEE/CVF International Conference on Computer Vision Workshop

(ICCVW), Oct. 2019, pp. 840–846. doi: 10.1109/ICCVW.2019.00112.

[45] D. J. Jobson, ‘Retinex processing for automatic image enhancement’, J.

Electron. Imaging, vol. 13, no. 1, p. 100, Jan. 2004, doi:

10.1117/1.1636183.

[46] B. Zoph and Q. V. Le, ‘Neural Architecture Search with Reinforcement

Learning’, 2016, arXiv. doi: 10.48550/ARXIV.1611.01578.

[47] J. Luo, H. Fang, F. Shao, Y. Zhong, and X. Hua, ‘Multi-scale traffic

vehicle detection based on faster R–CNN with NAS optimization and

feature enrichment’, Defence Technology, vol. 17, no. 4, pp. 1542–1554,

Aug. 2021, doi: 10.1016/j.dt.2020.10.006.

[48] Z. Sun, C. Liu, H. Qu, and G. Xie, ‘A Novel Effective Vehicle Detection

Method Based on Swin Transformer in Hazy Scenes’, Mathematics, vol.

10, no. 13, p. 2199, Jun. 2022, doi: 10.3390/math10132199.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

An MFU-based Approach for Data Quality

Management in NoSQL Document-oriented

Databases

Aicha Aggoune#*1

1aggoune.aicha@univ-guelma.dz

#Computer science Department, University of 8th May 1945

Guelma, Algeria

*LabSTIC Laboratory, University of 8th May 1945

Guelma, Algeria

Abstract— NoSQL databases offer a flexible schema for

representing data, unlike the rigid schema of relational

databases. However, this flexibility can introduce challenges in

data quality. In this paper, we examine NoSQL document-oriented

databases, especially MongoDB, and propose an

approach based on the Most Frequently Used (MFU) method to

detect and repair data quality issues by leveraging the most

frequently used elements. The proposed approach deals with

three data quality issues: schema overlap, missing data, and data

redundancy. Experimental evaluations on real-world MongoDB

datasets demonstrate that our MFU-based method effectively

enhances data consistency and completeness while reducing

redundancy. This work provides a practical framework for

improving data quality in NoSQL environments, ensuring more

reliable data for analytical and operational tasks.

Keywords— MongoDB, NoSQL document-oriented database,

Data quality, MFU.

I. INTRODUCTION

Data quality is a major challenge for organizations,

impacting both data analysis performance and financial

planning. High-quality data enables companies to enhance

operational efficiency, improve customer satisfaction, and

maintain a competitive edge by swiftly adapting their business

strategies. Joseph M. Juran [1] defines data quality as follows:

“data to be of high quality if they are fit for their intended uses

in operations, decision-making and planning”. Larry P.

English [2], an early leader in data quality management,

defined data quality as follows:

• The ability to consistently fulfill the expectations of

knowledge workers and end customers across all quality

aspects of information products and services necessary

for achieving organizational goals.

• The extent to which data reliably meets the needs and

expectations of knowledge workers who depend on it

for their tasks.

In addition, Wang and Strong [16] define a "data quality

dimension" as a set of data quality attributes that represent a

single aspect or construct of data quality. They present four

categories of dimensions:

- Intrinsic DQ denotes that data have quality in their

own right. It includes accuracy, believability,

objectivity, and reputation,

- Contextual DQ highlights the requirement that data

quality must be considered within the context of the

task at hand; that is, data must be relevant, timely,

complete, and appropriate in terms of amount so as to

add value,

- Representational DQ includes aspects related to the

format of the data (concise and consistent

representation) and meaning of data (interpretability

and ease of understanding), and

- Accessibility consists of accessibility and access

security.

In the context of NoSQL databases, the flexibility of

schemas means that data can be stored without a predefined

structure, allowing dynamic and varying formats within the

same database [10]. This enables developers to modify

document structures on the fly and handle diverse data types

without strict constraints. However, it also introduces

challenges in data consistency, quality management, and

query optimization. Our primary focus is on data quality

management in MongoDB, one of the most widely used

NoSQL databases, where data is stored in collections of JSON

documents. We examine three key data quality issues in

MongoDB: schema overlap, which impacts the concise

representation and representational consistency of document-

oriented NoSQL data; completeness, which relates to the

challenge of missing values; and conciseness, which addresses

data redundancy.

To tackle these challenges, we propose an approach based

on the Most Frequently Used (MFU) method. This approach

identifies the most commonly used schema elements and data

values to detect inconsistencies and enhance data quality. By

leveraging MFU analysis, our proposal improves schema

standardization, reduces redundancy, and ensures data

completeness.

Furthermore, we implement and evaluate our approach

using real-world datasets to demonstrate its effectiveness in

improving data quality in NoSQL document-oriented

databases.

The remainder of this paper is structured as follows:

Section 2 reviews related work on data quality management in

NoSQL databases. Section 3 presents our proposed approach,

detailing how the MFU method detects and resolves

data quality issues. Section 4 discusses the experimental

results, including the implementation of our data quality

management tool for MongoDB and key findings from our

experiments. Finally, Section 5 concludes the paper by

summarizing key insights and outlining potential future

research directions.

II. RELATED WORK

Many studies and frameworks have been developed for

managing data quality in relational databases and linked open

data [3], [4]. These areas have been extensively researched

due to the structured nature of relational data and the growing

need for semantic web technologies. However, while research

on NoSQL data quality is less extensive compared to

relational databases, it has been gaining attention in recent

years.

In the context of NoSQL columnar databases, several

approaches have been proposed. For example, Frozz et al.

[13] suggested replicating the hierarchical structure into a

new namespace and analyzing each column to infer data

types. In contrast, Bouhamoum et al. [14] explored schema

discovery and entity linkage in RDF data sources through

clustering techniques, operating without predefined schema

information.

In our work, we shift the focus to document-oriented

databases such as MongoDB, which are widely used but

present challenges related to maintaining data consistency and

integrity [15], [17].

Störl et al. [5] present a middleware called Darwin, which

enables the extraction of a NoSQL schema description, the

discovery of schema version history, and the proposal of

mappings between these versions. The databases supported by

Darwin include MongoDB and CouchDB. Darwin consists of

four main components: Schema Extraction Manager, Schema

Evolution Manager, Data Migration Manager, and Query

Rewriting Manager. Schema quality control is based on

detecting various operations applied to the schema, such as

adding a new attribute, renaming an attribute, and other

modifications.

An extension of Darwin [6] focuses on schema evolution,

which relies on five key operations: add, delete, rename, copy,

and move. These operations serve as the foundation for

maintaining data quality control.

Cristalli et al. [7] proposed a data quality control

framework for MongoDB databases, based on data quality

dimensions such as consistency, completeness, and

conciseness. Their framework relies on predefined quality

rules for each dimension and includes a process for data

quality verification. This approach can help organizations

enhance data quality and improve the reliability of analyses

based on MongoDB data.

Conrad et al. [8] propose a framework for evaluating

NoSQL systems with evolving schemas. It consists of two

main components: json-data-generator, which generates a

JSON file for evaluation, and EvoBench runner, which

assesses the performance of schema evolution in document-

oriented NoSQL databases like MongoDB and column-

oriented databases like Cassandra.

Möller et al. [11] addressed the challenge of data quality

management in NoSQL databases during their evolution. As

NoSQL collections may contain datasets in multiple versions,

they often become heterogeneous over time. To handle this,

the authors proposed four heterogeneity classes based on

schema evolution operations—ranging from highly structured

datasets with 1:1 cardinalities to unstructured datasets with

arbitrary cardinalities. Data quality is assessed across three

key dimensions: completeness, actuality, and consistency.

Asaad et al. [12] applied an ER-defined quality framework

to NoSQL, using a sample of diverse NoSQL schemas and

both industrial and academic participants. A decision tree is

utilized to describe the heuristics of data model assessment,

and an analysis is performed to identify inter-annotator

disagreement, quality criterion importance, and quality trade-

offs.

Despite these advancements, research on NoSQL data

quality remains relatively underdeveloped compared to that of

relational databases. Existing studies primarily focus on

managing data quality during schema evolution, often without

providing detailed methodologies or extensive experimental

evaluations. Moreover, there is still a need for more

comprehensive solutions capable of addressing dynamic

schema changes, large-scale data validation, and real-time

quality monitoring.

III. PROPOSED MFU-BASED APPROACH FOR DATA QUALITY

MANAGEMENT

We present a document-oriented data quality management

approach based on the Most Frequently Used (MFU)

elements.

We define document-oriented data quality based on three

key dimensions: representational consistency, concise

representation, and completeness. These dimensions influence

other quality aspects, such as accuracy, conciseness,

coherence, and ease of understanding.

Our approach addresses three important data quality dimensions:

• Representational DQ: Identify the correct schema by analysing the most
frequently used schema across documents within a collection. Documents
that lack attributes required by the selected schema are updated by
adding these attributes with a null value.

• Contextual DQ, especially completeness: After resolving schema overlap
and selecting the correct schema, the second issue is incomplete data.
Our approach provides multiple strategies for handling missing values:

1. Mean imputation: replacing missing values with the average.

2. Mode imputation: replacing missing values with the most frequently
used value.

3. Removal of documents containing missing data.

• Concise representation: A violation occurs when the frequency of a data
entry exceeds one. The correction involves removing duplicate documents
from the collection.

A. Data quality detection

Data quality detection identifies data issues. We distinguish three types
of data issues:

• Schema overlap: In NoSQL databases, the order and number of fields (or
attributes) are not constrained, which can lead to poor data quality and
create challenges in data manipulation and query interpretation. The MFU
method examines documents within a collection to identify schema
inconsistencies. If different schemas are detected, it displays each
schema's frequency.

• Data incompleteness: The MFU method analyzes the list of documents in
the collection, identifies those with missing attributes based on the
most frequently used schema, and proceeds with the correction process.

• Data duplication: The MFU method analyzes the documents in the
collection and detects redundancy based on the frequency of document
occurrences (excluding the unique document ID). It then displays the
duplicate documents. If duplicates are found, MFU proceeds with the
correction process.
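The first two detection steps can be illustrated with a minimal Python sketch over an in-memory list of documents. This is an illustration under stated assumptions, not the implementation of our tool: a schema is taken to be the set of top-level field names (order ignored, nested structure not inspected), a field holding None counts as missing, and the function names and sample documents are hypothetical. The `_id` field is MongoDB's document identifier and is excluded.

```python
from collections import Counter


def schema_of(doc):
    """Schema signature: sorted top-level field names, ignoring MongoDB's _id."""
    return tuple(sorted(k for k in doc if k != "_id"))


def schema_frequencies(docs):
    """Count how many documents share each schema signature."""
    return Counter(schema_of(d) for d in docs)


def find_incomplete(docs, pivot):
    """Return (index, missing_fields) for documents lacking pivot attributes."""
    report = []
    for i, d in enumerate(docs):
        missing = [f for f in pivot if d.get(f) is None]
        if missing:
            report.append((i, missing))
    return report


# Hypothetical sample collection.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10, "deaths": 1},
    {"_id": 2, "country": "France", "confirmed": 20, "deaths": 2},
    {"_id": 3, "country": "Italy", "confirmed": 30},
    {"_id": 4, "state": "Texas", "confirmed": 5, "deaths": None},
]
freqs = schema_frequencies(docs)
pivot, count = freqs.most_common(1)[0]   # most frequently used schema
incomplete = find_incomplete(docs, pivot)
```

Here `freqs` plays the role of the (schema, frequency) pairs displayed to the user, and `pivot` is the most frequently used schema against which missing attributes are reported.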

B. Data repairing

Data repair involves applying corrective actions to data

after identifying quality issues.

Schema overlap: Repairing schema overlap uses the results of the overlap
detection process, which generates a set of (schema, frequency) pairs.
The schema with the highest frequency is selected as the correct schema.
We then apply this pivot schema to the database as follows:

• Analyze the collection and apply the selected schema to each document.

• If an attribute exists in a document, retrieve its value and insert it
into the new collection.

• If an attribute is missing, replace it with 'Null'.
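The three steps above can be sketched in Python as follows. This is a hedged illustration rather than the tool's code: Python's None stands in for 'Null', the helper name is hypothetical, and the sketch assumes that attributes outside the pivot schema are not copied to the new collection, which the description above does not specify.

```python
def apply_pivot_schema(docs, pivot):
    """Build a new collection in which every document has exactly the pivot fields."""
    repaired = []
    for d in docs:
        # Keep the document identifier when present.
        new_doc = {"_id": d["_id"]} if "_id" in d else {}
        for field in pivot:
            # Existing values are preserved; missing attributes become None ('Null').
            new_doc[field] = d.get(field)
        repaired.append(new_doc)
    return repaired


# Hypothetical example with a pivot schema of three attributes.
pivot = ("confirmed", "country", "deaths")
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10, "deaths": 1},
    {"_id": 2, "country": "Italy", "confirmed": 30},
    {"_id": 3, "state": "Texas", "confirmed": 5},
]
new_collection = apply_pivot_schema(docs, pivot)
```

The original documents are left untouched, so the repaired collection can be inspected before it replaces the source collection.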

Data incompleteness: Several approaches can address data incompleteness:

• Removing documents with missing values: This approach is suitable when
the percentage of missing attributes in a document exceeds a predefined
threshold, which limits information loss while maintaining data quality.

• Imputation of missing values: Missing values can be replaced with a
statistical measure such as the mean or mode. To minimize bias, imputed
values should closely reflect the actual data distribution.
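Both strategies can be sketched with Python's standard `statistics` module. The sketch below is an assumption-laden illustration: the 0.5 threshold, the function names, and the sample documents are hypothetical, the field list is assumed to be non-empty, and mean imputation is only meaningful for numeric attributes.

```python
from statistics import mean, mode


def drop_sparse(docs, fields, threshold=0.5):
    """Remove documents whose share of missing fields exceeds the threshold."""
    kept = []
    for d in docs:
        missing = sum(1 for f in fields if d.get(f) is None)
        if missing / len(fields) <= threshold:
            kept.append(d)
    return kept


def impute(docs, field, strategy="mean"):
    """Fill missing values of one field with the mean or the mode of observed values."""
    observed = [d[field] for d in docs if d.get(field) is not None]
    if not observed:
        return list(docs)  # nothing to learn the fill value from
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [d if d.get(field) is not None else {**d, field: fill} for d in docs]


# Hypothetical sample documents.
docs = [
    {"country": "Algeria", "confirmed": 10},
    {"country": "Algeria", "confirmed": None},
    {"country": None, "confirmed": 30},
    {"country": "France", "confirmed": 20},
    {"country": None, "confirmed": None},
]
kept = drop_sparse(docs, ("country", "confirmed"))
filled = impute(impute(kept, "confirmed", "mean"), "country", "mode")
```

In this example the last document is removed because all of its attributes are missing, and the remaining gaps are filled with the mean of `confirmed` and the most frequent `country`.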

Data Duplication: One of the major challenges in

MongoDB document-oriented databases is data duplication.

To address this issue, we use the following process:

1. Select a collection from the database.

2. Retrieve a document (excluding its unique identifier).

3. Iterate through all documents in the collection and

check for duplicates by comparing them to the

previously retrieved document.

4. Repeat steps 2 and 3 for each document in the

collection.

5. Display the list of duplicate documents found.

6. The user decides whether to delete the duplicate

documents based on the results.
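The process above compares each document against all others. A minimal Python sketch that produces the same list of duplicates is given below; note that it swaps the pairwise comparison of steps 2 to 4 for a single pass that groups documents by a canonical JSON serialization of their content, which is an assumption of this sketch and not necessarily how our tool is implemented. The function names and the `confirm` flag, which stands in for the user's decision in step 6, are hypothetical.

```python
import json
from collections import defaultdict


def content_key(doc):
    """Canonical string of a document's content, excluding its unique identifier."""
    body = {k: v for k, v in doc.items() if k != "_id"}
    return json.dumps(body, sort_keys=True, default=str)


def find_duplicates(docs):
    """Return groups of _id values whose documents have identical content."""
    groups = defaultdict(list)
    for d in docs:
        groups[content_key(d)].append(d.get("_id"))
    return [ids for ids in groups.values() if len(ids) > 1]


def remove_duplicates(docs, confirm=True):
    """Keep the first document of each group when the user confirms deletion."""
    if not confirm:
        return list(docs)
    seen, kept = set(), []
    for d in docs:
        key = content_key(d)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept


# Hypothetical sample: documents 1 and 3 differ only in _id and field order.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "country": "France", "confirmed": 20},
    {"_id": 3, "confirmed": 10, "country": "Algeria"},
]
duplicates = find_duplicates(docs)
deduplicated = remove_duplicates(docs)
```

Sorting the keys before serialization makes the comparison insensitive to field order, which matches the observation that field order is not significant in document stores.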

The MFU method enhances data quality in MongoDB through

a two-step process: detection and repair.

Detection: MFU identifies schema overlaps by analyzing

document structures and reporting schema frequencies. It

detects data incompleteness by comparing documents to the

most frequent schema and flags missing attributes. For

duplication, it compares document contents (excluding IDs) to

find redundant entries.

Repair: The most frequent schema is applied to standardize

documents: existing values are preserved, and missing ones are

filled with Null. Incomplete records are either removed or

repaired using imputation (e.g., mean or mode). Duplicate

documents are listed for user review and optional deletion.

This approach ensures consistent, complete, and duplicate-free

data suitable for reliable analysis.

IV. EXPERIMENTAL EVALUATION AND RESULTS

Ensuring data quality after detection and correction is

crucial for evaluating the effectiveness of the proposed

method. Firstly, we developed a data quality management tool

for MongoDB. We ran our experiments on a standard Linux

machine equipped with a 2.4 GHz dual-core CPU, 8 GB of

RAM, and 350 GB of standard storage. The approach is

implemented in Python 3.10.4, while the data are managed in

MongoDB. We use a real-world COVID-19 database,

published on February 14, 2022, and provided by Johns

Hopkins University (JHU) [9]. According to MongoDB

developer Maxime Beugnet, the COVID-19 database serves as

an excellent dataset for educational purposes and personal

projects. Table I provides a description of the database used.

TABLE I

DESCRIPTION OF DATASET

Name of database          Covid19
Number of collections     3
Description of collections
    Beford                1 document
    Benford_view          193 documents
    Dataset               11000 documents
Total of documents        11194 documents

The following figure displays the main interface of our

tool.

Fig. 1 Data quality management tool

Our approach begins with the detection and repairing of

schema overlap. The result of data detection is illustrated in

Fig. 2.

Fig. 2 Schema overlap repair interface

The proposed approach is used to apply the most frequently

occurring schema to the collection.

Afterward, the approach scans the entire collection to

ensure that no schema inconsistencies remain (see Fig.3).

Fig. 3 Example of validation of data quality management

We follow the same steps for the other two issues, applying

the defined corrections accordingly.

We conducted both qualitative and quantitative evaluations.

The quality verification process assesses the reliability of

the repaired data by reintroducing the output of the MFU

method as input. If MFU effectively enhances data quality, the

final output should align with the expected clean data, thereby

validating the success of the repair process.
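This verification step can be pictured as a fixed-point check: feeding the repaired output back into the method should change nothing and should reveal a single schema. The Python sketch below illustrates the idea for schema repair only; it is a simplified assumption about the procedure, and the `repair` and `verify` helpers are hypothetical stand-ins for the tool's detection and repair functions.

```python
from collections import Counter


def schema_of(doc):
    """Schema signature: sorted top-level field names, ignoring MongoDB's _id."""
    return tuple(sorted(k for k in doc if k != "_id"))


def repair(docs):
    """Apply the most frequently used schema to every document."""
    if not docs:
        return []
    pivot = Counter(schema_of(d) for d in docs).most_common(1)[0][0]
    return [{"_id": d.get("_id"), **{f: d.get(f) for f in pivot}} for d in docs]


def verify(repaired):
    """Re-run detection on the repaired output: at most one schema may remain."""
    return len(Counter(schema_of(d) for d in repaired)) <= 1


# Hypothetical sample collection with two competing schemas.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "country": "France", "confirmed": 20},
    {"_id": 3, "state": "Texas", "confirmed": 5},
]
once = repair(docs)
twice = repair(once)   # reintroduce the output as input
stable = verify(once) and once == twice
```

If `stable` is false, the repair either left schema inconsistencies behind or keeps modifying already repaired data, and the result should not be trusted.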

In the quantitative evaluation, we focus on the following

metrics:

1. Measure the number of documents that follow the

correct or consistent schema within the same

collection. This metric reflects the representational

dimension of data quality.

2. Measure the amount of missing data to evaluate the

completeness dimension of data quality.

3. Measure the number of duplicate records in the dataset

to assess the concise representation dimension of data

quality.
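The three measures above can be illustrated with a short sketch. It assumes that documents are plain dictionaries with the MongoDB `_id` field already removed, that a missing value is an absent field or `None`, and that duplicates are exact copies; the function names and sample data are hypothetical.

```python
import json

def consistency_rate(docs, reference_fields):
    """Metric 1: share of documents whose field set equals the reference schema."""
    ok = sum(1 for d in docs if set(d) == set(reference_fields))
    return ok / len(docs)

def missing_rate(docs, reference_fields):
    """Metric 2: share of expected field values that are absent or None."""
    expected = len(docs) * len(reference_fields)
    missing = sum(1 for d in docs for f in reference_fields if d.get(f) is None)
    return missing / expected

def duplicate_count(docs):
    """Metric 3: number of documents that exactly repeat an earlier document."""
    seen = set()
    duplicates = 0
    for d in docs:
        key = json.dumps(d, sort_keys=True, default=str)  # canonical form of the document
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Hypothetical sample data.
fields = ["country", "confirmed"]
docs = [
    {"country": "Algeria", "confirmed": 10},
    {"country": "Algeria", "confirmed": 10},   # exact duplicate
    {"country": "France", "confirmed": None},  # missing value
    {"country": "Italy"},                      # missing field
]
rates = (consistency_rate(docs, fields), missing_rate(docs, fields), duplicate_count(docs))
```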

Table 2 provides the results for two collections: Benford_view (193 documents) and Dataset (11000 documents).

TABLE III

EXPERIMENTAL RESULT

Collection   | Measure for Dimension 1 | Measure for Dimension 2 | Measure for Dimension 3
Benford_view | 98.5%                   | 0.1%
Dataset      | 98.1%                   | 0.5%

The experimental results demonstrate the effectiveness of

the MFU-based approach in detecting and repairing data

quality issues in MongoDB. By applying the most frequently

used schema, handling schema overlap, missing values, and

eliminating duplicate documents, our method improves data

consistency, completeness, and conciseness. The verification

process confirms that the repaired data aligns with expected

quality standards, validating the reliability of the proposed

approach.

V. CONCLUSION

We proposed a data quality management approach for

document-oriented NoSQL databases, focusing on schema

overlap, data incompleteness, and duplication issues. Using

the Most Frequently Used (MFU) method, our approach

detects and corrects these issues by identifying the most

common schema, imputing missing values, and removing

redundant records.

Through experimental evaluation on a real-world dataset,

we demonstrated the effectiveness of our method in improving

data consistency, completeness, and conciseness. The

verification process confirmed the reliability of the repaired

data, ensuring that the proposed approach enhances data

quality in MongoDB databases.

Future research should investigate machine learning

techniques and automated schema evolution tracking to


improve data quality in NoSQL systems, particularly for

large-scale datasets.

REFERENCES

[1] J.M. Juran, F.M. Gryna, and R.S. Bingham, Quality Control Handbook, New York: McGraw-Hill, 1979, vol. 3.

[2] L.P. English, Improving data warehouse and business information

quality: methods for reducing costs and increasing profits. John Wiley

& Sons, Inc, 1999.

[3] I. F. Ilyas and X. Chu, “Trends in cleaning relational data: Consistency

and deduplication”, Found. Trends Databases, vol. 5, no.4, pp.281-

393, 2015.

[4] A. Hadhiatma, Improving data quality in the linked open data: a

survey, Journal of Physics: Conference Series, vol. 978, no. 1,

pp.012026, 2018.

[5] U. Störl, D. Müller, A. Tekleab, S. Tolale, J. Stenzel, M. Klettke, and

S. Scherzinger, Curating variational data in application development,

IEEE 34th International Conference on Data Engineering (ICDE), pp.

1605-1608, 2018.

[6] M. L. Möller, M. Klettke, and U. Störl, Keeping NoSQL databases up to date: semantics of evolution operations and their impact on data quality, 2019.

[7] E. Cristalli, F. Serra, and A.Marotta, Data quality evaluation in

document oriented data stores, In Advances in Conceptual Modeling:

ER 2018 Workshops Emp-ER, MoBiD, MREBA, QMMQ, SCME, pp.

309-318, Springer International Publishing, 2018.

[8] A. Conrad, M.L. Möller, T. Kreiter, J.C. Mair, M. Klettke, and U.

Störl, EvoBench: Benchmarking schema evolution in NoSQL.

In Performance Evaluation and Benchmarking: 13th TPC Technology

Conference, TPCTC 2021, Copenhagen, Denmark, August 20, 2021.

[9] (2022) The MongoDB website. [Online]. Available:

https://www.mongodb.com/developer/article/johns-hopkins-university-

covid-19-graphql-api/

[10] A. Aggoune, “An Overview on the Mapping Techniques in NoSQL

Databases”, Int. J. Inf. Appl. Math., vol. 3, no. 2, pp. 53–65, 2020.

[11] M.L. Möller, D. Hausler, S. Strasser, T. Auge, and M. Klettke et al.,

Heterogeneity in NoSQL Databases-Challenges of Handling schema-

less Data. LWDA. pp. 134-145, 2023.

[12] C. Asaad, K. Baïna, M. Ghogho, Investigating the Perceived Usability

of Entity-Relationship Quality Frameworks for NoSQL Databases. In:

Mosbah, M., Kechadi, T., Bellatreche, L., Gargouri, F. (eds) Model and

Data Engineering. MEDI 2023. Lecture Notes in Computer Science,

vol 14396. Springer, Cham.

[13] A.A. Frozza, E.D. Defreyn, and R. Mello, A process for inference of columnar NoSQL database schemas, in Anais do XXXV Simpósio Brasileiro de Bancos de Dados (SBBD), SBC, pp. 175–180, 2020.

[14] R. Bouhamoum et al., “Scaling up schema discovery for RDF datasets”, in 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), pp. 84–89, 2018.

[15] A. Aggoune, and M.S. Namoune, Metadata-driven data migration from

object-relational database to nosql document-oriented database,

Comput. Sci, vol.23, 2022.

[16] R. Y. Wang and D. M. Strong, “Beyond Accuracy: What Data Quality

Means to Data Consumers”, J. Manag. Inf. Syst, vol.12, no.4, pp. 5–33,

1996.

[17] A. Aggoune and M.S. Namoune, “P3 Process for Object-Relational

Data Migration to NoSQL Document-Oriented Datastore”, Int. J.

Softw. Sci. Comput. Intell, vol.14, no.1, pp. 1-20, 2022.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Biometric identification through iris image

processing

DAOUDI Hadjer, HADJ SLIMANE Zine-Eddine1,2

1,2Biomedical Engineering Laboratory, Biomedical Engineering Department, Faculty of Technology,

University of Tlemcen, B.P 230 Tlemcen (13000), Algeria

E-mail address: hadjer.daoudi10@gmail.com, hadjslim@yahoo.fr

 Abstract— Iris recognition is a secure biometric technology

known for its stability and privacy. Since each iris is unique and

undergoes little change throughout a person's life, this method is

considered more reliable and less influenced by external factors

than other biometric techniques. This paper presents an end-to-

end methodology for iris enrollment, focusing on real-world

applicability and robustness. Unlike prior fragmented

approaches, our method integrates standardization,

segmentation, and feature extraction with modern preprocessing

techniques, providing a practical solution for database creation.

Keywords : Biometrics, Identification, Authentication,

Medical Images, Iris.

I. INTRODUCTION

Biometric identification, which lies at the intersection of

technology and human biology, has become crucial in various

applications, ranging from security systems to access control

and medical diagnostics. Among the different biometric

modalities, iris recognition stands out for its exceptional

accuracy and reliability. The iris, with its intricate patterns and

unique characteristics, serves as a robust biometric marker for

individual identification. Thanks to advancements in image

processing and computer vision, iris recognition systems have

become increasingly sophisticated, enabling seamless

authentication and identification processes. To implement this

system, the user enrollment phase is a key step, as it creates a

unique database of irises. An effective enrollment process is

essential to ensure the accuracy and reliability of subsequent

identifications while adhering to ethical and security

principles.

II. RELATED WORK

In 1993, J. Daugman proposed the first iris recognition system

based on iris images. In this system, he approximates the

boundaries of the iris with two non-concentric circles. To

detect these two boundaries, he proposes an integro-

differential operator that functions as a circular contour

detector. The operator searches in the filtered image for

circular contours that maximize the contrast relative to the

radius. An important advantage of this method is that it

operates directly on a gradient image without requiring

thresholding. However, this method is sensitive to noise

present in the image, which can result in strong gradients and

lead to false detections of circular contours [1].

Wildes [2] assumes that the boundaries of the iris can be approximated by non-concentric circles. He was the first to use the Hough transform to detect the iris with circular contours. He applied the Hough transform to an edge image obtained from a vertical gradient to detect the outer boundary, and to an edge image obtained from both vertical and horizontal gradients to detect the inner boundary.

Huang et al. [3] indicate that a direct application of the

integro-differential operator requires a large computation time

due to its global search approach. In order to reduce the

complexity of calculations and improve the performance of

the method, they propose to first find the boundaries of the iris

in a reduced image, and then use this information to guide the

search in the original image [4].

Mohammed et al. [5], [6] transform the iris image into a

binary image using a simple thresholding operation. They then

apply morphological operations such as dilation to isolate the

iris from the rest of the image.

Lili and Mei [7] propose that the histogram of the iris image

should contain three main peaks, corresponding to the pupil,

the iris, and the scleral region. They use this assumption to

provide an initial coarse localization of the iris.

While seminal works by Daugman [1] and Wildes [2] laid the

foundation of iris recognition, recent methods have leveraged

deep learning techniques for improved segmentation and

feature extraction. For instance, U-Net and its variants have

shown high performance in noisy environments [8], while

GAN-based augmentation enhances recognition robustness [9].

However, many of these approaches require large datasets and

powerful GPUs, making them difficult to deploy in


lightweight systems. Our proposed method addresses this gap

by combining classical techniques with optimized

preprocessing to achieve high-quality segmentation under

realistic constraints.

Limitations of previous methods: Early approaches are

sensitive to noise and reflections. Wildes' use of Hough

Transform is computationally expensive, while Daugman’s

integro-differential operator lacks robustness under poor

illumination. Deep learning methods, although accurate, often

lack explainability and are resource-intensive.

III. PROPOSED WORK

The proposed approach focuses on the user enrollment phase

of an iris recognition system and aims to achieve accurate iris

segmentation and feature extraction under realistic constraints

(e.g., moderate image quality, limited computing resources).

Our method combines classical computer vision techniques

with carefully optimized preprocessing steps to ensure

robustness and efficiency.

A. Image Acquisition and Preprocessing

We start by acquiring high-resolution eye images using a

near-infrared (NIR) camera under controlled illumination

conditions to reduce noise caused by reflections or shadows.

The images are resized to a standardized dimension (e.g.,

128×128 pixels) during the normalization step, ensuring

consistency for subsequent stages.

To enhance image quality, we apply a Gaussian filter that

reduces high-frequency noise while preserving important

edges. This step is crucial for making iris textures more

distinguishable during segmentation and feature extraction.

Image filtered with a Gaussian filter
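The smoothing step can be illustrated with a small separable Gaussian blur written in plain Python. This is a sketch, not the authors' implementation: the kernel radius, the value of sigma, the border handling (edge replication) and the test image are all assumptions chosen for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Filter every row with the kernel, replicating the border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)  # clamp to the image
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))

# Hypothetical 5x5 test image: one bright pixel on a dark background.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 255.0
blurred = gaussian_blur(image, sigma=1.0, radius=1)
```

The isolated bright pixel is spread over its neighbours, which is the behaviour that suppresses high-frequency noise; a larger sigma smooths more strongly but also blurs the iris edges used later for segmentation.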

B. Iris Segmentation

Segmentation is achieved using a hybrid approach involving

both edge detection and shape fitting:

1. Edge Detection: The Canny edge detector is applied

to locate prominent edges in the eye image, enabling

identification of iris boundaries despite variations in

lighting or contrast.

2. Circular Shape Detection: The Hough Transform is

used to accurately locate circular contours

corresponding to the outer iris boundary and the

inner pupil boundary. This allows for precise

definition of the iris region.

3. Mask Generation: A binary mask is created to

isolate the iris and remove irrelevant parts such as the

pupil and sclera. This mask is then applied to the

preprocessed image, producing a clean iris region of

interest (ROI).

Canny edge detection

Hough transform

Extraction of the iris
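The mask generation step (step 3) can be sketched as follows, assuming the Hough transform has already returned one circle for the pupil and one for the outer iris boundary. The `(cy, cx, r)` circle format, the function names and the 9x9 example are hypothetical.

```python
def inside_circle(y, x, circle):
    """True if pixel (y, x) lies inside or on the circle given as (cy, cx, r)."""
    cy, cx, r = circle
    return (y - cy) ** 2 + (x - cx) ** 2 <= r ** 2

def iris_mask(height, width, iris_circle, pupil_circle):
    """Binary mask: 1 inside the iris circle but outside the pupil circle, 0 elsewhere."""
    return [
        [
            1 if inside_circle(y, x, iris_circle) and not inside_circle(y, x, pupil_circle) else 0
            for x in range(width)
        ]
        for y in range(height)
    ]

def apply_mask(image, mask):
    """Keep the pixels selected by the mask and set every other pixel to 0."""
    return [[p * m for p, m in zip(img_row, mask_row)] for img_row, mask_row in zip(image, mask)]

# Hypothetical example: uniform 9x9 image, iris radius 3, pupil radius 1.
image = [[100] * 9 for _ in range(9)]
mask = iris_mask(9, 9, iris_circle=(4, 4, 3), pupil_circle=(4, 4, 1))
roi = apply_mask(image, mask)
```

Because the two circles are passed separately, the sketch also covers the non-concentric case, where the pupil centre is slightly offset from the iris centre.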

C. Feature Extraction

For the extracted iris ROI, we apply two feature extraction techniques:

• Histogram of Oriented Gradients (HOG): This descriptor captures local gradient orientations, which are effective in encoding iris textures and unique patterns.
• Intensity Histogram Analysis: Pixel intensity distributions are analyzed to capture global texture features and contrast variations within the iris.

These combined features form a robust descriptor vector for

each iris, enabling high-precision identification even in the

presence of noise or minor deformations.
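A simplified sketch of the combined descriptor is given below. It is not a full HOG implementation: the whole ROI is treated as a single cell with no block normalization, and the bin counts, the 8-bit intensity range and the test image are assumptions made for illustration.

```python
import math

def intensity_histogram(image, bins=8, max_value=256):
    """Normalized histogram of pixel intensities (assumes values in [0, max_value))."""
    hist = [0] * bins
    count = 0
    for row in image:
        for p in row:
            hist[min(int(p) * bins // max_value, bins - 1)] += 1
            count += 1
    return [h / count for h in hist]

def orientation_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient angles in [0, 180) degrees."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central difference along x
            gy = image[y + 1][x] - image[y - 1][x]  # central difference along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle * bins / 180.0), bins - 1)] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # L2 normalization
    return [v / norm for v in hist]

def iris_descriptor(image):
    """Concatenate the orientation and intensity histograms into one vector."""
    return orientation_histogram(image) + intensity_histogram(image)

# Hypothetical 6x6 test image with a vertical edge: dark left half, bright right half.
image = [[0, 0, 0, 200, 200, 200] for _ in range(6)]
descriptor = iris_descriptor(image)
```

For this test image all gradients are horizontal, so the orientation part concentrates in the first bin, while the intensity part splits evenly between the dark and bright bins.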


Histogram of Oriented Gradients (HOG)

D. Data Storage and User Profile Creation

The extracted features are linked to user identity data (e.g., ID,

name) and stored securely in a biometric database. Prior to

storage, data quality checks ensure that only images meeting

required standards (e.g., illumination, sharpness) are retained.

In compliance with ethical guidelines, user consent is obtained

before acquisition, and data protection mechanisms are

implemented to secure the biometric records against

unauthorized access.

IV. DISCUSSION

Iris detection is a crucial step in biometric recognition.

Classical techniques such as the Canny edge detector and the

Hough transform have been widely used due to their

simplicity and effectiveness.

The Canny detector is efficient at extracting fine edges,

helping to delineate the iris boundary. However, its

performance degrades in the presence of noise, varying

contrast, or low illumination. The Hough transform, on the

other hand, is powerful for detecting circular boundaries but is

computationally expensive and sensitive to parameter settings.

Recent works have proposed deep learning-based approaches

that significantly improve segmentation and robustness under

noisy or complex conditions. For example, U-Net

architectures and attention-based models can accurately

extract the iris region even in challenging images.

Nonetheless, these methods often require high computational

power and large labeled datasets, which limits their

deployment in lightweight or embedded systems.

Our proposed approach balances efficiency and accuracy by

integrating classical methods with optimized preprocessing.

While not entirely novel, this combination enables fast, robust

segmentation without the overhead of deep models. The use of

Gaussian filtering, normalization, and mask refinement

improves edge detection and circle fitting reliability.

Moreover, combining HOG and histogram-based features

ensures a more comprehensive characterization of the iris

texture, enhancing identification performance.

Compared to prior approaches, our method:

• Reduces computational cost,
• Maintains segmentation accuracy under real-world acquisition conditions,
• Supports easier integration into low-resource environments.

This hybrid strategy makes it particularly suitable for

applications where real-time performance, explainability, and

cost-effectiveness are essential.

V. CONCLUSION

The user registration phase is a crucial step in an iris

recognition system, as it allows for the creation of a unique

iris database. An effective registration process ensures the

accuracy and reliability of subsequent identification while also

adhering to ethical and security considerations.

REFERENCES

[1] Daugman, J.G., 'High confidence visual recognition of persons by a test of statistical independence', IEEE TPAMI, Vol. 15, No. 11, pp. 1148-1161, 1993.

[2] Wildes, R.P., 'Iris Recognition: An Emerging Biometric Technology', Proceedings of the IEEE, Vol. 85, No. 9, pp. 1348-1363, 1997.

[3] Huang, Y., Luo, S., & Chen, E., 'An efficient iris recognition system', Proc. Int. Conf. on Machine Learning and Cybernetics, Vol. 1, pp. 450-454, 2002.

[4] Alaa Hilal, 'Système d'identification à partir de l'image d'iris', Thèse Université Libanaise, 2013.

[5] Mohammed, G.J., et al., 'A New Localization Method for Iris Recognition', Int. Workshop on Education Technology and Computer Science, Vol. 3, pp. 316-320, 2009.

[6] Mohammed, G.J., et al., 'A new localization algorithm for iris recognition', Information Technology Journal, Vol. 8, No. 2, pp. 226-230, 2009.

[7] Lili, P., & Mei, X., 'The algorithm of iris image preprocessing', IEEE Workshop on Automatic Identification Technologies, pp. 134-138, 2005.

[8] Ronneberger, O., Fischer, P., & Brox, T., 'U-Net: Convolutional Networks for Biomedical Image Segmentation', MICCAI, 2015.

[9] Goodfellow, I., et al., 'Generative adversarial nets', Advances in Neural Information Processing Systems, 27, 2014.

[10] Loey, M., El-Bakry, H.M., & Nordin, M.J., 'A hybrid deep transfer learning model for iris recognition', Journal of Ambient Intelligence and Humanized Computing, 12, pp. 9745–9761, 2021.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 17-18, 2025, Oran, Algeria.

Demand-aware drug assignment in manipulator

arm automated dispensing systems via graph

convolutional network ranking

Yassine Bouhelassa #1, Khalid Hachemi 2

1,2LGPMI, Institute of Maintenance and Industrial Safety, University of Oran 2

Mohamed Ben Ahmed, B.P1015 El M'naouer 31000 Oran, Algeria

1bouhelassayassine777@gmail.com

3hachemi.khalid@univ-oran2.dz

Abstract— This paper introduces a GCN-based ranking model to

optimize drug placements within Automated Drug Dispensing

Systems (ADDS). Our method defines drugs as nodes that

incorporate features extracted from their monthly consumption

frequency, linking drugs in an interdependence graph through a

co-use matrix. The graph convolutional network uses message-

passing procedures to derive abstract representations that reveal

drug demand behaviour and co-prescription relationships before

generating drug scoring values. An optimized placement matrix is

then constructed by applying these ranking scores to position

high-value drugs in central compartment locations of a

manipulator arm-based ADDS. The optimized placement

configuration resulted in a 27% reduction in average retrieval

times compared to random drug positioning approaches.

Additionally, t-SNE analysis of the drug embeddings produced

meaningful clusters corresponding to drug relevancy. This

approach is fully adaptive to different ADDS systems,

demonstrating its potential for operational improvements in

healthcare facilities.

Keywords— Artificial intelligence, Graph Neural Networks,

Automated Drug Dispensing Systems, Graph Convolutional

Networks, Healthcare logistics, Drug assignment Optimization.

I. INTRODUCTION

Modern healthcare achieves better drug retrieval efficiency

through Automated Drug Dispensing Systems (ADDS) which

minimize human error in pharmaceutical management.

Automated drug distribution systems deliver drugs up to 94% faster than traditional manual methods [1]. However, one of

the primary challenges in ADDS is the optimal arrangement of

medications, as utilization rates vary among products and the

combinations of drugs in prescriptions differ.

In this work, we present a manipulator-arm-based automated

drug dispensing system that offers improved accuracy in drug

retrieval operations (Fig 1). Medications are stored in

standardized bins within a rack system, and a robotic arm

follows optimized routes to retrieve drugs efficiently. We

employ a Graph Convolutional Network (GCN) to implement

an effective placement strategy by learning latent

representations of drugs based on their monthly consumption

frequencies and co-usage relationships. The GCN calculates

drug retrieval priority scores, which are then used to position

high-priority drugs in the most accessible compartments—

specifically, in the leftmost columns and bottom rows—thereby

reducing overall retrieval times.

Our proposed system builds upon recent advances in graph-

based ranking [2], [3], [4], [5]. By adapting these techniques to

the ADDS location assignment task, our approach achieves

significantly faster mean retrieval times and demonstrates

enhanced performance and consistency in drug dispensing

operations.

The remainder of this paper is organized as follows. Section

2 reviews related literature on storage optimization and graph-

based ranking methods. Section 3 formalizes the notation and

introduces Graph Neural Networks as the foundation of our

approach. Section 4 describes our research methodology,

including data preparation, graph construction, the GCN-based

ranking model, the design of the optimized placement matrix,

and the ADDS retrieval time model. Section 5 presents our

experimental results, which include comparisons with random

placement strategies and visualizations of drug embeddings.

Finally, Section 6 concludes the paper and outlines directions

for future work.

II. LITERATURE REVIEW

Healthcare facilities must arrange medications carefully in automated drug dispensing systems, a task known as the Storage Location Assignment Problem (SLAP). Optimizing storage positions supports safe medication delivery and improves workplace efficiency [6]. Reyes et al. show that the main difficulty is reconciling different medication distribution methods with efficient retrieval and space usage [7]. Hausman et al. [8] and Roodbergen and Vis [9] laid the foundation by identifying the main storage policies. Random storage places products anywhere in the ADDS, which saves space but lengthens the path to the items. Fixed-location storage makes products easy to identify but leaves space unused when certain items are understocked. In contrast, class-based storage groups products according to approved criteria into clear zones that help staff retrieve medicines faster and with fewer errors [10]. The cube-per-order index (COI) method was created to balance storage space against picking speed.

Fig 1 Internal structure of manipulator arm Automated Drug Dispensing

System [11]

A considerable body of research has applied mathematical models to storage optimization problems. Atmaca and Ozturk [12] established models to manage inventory costs, but these methods performed well only in controlled scenarios and scaled poorly. Chaker and Khalid [13] created distribution patterns that place different types of medicines far apart to help hospital workers avoid errors when picking orders. Esmaili et al. [14] proposed two mixed integer programming models, the first optimizing warehouse performance and the second also accounting for product placement requirements, and reported in 2018 that they outperformed prior methods.

More advanced formal methods have also been applied to this problem. Hachemi and Alla [15] used Petri nets to enforce safety protocols and optimize storage space without exceeding drawer capacity. Hachemi and Amari proposed a Min-Plus control method, together with a mathematical proof, to solve the SLAP while ensuring drug distribution effectiveness. Recent research by Bouhelassa et al. examined classical and fuzzy analytic hierarchy methods to improve how well medicines are stocked each month [16].

III. PRELIMINARIES

We begin by formalizing our notation for Information

Retrieval (IR) and Graph Neural Networks topics. In our

notation, bold symbols denote both matrices and vectors, with

uppercase letters representing matrices and lowercase letters

representing vectors (e.g., M, v). Scalars are denoted by simple

italic letters.

Let $G = (V, E)$ be an undirected graph, where $V$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. The number of nodes in $G$ is denoted by $n = |V|$, and the number of edges by $m = |E|$. The neighbourhood of a node $v$, $\mathcal{N}(v)$, has cardinality $|\mathcal{N}(v)|$. The diagonal matrix $\mathbf{D}$ is defined such that its $i$-th diagonal element is $|\mathcal{N}(v_i)|$.

The feature vector of node $v_i$ is represented by $\mathbf{x}_i \in \mathbb{R}^{d}$. These node representations are arranged into an instance matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, defined as:

\[ \mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix} \]

The set of edges can also be expressed as the adjacency matrix $\mathbf{A} \in \{0, 1\}^{n \times n}$, where $A_{ij} = 1$ if nodes $(v_i, v_j)$ are connected, and $A_{ij} = 0$ otherwise.

A. Graph Neural Networks

Graph Neural Networks (GNNs) are designed to learn and

extract features from graph-structured data. Given a collection of graphs $\{G_i\}_{i=1}^{N}$, where each graph $G_i = (V_i, E_i)$ has an instance matrix $\mathbf{X}_i$ and an adjacency matrix $\mathbf{A}_i$, GNNs utilize a message passing formalism to extract features from these datasets [17].

In this formalism, the instance matrix $\mathbf{X}$ for $G$ is iteratively updated through a series of layers during the forward pass. The intermediate representation at layer $l$ is denoted by $\mathbf{H}^{(l)}$ for $l = 1, \dots, L$, with the initial representation given by $\mathbf{H}^{(0)} = \mathbf{X}$. After $L$ layers, we obtain a latent feature matrix $\mathbf{H}^{(L)} \in \mathbb{R}^{n \times d'}$ for graph $G$. Given a set of weights $\mathbf{W}^{(l)}$ and parameters $\theta^{(l)}$ for each layer $l$, the message passing update rule is expressed as:

\[ \mathbf{H}^{(l+1)} = \mathrm{UPDATE}_{\mathbf{W}^{(l)}}\big(\mathbf{H}^{(l)}, \mathrm{AGGREGATE}_{\theta^{(l)}}(\mathbf{H}^{(l)}, \mathbf{A})\big) \]

where $\mathrm{UPDATE}$ and $\mathrm{AGGREGATE}$ represent the update and aggregation functions, respectively. Several message passing schemes exist that differ in the choice of these functions [18], [19], [20], [21]. The new node representations, denoted as $\mathbf{H}^{(L)}$,

obtained from this process are then used for downstream tasks

such as node classification, graph classification, or link

prediction.
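The message passing scheme can be illustrated with a small sketch in plain Python. The choices made here are assumptions for illustration only: the aggregation is a mean over neighbours, the update is a ReLU of a weighted sum, and scalar weights replace the weight matrices to keep the example short.

```python
def aggregate_mean(features, adjacency):
    """AGGREGATE: average the feature vectors of each node's neighbours."""
    n, d = len(features), len(features[0])
    messages = []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j] != 0]
        if not neighbours:
            messages.append([0.0] * d)  # isolated node receives no message
            continue
        messages.append(
            [sum(features[j][k] for j in neighbours) / len(neighbours) for k in range(d)]
        )
    return messages

def update(features, messages, w_self, w_neigh):
    """UPDATE: ReLU(w_self * h + w_neigh * m), element-wise, with scalar weights."""
    return [
        [max(0.0, w_self * h + w_neigh * m) for h, m in zip(h_row, m_row)]
        for h_row, m_row in zip(features, messages)
    ]

def message_passing(features, adjacency, layers):
    """Apply one aggregate/update round per (w_self, w_neigh) pair in `layers`."""
    h = features
    for w_self, w_neigh in layers:
        h = update(h, aggregate_mean(h, adjacency), w_self, w_neigh)
    return h

# Hypothetical path graph 0 - 1 - 2 with one feature per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
one_layer = message_passing(features, adjacency, [(1.0, 1.0)])
two_layers = message_passing(features, adjacency, [(1.0, 1.0), (1.0, 1.0)])
```

After one layer each node has mixed its own value with the mean of its direct neighbours; after two layers information from nodes two hops away has also reached it.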

B. Drug Ranking and Re-ranking

In our study, rather than a traditional information retrieval

scenario involving queries and documents, our objective is to

rank drugs based on their monthly consumption frequencies

and co-usage relationships. Instead of retrieving documents, we


assess drugs to determine which should be prioritized for

placement in the ADDS.

We adopt a two-stage ranking approach. First, an initial

ranking is generated using intrinsic drug features. Let $\mathbf{f} = [f_1, \dots, f_n]^\top$ denote the monthly consumption frequencies for $n$ drugs, forming the feature matrix $\mathbf{X} \in \mathbb{R}^{n \times 1}$ where $X_i = f_i$. Additionally, the co-usage matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$ captures pairwise co-usage relationships; an edge exists between drugs $i$ and $j$ if $C_{ij} > 0$, with the edge weight set to $C_{ij}$.

In the second stage, this initial ranking is refined by

employing a Graph Convolutional Network (GCN) that

aggregates local neighbourhood information via message

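passing. Specifically, the first GCN layer computes:

\[ \mathbf{H}^{(1)} = \mathrm{ReLU}\big(\mathrm{GCN}_1(\mathbf{X}, \mathbf{A})\big) \tag{1} \]

and the second layer refines these representations:

\[ \mathbf{H}^{(2)} = \mathrm{ReLU}\big(\mathrm{GCN}_2(\mathbf{H}^{(1)}, \mathbf{A})\big) \tag{2} \]

Finally, a fully connected layer projects these refined representations to a scalar ranking score for each drug:

\[ s_i = \mathbf{w}^\top \mathbf{h}_i^{(2)} + b \tag{3} \]

for $i = 1, \dots, n$, where $\mathbf{h}_i^{(2)}$ is the $i$-th row of $\mathbf{H}^{(2)}$. These scores capture both the intrinsic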

importance of a drug and the influence of its co-usage

relationships.
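A forward pass through two graph convolutional layers and a fully connected scoring layer can be sketched as below. Several points are assumptions rather than details stated in the text: the layers use the common symmetric normalization with self-loops, $D^{-1/2}(A + I)D^{-1/2}$, a ReLU follows both layers, and the weights are small hand-picked values instead of trained parameters.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(a):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(m):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_scores(x, a, w1, w2, w_fc, b_fc):
    """Two GCN layers followed by a linear layer that outputs one score per node."""
    a_norm = normalize_adjacency(a)
    h1 = relu(matmul(matmul(a_norm, x), w1))   # first graph convolution
    h2 = relu(matmul(matmul(a_norm, h1), w2))  # second graph convolution
    return [sum(h * w for h, w in zip(row, w_fc)) + b_fc for row in h2]

# Hypothetical data: 3 drugs, normalized frequencies and a weighted co-use graph.
x = [[0.9], [0.5], [0.1]]
a = [[0.0, 0.8, 0.0], [0.8, 0.0, 0.2], [0.0, 0.2, 0.0]]
w1 = [[1.0, 0.5]]     # 1 input feature -> 2 hidden units (untrained, illustrative)
w2 = [[1.0], [1.0]]   # 2 hidden units -> 1 output feature (untrained, illustrative)
scores = gcn_scores(x, a, w1, w2, w_fc=[1.0], b_fc=0.0)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy graph the frequently used drug 0 keeps the highest score, and drug 1 is pulled up by its strong co-use link with drug 0, which illustrates how neighbourhood information enters the ranking.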

The optimized placement matrix is then constructed by sorting

drugs in descending order of $s_i$ and assigning them to the most

accessible compartments in the ADDS. Specifically, the

leftmost columns minimize horizontal travel distance to the

retrieval station, while the bottom rows leverage gravitational

acceleration for faster vertical retrieval in the Free-Fall Flow

Rack design. This two-stage approach effectively refines the

initial ranking and results in a placement strategy that

minimizes retrieval times.
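The construction of the placement matrix can be sketched as follows. The accessibility order is an assumption: the I/O point is taken to be at the bottom-left corner, bins are ranked by the larger of their row and column offsets, and ties favour the leftmost column. The function names are hypothetical.

```python
def rank_bins(n_rows, n_cols):
    """List bins as (row, col) from most to least accessible.

    Row 0 is the bottom row and column 0 is the leftmost column.
    """
    bins = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return sorted(bins, key=lambda rc: (max(rc[0], rc[1]), rc[1], rc[0]))

def build_placement(scores, n_rows, n_cols):
    """Map each drug index to a bin, highest score to most accessible bin."""
    if len(scores) > n_rows * n_cols:
        raise ValueError("more drugs than bins")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bins = rank_bins(n_rows, n_cols)
    return {drug: bins[k] for k, drug in enumerate(order)}

# Hypothetical scores for 4 drugs placed in a 2x2 rack.
scores = [0.2, 0.9, 0.5, 0.7]
placement = build_placement(scores, n_rows=2, n_cols=2)
```

Since each bin holds exactly one drug, the mapping is one-to-one, and the drug with the highest score ends up in the bin next to the assumed I/O point.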

IV. METHODOLOGY

A. Problem Definition

The Automated Drug Dispensing System (ADDS) is a

cutting-edge solution designed to enhance the efficiency of

pharmaceutical storage and retrieval operations. In our

manipulator-arm-based ADDS, medications are stored in a rack

divided into discrete storage bins. Each bin has uniform

dimensions with a fixed length and height [11]. Drugs are

placed in these bins, and a robotic arm, operating from

designated input/output (I/O) points, retrieves medications by

following an optimal route determined by an intelligent control

algorithm.

The retrieval process begins at an I/O point, from which the

robotic arm travels to the bin containing the required drug and

then returns to the I/O point. For single-command operations,

the travel time is calculated as the maximum of the

horizontal and vertical travel times between the I/O point and

the target bin. For dual-command operations—where the arm

retrieves two bins sequentially—the travel time is computed as

the sum of the times for each leg of the journey. These

calculations assume a constant speed and ignore acceleration,

deceleration, and loading/unloading durations.
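The travel time model can be illustrated with a short sketch. Positions are given as (row, column) indices, and the bin dimensions and axis speeds are hypothetical values. For the dual-command case the legs are assumed to be I/O to first bin, first bin to second bin, and second bin back to I/O, which the text does not state explicitly.

```python
def leg_time(a, b, bin_width, bin_height, v_x, v_y):
    """Travel time between two (row, col) positions at constant speed.

    Both axes move simultaneously, so the slower axis determines the time.
    """
    t_x = abs(a[1] - b[1]) * bin_width / v_x
    t_y = abs(a[0] - b[0]) * bin_height / v_y
    return max(t_x, t_y)

def single_command_time(io, target, **rack):
    """One-way travel time between the I/O point and the target bin."""
    return leg_time(io, target, **rack)

def dual_command_time(io, first, second, **rack):
    """Sum of the legs I/O -> first bin -> second bin -> I/O (assumed route)."""
    return (
        leg_time(io, first, **rack)
        + leg_time(first, second, **rack)
        + leg_time(second, io, **rack)
    )

# Hypothetical rack parameters: unit-sized bins, horizontal speed twice the vertical speed.
rack = dict(bin_width=1.0, bin_height=1.0, v_x=2.0, v_y=1.0)
t_single = single_command_time((0, 0), (3, 4), **rack)
t_dual = dual_command_time((0, 0), (3, 4), (1, 6), **rack)
```

Acceleration, deceleration and loading or unloading times are ignored, in line with the simplifying assumptions of the model.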

Our goal is to optimize the assignment of drugs to storage bins such that overall retrieval times are minimized. This optimization considers two key data sources: the monthly consumption frequencies and the co-use matrix.

Prescription data, generated from these two data sources, is used to evaluate the performance of the placement strategy. The objective is to assign drugs to the most accessible compartments, specifically the bins in the leftmost columns and the bottom rows, so that frequently used and commonly co-prescribed medications can be retrieved quickly. The problem is constrained by the uniformity of bin dimensions, the rule that each bin holds only one drug, and the predefined movement model of the robotic arm.

B. Data Preparation

Our dataset comprises three key components:

• Drug Frequencies: Monthly consumption data indicating how often each drug is used.
• Co-Use Matrix: A matrix with values between 0 and 1 that quantifies the co-usage relationships between drugs.
• Prescription Data: Generated based on the drug frequencies and co-use matrix, each prescription lists the drugs that are to be retrieved.

To evaluate the performance of the optimized drug

placement, simulated prescription data is generated to mimic

realistic usage scenarios. A total of 200 prescriptions are

synthesized, with each prescription containing 3 distinct

medications. The generation algorithm uses the normalized consumption frequencies of the drugs. If $f_i$ represents the consumption frequency of drug $i$, then the normalized frequency is given by:

\[ \tilde{f}_i = \frac{f_i}{\sum_{j=1}^{n} f_j} \tag{4} \]

This determines the probability of selecting drug $i$ as the first medication:

\[ P(i) = \tilde{f}_i \tag{5} \]

For subsequent drugs, the selection probability (6) is adjusted based on the co-use matrix $\mathbf{C}$ and conditioned on $S$, the set of drugs already selected. This method

ensures that drugs with strong co-usage relationships are more

likely to appear together in the generated prescriptions.

Performance is evaluated based on key metrics derived from

the optimized placement, which will be detailed in the Results

section. These data sources are preprocessed and normalized to

ensure consistency before further analysis.
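The prescription generator can be sketched as below. The first drug is drawn with probability equal to its normalized frequency. The weighting rule for the following drugs is an assumption made for this sketch (normalized frequency multiplied by one plus the summed co-use with the drugs already selected), and the frequencies, co-use values and seed are hypothetical.

```python
import random

def normalize(freqs):
    """Normalized consumption frequencies (they sum to 1)."""
    total = sum(freqs)
    return [f / total for f in freqs]

def generate_prescription(freqs, co_use, size, rng):
    """Draw `size` distinct drugs, favouring frequent and co-used ones."""
    p = normalize(freqs)
    n = len(freqs)
    selected = [rng.choices(range(n), weights=p)[0]]  # first drug: normalized frequency
    while len(selected) < size:
        candidates = [j for j in range(n) if j not in selected]
        # Assumed rule: boost candidates that are co-used with the drugs already selected.
        weights = [p[j] * (1.0 + sum(co_use[i][j] for i in selected)) for j in candidates]
        selected.append(rng.choices(candidates, weights=weights)[0])
    return selected

def generate_prescriptions(freqs, co_use, count=200, size=3, seed=0):
    """Generate `count` prescriptions of `size` distinct drugs each."""
    rng = random.Random(seed)
    return [generate_prescription(freqs, co_use, size, rng) for _ in range(count)]

# Hypothetical data for 4 drugs.
freqs = [120, 80, 40, 10]
co_use = [
    [0.0, 0.6, 0.0, 0.1],
    [0.6, 0.0, 0.3, 0.0],
    [0.0, 0.3, 0.0, 0.2],
    [0.1, 0.0, 0.2, 0.0],
]
prescriptions = generate_prescriptions(freqs, co_use)
```

The defaults of 200 prescriptions with 3 distinct medications each follow the setup described above; fixing the seed makes the simulated workload reproducible across placement strategies.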

C. Graph Construction

In our study, the graph is constructed to represent the

relationships among 150 drugs. Each drug is modeled as a node,

and its feature vector is derived from its monthly consumption


frequency. Specifically, let $f_i$ denote the frequency of drug $i$; then the feature vector for drug $i$ is simply $x_i = [f_i]$.

The co-use matrix $C$, a $150 \times 150$ matrix, quantifies the co-usage relationship between drugs. An edge is established between drug $i$ and drug $j$ if the corresponding entry in the co-use matrix, $C_{ij}$, is greater than zero. The edge weight is set to $C_{ij}$,

which reflects the strength of the co-usage relationship. Since

the matrix is sparse, only significant co-usage relationships are

represented as edges.

The resulting graph $G = (V, X, E, A)$ is undirected and does not include self-loops. Here,

• $V$ is the set of nodes (drugs), with $|V| = 150$.

• $X \in \mathbb{R}^{150 \times 1}$ is the instance matrix, where each row corresponds to the consumption frequency of a drug.

• $E$ is the set of edges defined by the nonzero entries in the co-use matrix.

• $A \in \mathbb{R}^{150 \times 150}$ is the weighted adjacency matrix, where $A_{ij} = C_{ij}$ if $(i, j) \in E$ and $A_{ij} = 0$ otherwise. This preserves the co-usage strength between drugs.
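As an illustration, the construction above can be written in a few lines of Python. The sketch assumes the co-use matrix is symmetric (only the upper triangle is read), and the function name `build_graph` is hypothetical.

```python
def build_graph(co_use):
    """Build an undirected weighted graph from a symmetric co-use matrix.

    Returns (edges, adjacency): `edges` lists (i, j, weight) with i < j,
    and `adjacency` is the weighted adjacency matrix without self-loops.
    """
    n = len(co_use)
    adjacency = [[0.0] * n for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # j > i, so diagonal entries are skipped
            weight = co_use[i][j]
            if weight > 0:
                edges.append((i, j, weight))
                adjacency[i][j] = weight
                adjacency[j][i] = weight
    return edges, adjacency


# Hypothetical 3-drug example: only drugs 0 and 1 are co-used.
toy_edges, toy_adjacency = build_graph([
    [0.0, 0.4, 0.0],
    [0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])
```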

This graph structure captures the underlying relationships

among drugs, which is crucial for the subsequent learning of

effective representations using Graph Neural Networks [17],

[18], [19], [20], [21]. This graph construction method is fully

scalable and can be applied consistently regardless of the

ADDS dimensions.

B. GCN-Based Ranking Model

We employ a Graph Convolutional Network (GCN) [19] to

learn latent representations of drugs and compute their ranking

scores. In our model, the input feature matrix $X$ (derived from

the monthly consumption frequencies) is combined with the

weighted edge structure from the co-use matrix. The GCN

aggregates local neighborhood information through a series of

message passing layers, enabling each drug's representation to

be informed by its own frequency as well as by the co-usage

relationships with its neighbors.

Specifically, the first graph convolutional layer applies a linear transformation followed by a nonlinear activation (ReLU) to produce an intermediate representation $H^{(1)}$. This layer aggregates features from neighboring nodes, weighted by the co-use values, as in equation (1). A second graph convolutional layer further refines these representations (equation (2)). Finally, a fully connected layer projects the refined representations to a scalar ranking score for each drug (equation (3)). These ranking

scores, which capture both the intrinsic drug frequency and the

influence of co-usage relationships, are then used to prioritize

drugs for placement within the ADDS.
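The forward pass described above can be sketched in plain Python. This is an illustrative sketch, not the trained model: it assumes the symmetric normalization with self-loops from Kipf and Welling [19], a ReLU after both graph convolutions, and hypothetical fixed weights in place of learned ones.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the GCN propagation matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_scores(adj, x, w0, w1, w_out):
    """Two graph convolutions followed by a linear scoring layer."""
    a_norm = normalize_adjacency(adj)
    h1 = relu(matmul(matmul(a_norm, x), w0))   # first GCN layer
    h2 = relu(matmul(matmul(a_norm, h1), w1))  # second GCN layer
    return [row[0] for row in matmul(h2, w_out)]  # one score per drug


# Hypothetical toy graph with 3 drugs and fixed (untrained) weights.
adj = [[0.0, 0.8, 0.0], [0.8, 0.0, 0.3], [0.0, 0.3, 0.0]]
x = [[0.5], [0.3], [0.2]]                       # normalized frequencies
w0 = [[1.0, 0.5]]                               # 1 -> 2 hidden units
w1 = [[0.6, 0.2], [0.1, 0.7]]                   # 2 -> 2 hidden units
w_out = [[1.0], [1.0]]                          # 2 -> 1 score
scores = gcn_scores(adj, x, w0, w1, w_out)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In practice the weights would be learned with a framework such as PyTorch; the pure-Python version only makes the message-passing arithmetic explicit.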

C. Optimized placement matrix

Based on the computed ranking scores, drugs are sorted in

descending order. The optimized placement matrix is

constructed by mapping this sorted order into the physical

layout of the storage rack. In our system, the most accessible

compartments are located in the leftmost columns and bottom

rows. Thus, the highest ranked drugs are assigned to these

positions, thereby minimizing retrieval times.
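A minimal sketch of this mapping is given below. The fill order is an assumption made for illustration: bins are filled column by column from the left, and within each column from the bottom row upward, with row 0 taken as the bottom row. The function name `build_placement` is hypothetical.

```python
def build_placement(scores, n_rows, n_cols):
    """Map drugs to bins so that higher scores get more accessible bins.

    Returns a matrix placement[row][col] holding a drug index, or None
    for bins left empty when there are fewer drugs than bins.
    """
    # Drug indices sorted by descending ranking score.
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    # Assumed accessibility order: leftmost column first, bottom row first.
    positions = [(r, c) for c in range(n_cols) for r in range(n_rows)]
    placement = [[None] * n_cols for _ in range(n_rows)]
    for drug, (r, c) in zip(ranked, positions):
        placement[r][c] = drug
    return placement


# Four drugs in a 2 x 2 rack: drug 1 has the highest score.
example = build_placement([0.1, 0.9, 0.5, 0.7], n_rows=2, n_cols=2)
```

Sorting the positions by travel time from the I/O point instead would be an equally valid choice when the bin width and height differ.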

D. ADDS and Retrieval Time Model

The Automated Drug Dispensing System (ADDS) is a cutting-

edge solution designed to enhance the efficiency of

pharmaceutical storage and retrieval processes. Existing

research on order picking assumes that orders have already

been assigned to the machines. The system comprises a storage

rack divided into discrete bins, each uniquely identified by

coordinates $(m, n)$, where $m$ represents the row and $n$ the column. Each bin has fixed dimensions (see Table 1), ensuring a standardized storage environment. Drugs are

stored within these bins, and a robotic arm is employed to

execute pick-and-retrieve operations guided by an intelligent

control algorithm that computes the optimal route for retrieval

[11].

The picking process initiates at an input/output (I/O) point,

which in our study is fixed at $(x, y)$ (see Table 1). In our setup, we work

with a single rack where drugs are stored in discrete bins. From

this I/O point, the robotic arm travels to the bin containing the

required drug and then returns to the I/O point. The single-

command retrieval time for the robotic arm traveling from the

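I/O point to a bin is expressed as:

$T_{SC}(m, n) = \max\left(\dfrac{|n - x|\, d_n}{v_h},\ \dfrac{|m - y|\, d_m}{v_v}\right)$ (7)

where $(m, n)$ are the row and column of the bin and $(x, y)$ is the I/O point.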

Fig 2: Dual-command cycle operation of the robotic arm.

For dual-command operations as in Fig 2, where the arm

retrieves two bins sequentially before returning to the I/O point,

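the travel time is calculated as:

$T_{DC} = \max\left(\dfrac{|n_1 - x|\, d_n}{v_h},\ \dfrac{|m_1 - y|\, d_m}{v_v}\right) + \max\left(\dfrac{|n_2 - n_1|\, d_n}{v_h},\ \dfrac{|m_2 - m_1|\, d_m}{v_v}\right) + \max\left(\dfrac{|x - n_2|\, d_n}{v_h},\ \dfrac{|y - m_2|\, d_m}{v_v}\right)$ (8)

where $(m_1, n_1)$ and $(m_2, n_2)$ are the two bins visited.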

Table 1 System parameters.

Parameter | Value/Description
$d_n$ (Bin Length) | 0.168 m
$d_m$ (Bin Height) | 0.275 m
$v_{arm}$ (Robotic Arm Speed) | 0.1486 m/s
$v_h$ (Horizontal Speed) | Typically equal to $v_{arm}$
$v_v$ (Vertical Speed) | Typically equal to $v_{arm}$
I/O Point Location | $(x, y)$: starting/ending point for retrieval
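The travel-time model can be illustrated with the Table 1 parameters. The sketch below rests on two assumptions that are not confirmed by the text: the horizontal and vertical axes move simultaneously, so each leg takes as long as its slower axis, and the I/O point sits at row 0, column 0. The function names are hypothetical.

```python
D_N = 0.168    # bin length in metres (Table 1)
D_M = 0.275    # bin height in metres (Table 1)
V_H = 0.1486   # horizontal speed in m/s (taken equal to the arm speed)
V_V = 0.1486   # vertical speed in m/s (taken equal to the arm speed)


def leg_time(start, end):
    """Travel time in seconds between two (row, column) positions.

    Assumes simultaneous horizontal and vertical motion, so the leg
    lasts as long as the slower of the two axes.
    """
    (r1, c1), (r2, c2) = start, end
    horizontal = abs(c2 - c1) * D_N / V_H
    vertical = abs(r2 - r1) * D_M / V_V
    return max(horizontal, vertical)


def single_command(bin_pos, io=(0, 0)):
    """One-way travel time from the I/O point to a bin."""
    return leg_time(io, bin_pos)


def dual_command(bin1, bin2, io=(0, 0)):
    """I/O point -> first bin -> second bin -> I/O point."""
    return leg_time(io, bin1) + leg_time(bin1, bin2) + leg_time(bin2, io)


# Bin at row 2, column 3: the vertical leg (0.55 m) dominates.
t_single = single_command((2, 3))
t_dual = dual_command((2, 3), (0, 5))
```

Averaging `dual_command` or `single_command` over the drugs of each simulated prescription gives one retrieval-time figure per placement, which is the kind of quantity compared in the Results section.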

V. RESULTS AND DISCUSSION

In this section, we present and analyze the outcomes of our

GCN-based ranking approach for optimizing drug placement in

the ADDS (Fig 3). Our evaluation compares the optimized

placement against 30 random placements using several visual

and statistical measures.

It is important to note that the proposed approach is fully

adaptable to any ADDS, regardless of its specific architecture.

In systems employing dual-command operations, such as the

manipulator-arm-based design discussed here, our method

effectively leverages both consumption and co-usage data to

optimize drug placement. Similarly, in Free-Fall-Flow-Rack

AS/RS automated drug dispensing systems, the prescription

retrieval speed is determined by the maximum travel time

among the drugs in a prescription [1]. In such cases, our dual-

data approach remains equally applicable, as it naturally

identifies the critical drug whose retrieval time governs the

overall performance. This inherent flexibility underscores the

broad applicability and robustness of our method across various

ADDS designs.

Fig 4: Mean retrieval times for the optimized placement versus 30 random

placements.

Fig 4 shows a bar chart of the mean retrieval times. The optimized placement achieves a lower average retrieval time than any of the 30 random configurations. This reduction in mean retrieval time indicates that incorporating drug consumption frequency and co-usage data into the GCN framework leads to more efficient retrieval operations.

Fig 3: Optimized ADDS Placement Matrix

Fig 5: Drug embeddings visualized via t-SNE, colored by GCN-assigned

ranking scores.

The ranking behavior of the GCN becomes more apparent in the t-SNE embedding shown in Fig 5. Each point in the scatter plot is a drug, colored by its ranking score. Drugs with higher ranking scores (warmer colors) lie in the upper-left portion of the plot, whereas lower-scoring drugs (cooler colors) lie in the bottom-right section. The data points are evenly distributed across the plot, which suggests that the GCN captures drug consumption relationships and co-usage patterns.

Fig 6: Retrieval time distribution for the optimized placement versus the top

five random placements.

Further, a box-and-whisker plot (Fig 6) compares the

distribution of retrieval times for the optimized placement and

the top five best-performing random placements. The

optimized method not only has a lower median retrieval time

but also exhibits reduced variability, with fewer extreme

outliers. This robustness in performance suggests that our

approach provides both consistent and reliable improvements

in retrieval efficiency.

Overall, these results confirm that our GCN-based ranking

model effectively leverages drug consumption frequency and

co-usage data to generate an optimized placement strategy for

ADDS. The reduction in average retrieval time, approximately 27% relative to the average random placement, demonstrates the practical benefits of our

method. The clear separation in the embedding space, as shown

by the t-SNE visualization, further validates that the GCN

captures meaningful relationships among drugs. By aligning

drug placement with actual usage patterns, our approach holds

the potential to reduce wait times, minimize errors, and enhance

the overall efficiency of pharmaceutical distribution in

healthcare settings.

VI. CONCLUSION

In this study we presented a GCN-based ranking approach

for optimizing drug placement within an Automated Drug

Dispensing System (ADDS). By integrating monthly

consumption frequencies and co-use relationships, our model

learns a meaningful embedding space and assigns higher scores

to drugs that require faster access. This ranking translates into

an optimized placement matrix that ensures that frequently used

or strongly co-used drugs are located in compartments that

minimize retrieval times.

Comparisons with 30 randomly generated placements

confirm the effectiveness of our strategy. In particular, our

method outperforms the average random placement by

approximately 27%. We choose to highlight this comparison

with the average random approach because it offers a robust

benchmark that reflects typical random performance, rather

than a single best-case scenario. This improvement underscores

the GCN's ability to capture and leverage important drug

interactions for practical gains in retrieval speed and

operational consistency.

Overall, the proposed solution demonstrates the potential of graph-based models to enhance storage and retrieval efficiency in healthcare environments. Future work could explore more

advanced architectures or incorporate additional drug attributes

to further refine the ranking process and adapt to evolving

clinical demands.

Funding Declaration: This research received no specific grant

from any funding agency in the public, commercial, or not-for-

profit sectors.

Ethics and Consent to Participate: Not applicable.

Competing Interests: The authors declare that they have no

competing interests.

Software and Tools: The machine learning models were

developed using PyTorch (version 2.6.0), an open-source deep

learning framework. The authors also used Python and various

other libraries to process and analyze the data.

REFERENCES

[1] D. Metahri and K. Hachemi, “A Performance Comparison of Manual

Dispensing and Automated Drug Delivery,” IJARPHM, vol. 5, no. 1,

pp. 1–13, Jan. 2020, doi: 10.4018/IJARPHM.2020010101.

[2] S. MacAvaney, N. Tonellotto, and C. Macdonald, “Adaptive Re-

Ranking with a Corpus Graph,” in Proceedings of the 31st ACM

International Conference on Information and Knowledge Management

(CIKM ’22), ACM, 2022. doi: 10.1145/3511808.3557231.

[3] P. Veličković, “Everything is Connected: Graph Neural Networks,”

Current Opinion in Structural Biology, vol. 79, p. 102538, 2023, doi:

10.1016/j.sbi.2023.102538.

[4] L. Pang, J. Xu, Q. Ai, Y. Lan, X. Cheng, and J. Wen, “Setrank:

Learning a permutation-invariant ranking model for information

retrieval,” in Proceedings of the 43rd International ACM SIGIR

Conference on Research and Development in Information Retrieval,

2020, pp. 499–508.

[5] Q. Wu et al., “Dual Graph Attention Networks for Deep Latent

Representation of Multifaceted Social Effects in Recommender

Systems,” in The World Wide Web Conference, ACM, 2019. doi:

10.1145/3308558.3313442.

[6] J. C.-H. Pan, P.-H. Shih, M.-H. Wu, and J.-H. Lin, “A storage

assignment heuristic method based on genetic algorithm for a pick-and-

pass warehousing system,” Computers & Industrial Engineering, vol.

81, pp. 1–13, Mar. 2015, doi: 10.1016/j.cie.2014.12.010.

[7] J. Reyes, E. Solano-Charris, and J. Montoya-Torres, “The storage

location assignment problem: A literature review,” International

Journal of Industrial Engineering Computations, vol. 10, no. 2, pp.

199–224, 2019, Accessed: Sep. 30, 2023. [Online]. Available:

http://growingscience.com/beta/ijiec/2956-the-storage-location-

assignment-problem-a-literature-review.html

[8] W. H. Hausman, L. B. Schwarz, and S. C. Graves, “Optimal Storage

Assignment in Automatic Warehousing Systems,” Management

Science, vol. 22, no. 6, pp. 629–638, 1976, doi: 10.1287/mnsc.22.6.629.

[9] K. J. Roodbergen and I. F. A. Vis, “A survey of literature on automated

storage and retrieval systems,” European Journal of Operational

Research, vol. 194, no. 2, pp. 343–362, Apr. 2009, doi:

10.1016/j.ejor.2008.01.038.

[10] V. R. Muppani and G. K. Adil, “A branch and bound algorithm for

class based storage location assignment,” European Journal of

Operational Research, vol. 189, no. 2, pp. 492–507, 2008, Accessed:

Jan. 09, 2025. [Online]. Available:

https://econpapers.repec.org/article/eeeejores/v_3a189_3ay_3a2008_3ai

_3a2_3ap_3a492-507.htm

[11] M. Yuan, N. Zhao, K. Wu, and Z. Chen, “The storage location

assignment problem of automated drug dispensing machines,”

Computers & Industrial Engineering, vol. 184, p. 109578, Oct. 2023,

doi: 10.1016/j.cie.2023.109578.

[12] E. Atmaca and A. Ozturk, “Defining order picking policy: A storage

assignment model and a simulated annealing solution in AS/RS

systems,” Applied Mathematical Modelling, vol. 37, no. 7, pp. 5069–

5079, Apr. 2013, doi: 10.1016/j.apm.2012.09.057.

[13] A. Chaker and K. Hachemi, “Evaluation des performances et pilotage

d’une armoire automatisée de dispensation de médicaments Mots clés,”

presented at the Congrès Lambda Mu 21 « Maîtrise des risques et transformation numérique : opportunités et menaces », Oct. 2018.

Accessed: Sep. 17, 2023. [Online]. Available: https://hal.science/hal-

02074918

[14] N. Esmaili, B. A. Norman, and J. Rajgopal, “Shelf-space optimization

models in decentralized automated dispensing cabinets,” Operations

Research for Health Care, vol. 19, pp. 92–106, Dec. 2018, doi:

10.1016/j.orhc.2018.03.005.

[15] K. Hachemi and H. Alla, “Affectation de médicaments dans un système

automatisé de dispensation de médicaments : approche basée sur la synthèse de contrôleur par réseau de Petri,” Oct. 2013.

[16] K. Hachemi and S. Amari, “Analytical solving of the storage location

assignment problem in drug dispensing systems based on a Min-Plus

control approach,” Journal of Control Engineering and Applied

Informatics, vol. 26, no. 3, Art. no. 3, Sep. 2024, doi:

10.61416/ceai.v26i3.9015.

[17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl,

“Neural Message Passing for Quantum Chemistry,” arXiv preprint

arXiv:1704.01212, 2017.

[18] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation

Learning on Large Graphs,” in Proceedings of the 31st International

Conference on Neural Information Processing Systems (NIPS’17),

Curran Associates Inc., 2017, pp. 1025–1035.

[19] T. N. Kipf and M. Welling, “Semi-Supervised Classification with

Graph Convolutional Networks,” arXiv preprint arXiv:1609.02907,

2017.

[20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y.

Bengio, “Graph Attention Networks,” arXiv preprint

arXiv:1710.10903, 2018.

[21] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How Powerful are Graph

Neural Networks?,” arXiv preprint arXiv:1810.00826, 2019.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Brain Tumor Detection of MRI Images Using

CNN Features Extraction and SVM Classification

Zouhir Iourzikene#1, Fawzi Gougam #2, Djamel Benazzouz#3

# Laboratoire Mécanique des Solides et Systèmes (LMSS), Faculté de

Technologie, Université M’Hamed BOUGARA de Boumerdes, 35000 Boumerdes,

Algeria

1 z.iourzikene@univ-boumerdes.dz

2 f.gougam@univ-boumerdes.dz

3 d.benazzouz@univ-boumerdes.dz

Abstract— Brain tumors are cellular growths in the brain that can

be benign (non-cancerous) or malignant (cancerous). A tumor may originate in the brain or reach it by metastasizing from another part of the body. The classification of Magnetic Resonance Imaging (MRI) brain images has become an important topic in medical research, and many recent works use machine learning methods to build medical prediction and diagnosis systems. This paper develops a method to detect brain tumors from MRI brain images: a ResNet50 convolutional neural network is used for feature extraction, and several Support Vector Machine (SVM) classifiers are trained on the extracted features to predict brain tumors.

Keywords— Brain tumor; MRI; Feature extraction; Support

vector machines; Deep learning

I. INTRODUCTION

Brain tumors represent a heterogeneous group of central

nervous system tumors. The World Health Organization

(WHO) classifies approximately one hundred different types of

brain tumors based on pathological diagnosis [1]. These tumors

can be broadly classified as malignant or benign, and the WHO defines a grading system from grade I to grade IV. Tumors of grades I and II are considered benign or

low grade, while tumors of grades III through IV are malignant

or high grade [2].

Different diagnostic techniques are used to obtain

information about the tumor. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are currently the best methods for identifying normal and abnormal cells growing in the brain. CT combines X-rays and computer processing to produce axial slices of the brain for diagnosis [3].

Image processing techniques (contrast, segmentation,

filtering, mathematical morphology ...) allow the extraction of

important information and characteristics (contours, edge

detection, object detection ...). This information can guide and

monitor interventions after the detection and localization of the

disease, to plan and treat the disease efficiently [4].

Early detection of malignant tumors (clusters of cancer cells) plays an essential role in cancer diagnosis: it reveals lesions at a more curable stage, before the disease is advanced, which allows lighter and more effective treatment and may improve long-term survival. For this purpose, Machine Learning (ML) techniques have been developed to create algorithms that receive input data and use statistical analysis to predict an output. ML is a class of algorithms that gives computers the ability to learn without being explicitly programmed [5]. More recently, the limited accuracy of conventional predictive models and the critical nature of medical data analysis have pushed researchers toward new methods for detecting brain tumors with improved accuracy. Deep Learning (DL), a subfield of machine learning, has attracted much attention for its ability to provide effective and more accurate predictive models [6]. DL algorithms use many layers of units linked together by connections (synapses). Information is processed by propagating activations through these layers, with a unit firing when its activation exceeds a certain threshold [7]. Deep learning is used in multiple

settings, including image recognition, language processing,

robotics, speech recognition, and bioinformatics.

Image classification by machine learning, neural networks, and deep learning relies on a bag of features extracted from the images; the more informative the selected deep features, the better the accuracy. Deep learning has been widely used in recent years because it gives accurate results. In this paper, two methods are developed to help medical staff, such as surgeons and radiologists, diagnose brain cancer from MRI images. The first method classifies MRI images as healthy or tumor using ResNet50. Our database contains images with and without tumors; 70% of them are used for training and 30% for testing. The first method identifies images with tumors with an accuracy of up to 89%. Because a more accurate identification is needed, the second approach is a hybrid method combining ResNet50 and a Support Vector Machine. This combination is motivated by two observations: the convolutional neural network provides a richer bag of features than other methods, and the SVM gives better classification results. Different SVM classifiers are used to detect and classify the MRI images from the feature bags obtained by ResNet50. Linear SVM and Quadratic SVM both reach an accuracy of 92%, Cubic SVM reaches 93%, and Medium Gaussian SVM gives the best classification with an accuracy of 94% [8, 9]. All these classifiers are trained on a bag of features in the form of a matrix of size [546 1000], whose rows correspond to the training images processed by ResNet50 and whose columns correspond to the features extracted by ResNet50. The data are partitioned into 70% for training and 30% for testing.

Before classification, features are extracted by the ResNet50 Convolutional Neural Network (CNN) to obtain the bag of features in matrix form, which is then passed to the SVM to improve classification accuracy. The MRI images are first preprocessed: the contrast is adjusted from the histogram and the images are padded to improve edge filtering. Then, Otsu segmentation selects an automatic threshold from the histogram, separating each image into two classes, black and white. Finally, mathematical morphology is applied as an erosion followed by a dilation with a gamma four structuring element. Pixel subtraction is then used to detect the contours of the MRI images, and the outer boundaries of the objects, as well as the boundaries of the holes inside these objects, are traced in the binary image [10, 11].

II. RELATED WORKS

Recent studies have increasingly focused on applying AI

techniques, especially DL and ML methods, to enhance brain

tumor detection and classification from MRI images.

Biswas et al. [12] introduced a hybrid model combining a deep CNN with an SVM to improve brain tumor classification.

Their approach incorporated comprehensive preprocessing

steps, including image resizing, noise reduction through

anisotropic diffusion filtering, and contrast enhancement via

adaptive histogram equalization, along with data augmentation

to increase variability. The deep CNN automatically extracted

meaningful features, which were then classified using SVM.

Evaluated on the Figshare dataset, their method achieved 96%

accuracy, surpassing several transfer learning models such as

AlexNet, GoogLeNet, and VGG16, while being more

computationally efficient.

Suryawanshi et al. [13] proposed a hybrid approach

combining CNNs with the pre-trained VGG19 model for

feature extraction and an SVM classifier for multiclass brain

tumor classification. Their method, tested on BRATS and

Sartaj datasets, demonstrated high accuracy in distinguishing

between different tumor types, showcasing the effectiveness of

combining deep learning with SVMs for medical image

classification.

Basthikodi et al. [14] developed a multiclass brain tumor

classification method by combining SVM with feature

extraction techniques (HOG, LBP) and dimensionality

reduction using PCA. Using a Kaggle dataset with four tumor

types, their model achieved an accuracy of 96.03%. The

integration of HOG, LBP, and PCA enhanced classification

performance and reduced overfitting, making the approach

more efficient and robust.

Özkaraca et al. [15] introduced a new DL architecture that

combines the strengths of transfer learning models such as

DenseNet, VGG16, and basic CNNs, while overcoming their

limitations in brain tumor classification. Their model achieved

an accuracy of 98.5%, although it required higher

computational resources. They used an 80-20 data split and 10-

fold cross-validation to validate their results.

Samar M. Alqhtani [16] proposed an automated

method for segmenting and classifying brain tumors in MRI

images, which included preprocessing using CLAHE and

diffusion filtering. Tumor segmentation was performed using

Fuzzy C-Means (FCM), followed by classification with SVM.

Tested on the CE-MRI database, the method achieved an

accuracy of 98.2%, sensitivity of 0.977, specificity of 0.979,

and a Dice score of 0.961. It also demonstrated a fast processing

time of 0.42 seconds, outperforming existing techniques in both

accuracy and speed.

The study by Soheila Saeedi [17] aimed at early brain tumor

detection, using DL and ML methods. A dataset of 3,264 MRI

images was used to classify glioma, meningioma, pituitary

gland tumors, and healthy brains. The study developed a 2D

CNN and a convolutional auto-encoder network, achieving an

accuracy of 96.47% for the CNN and 95.63% for the auto-

encoder. Six machine learning methods were also tested, with

K-Nearest Neighbors (KNN) achieving the highest accuracy of

86%. The results showed that the 2D CNN outperformed the

other methods, with an area under the ROC curve of 0.99 or 1,

indicating its efficiency and reliability for clinical brain tumor

detection.

This body of work highlights the promising potential of AI-

driven approaches in the early detection and classification of

brain tumors, showcasing a variety of models and techniques

with impressive accuracy and efficiency.

III. SVM AND CNN OVERVIEW

A. SVM

SVM is a supervised learning algorithm used in many classification and regression problems [18], with applications such as medical signal processing, natural language processing, speech recognition, and image recognition [19]. For classification, the best separator is the hyperplane with the largest margin between the two classes, indicated by the plus and minus signs in Fig. 1.

The margin is the maximum width of the space that is parallel

to the hyperplane and contains no data points. The algorithm

can only find such hyperplanes for linearly separable problems.

For most practical problems, the algorithm maximizes the soft

margin, allowing a small amount of misclassification [20].
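The soft-margin idea can be made concrete with the standard linear SVM objective, which trades the margin width against hinge-loss penalties for points that violate the margin. The sketch below only evaluates that objective for a given hyperplane; it does not train a classifier, and the toy points and the parameter `c` are hypothetical.

```python
def decision(w, b, x):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, xs, ys, c=1.0):
    """0.5 * ||w||^2 + c * (sum of hinge losses); labels are +1 or -1."""
    hinge = sum(max(0.0, 1.0 - y * decision(w, b, x)) for x, y in zip(xs, ys))
    return 0.5 * sum(wi * wi for wi in w) + c * hinge


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / sum(wi * wi for wi in w) ** 0.5


# Toy data: two points per class, separated by the line x1 + x2 = 2.
xs = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
ys = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0
objective = soft_margin_objective(w, b, xs, ys)
width = margin_width(w)
```

With this hyperplane every point satisfies the margin, so the hinge term is zero and the objective reduces to 0.5 * ||w||^2; a larger `c` penalizes misclassified points more heavily.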

Fig. 1. Defining the "margin" between classes: the criterion that SVM seeks to

optimize

B. CNN

Deep learning models tend to work well with large amounts

of data, whereas more traditional machine learning models stop

improving after a saturation point.

A CNN is a network architecture for deep learning that learns directly from data, without manual feature extraction (Fig. 2) [21]. CNNs are particularly useful for finding patterns in images to recognize objects, faces, and scenes. They can also be very effective for classifying data other than images [22].

Fig. 2. Convolutional neural network that automatically learns features and

classifies objects

IV. METHODOLOGY

The procedure diagram for the classification of MRI images

of the brain is presented in Fig. 3. The brain tumor detection

algorithm is implemented over three steps. Step 1: image

processing (pre-processing, segmentation, contour detection of

objects). Step 2: feature extraction using CNN (ResNet50). Step

3: detection based on the SVM classifier.

Fig. 3. MRI image feature selection diagram by ResNet50 and classification

based on the algorithm of different SVM methods

A. Image processing

Database

Two sets of brain MRI images in JPG format are used, 390 without tumor and 390 with tumor, extracted from the Kaggle database [23]. Fig. 4.a shows healthy brain images and Fig. 4.b shows brain images with tumor.

Fig. 4. Sample brain MRI images

Pre-processing

This step improves the image quality and extracts useful information such as edges, using mathematical morphology operations and pixel subtraction. First, the MRI image is converted to grayscale and padded to 3*3 size to ensure the best filtering. Mathematical morphology based on dilation and erosion with a gamma four structuring element (Γ4) is then applied, followed by pixel subtraction for edge detection.

Mathematical morphology

Morphology is a large set of image-processing operations that are used to extract the boundaries and skeletons of objects in an image [24]. Image contours can be detected with erosion and dilation [25]. In this paper, the contours of the skull are extracted by applying dilation and erosion with the same gamma four structuring element. Pixel subtraction is then applied: the eroded image is subtracted from the dilated image, which reveals the contours as shown in Fig. 5.b.
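The contour extraction can be illustrated on a small image in plain Python. The sketch assumes that the gamma four structuring element is the 4-connected cross (a pixel plus its upper, lower, left and right neighbours) and ignores neighbours that fall outside the image; the helper names are hypothetical.

```python
# Assumed 4-connected structuring element: centre plus 4 direct neighbours.
CROSS_4 = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def morph(img, pick):
    """Apply `pick` (max or min) over the cross-shaped neighbourhood."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [img[r + dr][c + dc] for dr, dc in CROSS_4
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out_row.append(pick(window))
        out.append(out_row)
    return out


def dilate(img):
    return morph(img, max)


def erode(img):
    return morph(img, min)


def contour(img):
    """Dilated image minus eroded image (morphological gradient)."""
    dilated, eroded = dilate(img), erode(img)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]


# 5 x 5 test image with a bright 3 x 3 square in the middle.
image = [[255 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
         for r in range(5)]
edges = contour(image)
```

The result is zero both deep inside the square and far from it, and non-zero in a band around the boundary of the square.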

Fig.5. Contour detection result by morphology

Segmentation

Segmentation is a technique to divide an image into several

parts or areas based on similarities in color or shape [26]. The

Otsu method is a segmentation method that selects a threshold automatically from the shape of the image histogram, as shown in Fig. 6. The method requires the image histogram to be computed first and assumes that the digitized image contains only two classes of pixels. The algorithm iterates over the candidate thresholds and selects the optimal threshold, denoted T, that separates these two classes such that the intra-class variance is minimized. In our approach, the Otsu thresholding method is applied to the enhanced brain image (after edge detection). The obtained results are presented in Fig. 7. The "graythresh" function in Matlab is used to obtain the global threshold T that minimizes the intra-class variance of the black and white pixels [27].
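For readers without Matlab, the threshold search can be sketched in plain Python. The sketch maximizes the between-class variance, which is equivalent to minimizing the intra-class variance; it returns an integer gray level, whereas "graythresh" returns a level normalized to the range [0, 1]. The function name `otsu_threshold` and the toy pixel values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best splits pixels into <= t and > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split here
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2  # scaled between-class variance
        if between > best_between:
            best_t, best_between = t, between
    return best_t


# Bimodal toy image: dark pixels around 10 and bright pixels around 200.
toy_pixels = [10] * 50 + [12] * 50 + [200] * 50 + [205] * 50
threshold = otsu_threshold(toy_pixels)
binary = [1 if p > threshold else 0 for p in toy_pixels]
```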

Fig. 6. Bimodal histogram with selected threshold "T"

Fig. 7. Obtained results of the Otsu thresholding

Object Contour Detection

Boundary tracing of a binary image can be considered a segmentation technique that identifies the boundary pixels of a digital region, and it is an important first step in the analysis of that region [28]. After segmentation, we use the function "bwboundaries", which traces the outer boundaries of the objects, as well as the boundaries of the holes inside these objects, in the binary image; the option "noholes" restricts the search to the object boundaries [29]. The label matrix is then converted to an RGB color image to visualize the labeled regions. The "label2rgb" function is used to assign a color to each object according to the number of objects. The obtained results are presented in Fig. 8.

Fig. 8. Tracing the outer boundaries of MRI image objects

B. Feature Extraction by ResNet50

Deep learning is based on the principle of artificial neural

networks (ANN) and uses many layers for feature extraction

and conversion. In this step, ResNet50 is used for feature

extraction. It is a convolutional neural network trained on more

than one million images. It has a total of 177 layers which

correspond to a residual network of 50 layers. It can classify

images into 1000 object categories [30, 31].

Prepare Training and Test Image Sets

The image set used with our residual network is split randomly into two parts: 70% for training (546 healthy and tumor brain images) and 30% for testing (234 healthy and tumor brain images). Fig. 9 shows this division of our database.
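The reported split sizes can be reproduced with a simple random partition. The sketch below is a minimal example, assuming a shuffled index list and a hypothetical helper `split_indices`; the seed and the rounding rule are assumptions, not details given in the paper.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    n_train = round(train_fraction * n_samples)
    return order[:n_train], order[n_train:]


# 780 images in total give the 546 / 234 split reported for ResNet50.
train_idx, test_idx = split_indices(780)
# Splitting the 546 feature vectors again gives the 382 / 164 split used for the SVMs.
svm_train_idx, svm_test_idx = split_indices(546)
```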

Fig. 9. Database division

Extract Training Features Using ResNet50

ResNet50 is a convolutional neural network with a depth of 50 layers [32]. The architecture of ResNet50 is given in Fig. 10 and contains the following elements:

A convolution with a kernel size of 7*7 and 64 different kernels, all with a stride of 2, giving us 1 layer.

Then we obtain the maximum pooling with stride size

In the next convolution stage, we have a kernel size of 1*1, 64, then a kernel size of 3*3, 64, and finally a kernel size of 1*1, 256. These three layers are repeated 3 times, totaling 9 layers.

Similarly, we have a kernel size of 1*1, 128, then 3*3, 128, and finally 1*1, 512. These three layers are repeated 4 times, totaling 12 layers.

In the same manner, we have a kernel size of 1*1, 256, then 3*3, 256, and finally 1*1, 1024. These three layers are repeated 6 times, totaling 18 layers.

We continue with a kernel size of 1*1, 512, then 3*3, 512, and finally 1*1, 2048. These three layers are repeated 3 times, totaling 9 layers.

Finally, we apply average pooling and end with a fully connected layer containing 1000 nodes, followed by the Softmax function, giving us 1 layer. This architecture is shown in Fig. 10.
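The layer counts listed above add up to the 50 layers that give ResNet50 its name. The short check below is only a bookkeeping sketch; the stage labels are illustrative names chosen here, and pooling layers are not counted, as in the description above.

```python
# (stage label, layers per block, number of repetitions)
stages = [
    ("7x7 convolution, 64 kernels", 1, 1),
    ("1x1,64 / 3x3,64 / 1x1,256", 3, 3),
    ("1x1,128 / 3x3,128 / 1x1,512", 3, 4),
    ("1x1,256 / 3x3,256 / 1x1,1024", 3, 6),
    ("1x1,512 / 3x3,512 / 1x1,2048", 3, 3),
    ("fully connected (1000 nodes) with Softmax", 1, 1),
]

per_stage = [layers * repeats for _, layers, repeats in stages]
total_layers = sum(per_stage)
```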

The last layer, the fully connected layer (fc1000), is used to extract the features of our images. This yields a matrix of size [1000x546], where rows represent the features extracted for each image and columns represent the training images [33, 34].

Fig. 10. Architecture of ResNet50 model

C. Classification ResNet50

The ResNet50 architecture was tested on the test data containing 234 images. Of the 117 images without tumors, 109 are correctly classified and 8 are misclassified. Of the 117 images with tumors, 100 are correctly classified and 17 are misclassified. The accuracy (Eq. 1) of the classification is 89%. These results are shown in Table 1.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

TP: True positive (the tumor is present and detected).

TN: True negative (the tumor is absent and not detected).

FP: False positive (the tumor does not exist and is detected).

FN: False negative (the tumor exists and is not detected).

TABLE I
CONFUSION MATRIX OF RESNET50

| | No tumor | Tumor |
|---|---|---|
| No tumor | 109 (TN) | 8 (FP) |
| Tumor | 17 (FN) | 100 (TP) |
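As a worked check of Eq. 1, the counts of the confusion matrix can be plugged in directly. The sketch below uses a hypothetical helper `binary_metrics`; sensitivity and specificity are added here for illustration only and are not reported in the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of tumor images that are detected
        "specificity": tn / (tn + fp),  # share of healthy images that are cleared
    }


# Counts from the ResNet50 confusion matrix.
resnet50 = binary_metrics(tp=100, tn=109, fp=8, fn=17)
accuracy_percent = round(100 * resnet50["accuracy"])
```

With these counts the accuracy is 209/234, which rounds to the reported 89%.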

D. Classification using SVM-ResNet50 and Results

In this step, several SVM learning algorithms were used. The aim is to learn a model from the input data set for classifying the brain tumor images. Four kernel functions are used: linear, quadratic, cubic, and Gaussian. The bag of features obtained from the "fc1000" layer of ResNet50, in the form of a matrix of size [546x1000] (rows represent the images and columns represent the features), is used by our learning algorithm to train and test on our MRI images, in order to obtain a better classification and prediction of the tumor. The bag of features is split as follows: 70% for training (382 images) and 30% for testing (164 images). The data are labeled "0" for an image without tumor and "1" for an image with tumor. Fig. 11 shows the obtained SVM classification.

Fig. 11. Classification support vector machine

Linear SVM

This is the simplest model, in which the training samples are assumed to be linearly separable [35]. The linear function is given by Eq. 2.

$$f(x) = w^{T} x + b \qquad (2)$$

For each training sample $x_i$, the function gives $f(x_i) \geq 0$ for $y_i = +1$ and $f(x_i) < 0$ for $y_i = -1$ [36]. The data of the two classes are separated by the hyperplane $f(x) = w^{T} x + b = 0$, where $w$ is the weight vector, $b$ is the bias, and $x$ is the data. The characteristics of this model are shown in Table 2. The obtained results based on this model are given in Table 3.
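The decision rule of the linear model can be illustrated as follows. This is a minimal sketch: the function name, the two-dimensional toy weights, and the bias are hypothetical values chosen for the example, and the output uses the labels "0" (no tumor) and "1" (tumor) of this study.

```python
def linear_svm_predict(x, w, b):
    """Evaluate f(x) = w^T x + b and map its sign to the labels 0 and 1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    label = 1 if score >= 0 else 0
    return label, score


# Hypothetical weights and bias, for illustration only.
w_toy, b_toy = [1.0, 0.0], -0.5
label_a, score_a = linear_svm_predict([2.0, 0.0], w_toy, b_toy)
label_b, score_b = linear_svm_predict([-1.0, 0.0], w_toy, b_toy)
```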

TABLE II
LINEAR SVM CLASSIFIER FEATURES

| Classifier characteristics | |
|---|---|
| Preset | Linear SVM |
| Kernel function | Linear |
| Kernel scale | Automatic |
| Box constraint level | |
| Multiclass method | One-vs-One |
| Standardize data | True |

TABLE III
OBTAINED RESULTS CLASSIFYING LINEAR SVM

| Accuracy | Prediction speed | Training time |
|---|---|---|
| 92% | 1200 obs/sec | 6.2226 sec |

Of the 81 images without tumors, 76 are correctly classified and 5 are misclassified. Of the 82 images with tumors, 74 are correctly classified and 8 are misclassified. The confusion matrix of the linear SVM model is presented in Fig. 12.


Fig. 12. Confusion matrix of the linear SVM model

Quadratic SVM

In this model, a quadratic decision surface is used to separate the measurements of two or more classes of objects [37]. The characteristics of this model are presented in Table 4. The quadratic function is given by Eq. 3, where $A$ is the matrix of quadratic coefficients, $w$ is the weight vector, and $b$ is the bias.

$$f(x) = x^{T} A x + w^{T} x + b \qquad (3)$$

The obtained results based on this method are presented in Table 5.

TABLE IV
QUADRATIC SVM CLASSIFIER FEATURES

| Classifier characteristics | |
|---|---|
| Preset | Quadratic SVM |
| Kernel function | Quadratic |
| Kernel scale | Automatic |
| Box constraint level | |
| Multiclass method | One-vs-One |
| Standardize data | True |

TABLE V
OBTAINED RESULTS CLASSIFYING QUADRATIC SVM

| Accuracy | Prediction speed | Training time |
|---|---|---|
| 92% | 1200 obs/sec | 6.2532 sec |

Of the 81 images without tumors, 75 are correctly classified and 6 are misclassified. Of the 82 images with tumors, 75 are correctly classified and 7 are misclassified. The confusion matrix of the quadratic SVM model is shown in Fig. 13.

Fig. 13. Confusion matrix of the quadratic SVM model

Cubic SVM

The polynomial kernel is commonly used with support vector machines (SVMs). It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing nonlinear models to be learned [38]. The characteristics of this model are shown in Table 6. The cubic polynomial kernel function is given by Eq. 4, where $c \geq 0$ is a constant offset.

$$K(x, y) = \left(x^{T} y + c\right)^{3} \qquad (4)$$

The obtained results using this method are presented in Table 7.

TABLE VI
CUBIC SVM CLASSIFIER FEATURES

| Classifier characteristics | |
|---|---|
| Preset | Cubic SVM |
| Kernel function | Cubic |
| Kernel scale | Automatic |
| Box constraint level | |
| Multiclass method | One-vs-One |
| Standardize data | True |

TABLE VII
OBTAINED RESULTS CLASSIFYING CUBIC SVM

| Accuracy | Prediction speed | Training time |
|---|---|---|
| 93% | 1300 obs/sec | 7.1951 sec |

Of the 81 images without tumors, 76 are correctly classified and 5 are misclassified. Of the 82 images with tumors, 76 are correctly classified and 6 are misclassified. The confusion matrix of the cubic SVM model is shown in Fig. 14.

Fig. 14. Confusion matrix of the cubic SVM model

Medium Gaussian SVM

The Gaussian kernel is a popular kernel function used in many machine learning algorithms, especially in support vector machines (SVMs), where it helps to minimize both the estimation and approximation errors of the classifier [39, 40]. The characteristics of this model are shown in Table 8. The Gaussian kernel function is given by Eq. 5, where σ is the standard deviation.

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right) \qquad (5)$$

The obtained results using this method are presented in Table 9.

TABLE VIII
MEDIUM GAUSSIAN SVM CLASSIFIER FEATURES

| Classifier characteristics | |
|---|---|
| Preset | Medium Gaussian SVM |
| Kernel function | Gaussian |
| Kernel scale | |
| Box constraint level | |
| Multiclass method | One-vs-One |
| Standardize data | True |

TABLE IX
OBTAINED RESULTS CLASSIFYING MEDIUM GAUSSIAN SVM

| Accuracy | Prediction speed | Training time |
|---|---|---|
| 94% | 1500 obs/sec | 7.7308 sec |

Of the 81 images without tumors, 77 are correctly classified and 4 are misclassified. Of the 82 images with tumors, 76 are correctly classified and 6 are misclassified. The confusion matrix of the medium Gaussian SVM model is shown in Fig. 15.

Fig. 15. Confusion matrix of the medium Gaussian SVM model
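The kernels behind the four SVM presets can be written compactly. The sketch below is illustrative only: the function names are hypothetical, the polynomial kernel of degree 2 or 3 stands in for the quadratic and cubic presets, and the default offset `c = 1.0` is an assumption rather than a value taken from the paper.

```python
import math


def dot(x, y):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree, c=1.0):
    """Polynomial kernel (x^T y + c)^degree; degree 2 is quadratic, 3 is cubic."""
    return (dot(x, y) + c) ** degree


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


u, v = [1.0, 2.0], [3.0, 4.0]
k_linear = linear_kernel(u, v)
k_quadratic = polynomial_kernel(u, v, degree=2)
k_cubic = polynomial_kernel(u, v, degree=3)
k_gauss = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

The Gaussian kernel equals 1 for identical vectors and decays with distance, at a rate controlled by σ.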

XXI. COMPARISON AND DISCUSSION

The different SVM methods used in this contribution are summarized in Table 10, which presents the accuracy obtained by each method alongside that reported by Bingol and Alatas [41].

TABLE X
COMPARISON OF THE ACCURACY OBTAINED BY EACH METHOD

| | Methods | Accuracy |
|---|---|---|
| Present methods | ResNet50 | 89% |
| | Linear SVM | 92% |
| | Quadratic SVM | 92% |
| | Cubic SVM | 93% |
| | Medium Gaussian SVM | 94% |
| Bingol and Alatas (2021) [41] | Deep learning classification (ResNet50) | 85.71% |

In our study, we employed a hybrid approach that combines ResNet50 for feature extraction with various SVM methods for the classification of brain tumor MRI images. We achieved accuracy rates ranging from 89% to 94%, depending on the method used.

ResNet50 alone achieved an accuracy of 89% for tumor image identification. The linear and quadratic SVMs each reached 92%, the cubic SVM achieved 93%, and the medium Gaussian SVM obtained the highest accuracy of 94%.
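The SVM accuracies quoted above follow from the per-class counts reported with each confusion matrix (81 images without tumors and 82 with tumors). The snippet below is a small consistency check; the dictionary layout is an assumption made for the example.

```python
# (correctly classified without tumor, correctly classified with tumor)
correct_counts = {
    "Linear SVM": (76, 74),
    "Quadratic SVM": (75, 75),
    "Cubic SVM": (76, 76),
    "Medium Gaussian SVM": (77, 76),
}
n_test = 81 + 82

svm_accuracy = {
    name: round(100 * (tn + tp) / n_test)
    for name, (tn, tp) in correct_counts.items()
}
best_model = max(svm_accuracy, key=svm_accuracy.get)
```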

In comparison, the study by Bingol and Alatas [41], which also employs the ResNet50 architecture, reports an accuracy of 85.71% for brain tumor detection from MRI images. That study uses the AlexNet, GoogLeNet, and ResNet50 architectures, with ResNet50 achieving the best accuracy among the three models.

The performance gap between our results and those of [41] can be attributed to several factors. We used a similar dataset sourced from the Kaggle database, which makes the comparison relevant. However, the preprocessing and segmentation methods differ between the two studies. In our approach, we applied image preprocessing techniques such as contrast adjustment, Otsu thresholding, and morphological operations including erosion followed by dilation to enhance contour detection. These processing steps may explain the higher accuracy of our model compared to that reported in [41].

Moreover, our hybrid approach combining ResNet50 for feature extraction with SVMs for classification appears to offer additional advantages over the standalone use of ResNet50. The SVM models, particularly the medium Gaussian SVM, achieved the highest accuracies and surpassed the results reported in [41] on this classification task.

It is also worth noting that using a larger dataset or additional regularization methods could further enhance our model's performance. In the work of [41], although ResNet50 produced relatively good results, combining it with techniques such as SVMs could potentially push the accuracy beyond 85.71%.

XXII. CONCLUSION

In this contribution, we have presented a new approach to improve the accuracy of brain cancer diagnosis. This approach to detecting and identifying brain tumors is based on Otsu segmentation and classification using ResNet50, for which the obtained accuracy rate is 89%. For the hybrid classification combining ResNet50 with different SVMs (linear, quadratic, cubic, and medium Gaussian), the accuracy rates are 92%, 92%, 93%, and 94%, respectively. The ResNet50 architecture was applied for deep extraction of features from our MRI images, and these features are used as input to the SVM classification. From these results, we observe that ResNet50 provides an informative bag of features for the classification of our MRI images, and that the hybrid system combining SVM and ResNet50 gives better accuracy than ResNet50 alone. The best hybridization, the medium Gaussian SVM with ResNet50, gave the highest accuracy rate of 94%.


REFERENCES

[1] M. S. Lesniak and H. Brem, "Targeted therapy for brain tumours," Nat. Rev. Drug Discov., 2004.

[2] J. S. Barnholtz-Sloan, Q. T. Ostrom, and D. Cote, "Epidemiology of brain tumors," Neurol. Clin., 2018.

[3] S. Alsubai, H. U. Khan, A. Alqahtani, M. Sha, A. Sidra, and U. G. Mohammad, "Ensemble deep learning for brain tumor detection," Front. Comput. Neurosci., 2022.

[4] N. Gordillo, E. Montseny, and P. Sobrevilla, "State of the art survey on MRI brain tumor segmentation," Magn. Reson. Imaging, 2013.

[5] M. C. Fernandez and R. Guillevin, "L'intelligence artificielle au service de l'imagerie et de la santé des femmes," Imager. Femme, 2019.

[6] N. Noreen, S. Palaniappan, A. Qayyum, I. Ahmad, M. Imran, and M. Shoaib, "A deep learning model based on concatenation approach for the diagnosis of brain tumor," IEEE Access, 2020.

[7] H. H. Sultan, N. M. Salem, and W. Al-Atabany, "Multi-classification of brain tumor images using deep neural network," IEEE Access, 2019.

[8] S. Chinnam, V. P. K. Sistla, and V. K. K. Kolli, "SVM-PUK Kernel Based MRI-brain Tumor Identification Using Texture and Gabor Wavelets," Trait. Signal, 2019.

[9] D. T. Blumenthal, M. Artzi, G. Liberman, F. Bokstein, O. Aizenstein, and D. B. Bashat, "Classification of high-grade glioma into tumor and nontumor components using support vector machine," Am. J. Neuroradiol., 2017.

[10] A. A. Abbood, Q. M. Shallal, and M. A. Fadhel, "Automated brain tumor classification using various deep learning models: a comparative study," Indones. J. Electr. Eng. Comput. Sci., 2021.

[11] V. Anitha and S. Murugavalli, "Brain tumour classification using two-tier classifier with adaptive segmentation technique," IET Comput. Vis., 2016.

[12] A. Biswas and M. S. Islam, "A Hybrid Deep CNN-SVM Approach for Brain Tumor Classification," J. Inf. Syst. Eng. Bus. Intell., vol. 9, no. 1, 2023.

[13] S. Suryawanshi and S. B. Patil, "Efficient Brain Tumor Classification with a Hybrid CNN-SVM Approach in MRI," J. Adv. Inf. Technol., vol. 15, no. 3, pp. 340–354, 2024.

[14] M. Basthikodi, M. Chaithrashree, B. M. Ahamed Shafeeq, and A. P. Gurpur, "Enhancing multiclass brain tumor diagnosis using SVM and innovative feature extraction techniques," Sci. Rep., vol. 14, no. 1, p. 26023, 2024.

[15] O. Özkaraca et al., "Multiple Brain Tumor Classification with Dense CNN Architecture Using Brain MRI Images," Life, vol. 13, no. 2, 2023.

[16] S. M. Alqhtani et al., "Improved Brain Tumor Segmentation and Classification in Brain MRI With FCM-SVM: A Diagnostic Approach," IEEE Access, vol. 12, pp. 61312–61335, 2024.

[17] S. Saeedi, S. Rezayi, H. Keshavarz, and S. R. Niakan Kalhori, "MRI-based brain tumor detection using convolutional deep learning methods and chosen machine learning techniques," BMC Med. Inform. Decis. Mak., vol. 23, no. 1, pp. 1–17, 2023.

[18] R. E. Fan, P. H. Chen, C. J. Lin, and T. Joachims, "Working set selection using second order information for training support vector machines," J. Mach. Learn. Res., 2005.

[19] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.

[20] D. Ruppert, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," J. Am. Stat. Assoc., 2004.

[21] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Adv. Neural Inf. Process. Syst., 2015.

[22] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," Springer Int. Publ., 2014.

[23] S. Bhuvaji, A. Kadam, P. Bhumkar, and S. Dedge, "Brain Tumor Classification (MRI)," Kaggle, 2020.

[24] P. Maragos, "Differential morphology and image processing," IEEE Trans. Image Process., 1996.

[25] H. J. A. M. Heijmans and C. Ronse, "The algebraic basis of mathematical morphology I. Dilations and erosions," Comput. Vis. Graph. Image Process., 1990.

[26] M. Huang, W. Yang, Y. Wu, J. Jiang, W. Chen, and Q. Feng, "Brain tumor segmentation based on local independent projection-based classification," IEEE Trans. Biomed. Eng., 2014.

[27] C. Sha, J. Hou, and H. Cui, "A robust 2D Otsu's thresholding method in image segmentation," J. Vis. Commun. Image Represent., 2016.

[28] V. Kovalevsky, Image Processing with Cellular Topology, Springer Singapore Pte Ltd, 2021.

[29] N. L. Narappanawar, B. M. Rao, T. Srikanth, and M. Joshi, "Vector algebra based tracing of external and internal boundary of an object in binary images," J. Adv. Eng. Sci., 2010.

[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Commun. ACM, 2017.

[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint, 2014.

[32] M. K. Panda, A. Sharma, V. Bajpai, B. N. Subudhi, V. Thangaraj, and V. Jakhetiya, "Encoder and decoder network with ResNet-50 and global average feature pooling for local change detection," Comput. Vis. Image Underst., 2022.

[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.

[34] L. Ali, F. Alnajjar, H. A. Jassmi, M. Gocho, W. Khan, and M. A. Serhani, "Performance evaluation of deep CNN-based crack detection and localization techniques for concrete structures," Sensors, 2021.

[35] S. Roy, S. Nag, I. K. Maitra, and S. K. Bandyopadhyay, "A review on automated brain tumor detection and segmentation from MRI of brain," arXiv preprint, 2013.

[36] A. Singh, "Detection of brain tumor in MRI images using combination of fuzzy c-means and SVM," Proc. 2nd Int. Conf. Signal Process. Integr. Netw., IEEE, 2015.

[37] S. Krishnakumar and K. Manivannan, "Effective segmentation and classification of brain tumor using rough K means algorithm and multi kernel SVM in MR images," J. Ambient Intell. Humaniz. Comput., 2021.

[38] J. Amin, M. Sharif, M. Yasmin, and S. L. Fernandes, "A distinctive approach in brain tumor detection and classification using MRI," Pattern Recognit. Lett., 2020.

[39] S. Ruan, S. Lebonvallet, A. Merabet, and J. M. Constans, "Tumor segmentation from multispectral MRI images by using support vector machine classification," Proc. 4th IEEE Int. Symp. Biomed. Imaging, 2007.

[40] N. B. Bahadure, A. K. Ray, and H. P. Thethi, "Image analysis for MRI based brain tumor detection and feature extraction using biologically inspired BWT and SVM," Int. J. Biomed. Imaging, 2017.

[41] H. Bingol and B. Alatas, "Classification of brain tumor images using deep learning methods," Turk. J. Sci. Technol., 2021.


Feature Extraction and Machine Learning for Date Fruit Classification

Ikram Kourtiche1, Mostefa Bendjima2, Mohammed El Amin Kourtiche3

1,2,3 Tahri Mohamed University, Mathematics and Computer Science Department, TIT Laboratory, Bechar, Algeria

1kourtiche.ikram@univ-bechar.dz,

2bendjima.mostefa@univ-bechar.dz,

3kourtiche.amin@univ-bechar.dz

Abstract— Dates are important in many parts of the world,

particularly in North Africa and the Middle East. As a highly

nutritious fruit with strong demand in both local and international

markets, the classification and quality control of dates play a

crucial role in enhancing their commercial value. This work focuses on improving date fruit classification by applying data augmentation techniques to enrich the original dataset and then employing three pre-trained CNN models, ResNet50, EfficientNetB0, and DenseNet201, for feature extraction. The

extracted features were then classified using traditional machine

learning algorithms: Support Vector Machine (SVM), K-Nearest

Neighbors (KNN), Logistic Regression (LR), and Random Forest

(RF). The best performance was achieved using ResNet50 as a

feature extractor with logistic regression for classification,

reaching an accuracy of 97.42%.

Keywords—Date fruit, classification, feature extraction, pre-

trained CNN, machine learning.

I. INTRODUCTION

In recent years, the rapid progress of artificial intelligence (AI) has brought about significant transformations across a wide range of sectors, including agriculture [1]. AI has become an indispensable tool for addressing complex agricultural challenges by offering innovative solutions that enhance both efficiency and sustainability on a global scale [2]. Among these advancements, deep learning has been instrumental in revolutionizing various agricultural practices, including fruit classification [3]. One fruit that has garnered increasing attention in this context is the date, known for its high nutritional value. It is rich in carbohydrates, minerals, and vitamins, and is recognized for its potential health benefits, such as reducing the risk of cancer and cardiovascular diseases. Globally, date production is substantial, with an estimated annual output of approximately 8.46 million tons [4].

In recent years, many studies have been published on the classification of date fruits. A date fruit classification system was developed in [5] to identify six date types, with features extracted by CNN models from a dataset of 2246 images. Among the compared models MobileNetV1, Inception, and ResNet, MobileNetV1 had the highest accuracy (82.67%). In [6], transfer learning was employed to classify images using the pre-trained models MobileNetV2, VGG19, and ResNet50. The VGG19 model achieved the best classification accuracy (95%) among these models.

Altaheri et al. [7] introduced a machine vision framework for

classifying date fruits according to their type, maturity, and harvest

readiness in a natural orchard setting. This framework leverages

deep convolutional neural networks (CNNs) and transfer learning

to achieve high classification accuracy, utilizing a dataset of 8000

images. Notably, the framework achieved a type classification

accuracy of 99.01%.

In [4], researchers evaluated various algorithms, including Decision Tree, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM), for classifying seven date varieties. The neural network model yielded the highest accuracy at 93.85%.

Alsirhani et al. [8] presented a deep transfer learning approach

for the classification of 27 distinct date varieties using a dataset

of 3228 images. By fine-tuning a DenseNet201 model, the

researchers attained a test accuracy of 95.21%.

A study conducted in [9] investigated a comprehensive dataset comprising 8,000 images of five distinct date fruit varieties. The performance of four pre-trained deep learning models, GoogleNet, ResNet-50, DenseNet, and AlexNet, was evaluated on this dataset. The results indicate that ResNet-50 outperformed the other models, achieving an accuracy rate of 97.37%.

In our study, we propose a method for classifying date fruits

using feature extraction from three pre-trained convolutional

neural network models: ResNet50, DenseNet201, and

EfficientNetB0. The features extracted from each model are

subsequently classified using various machine learning


algorithms, including Support Vector Machine (SVM), K-

Nearest Neighbors (KNN), Logistic Regression (LR), and

Random Forest (RF).

The remainder of this paper is structured as follows: After

the introduction, we describe materials and methods used in this

study. The next section presents the experimental results,

followed by a discussion of the findings. Finally, the conclusion

is drawn in the last section.

II. MATERIALS AND METHODS

PRE-TRAINED MODELS

ResNet-50, a residual neural network, has 50 layers and is built by sequentially stacking residual blocks. The architecture has 48 convolutional layers, one max pooling layer, and one average pooling layer. ResNet-50 is a popular image classification model [10].

EfficientNet-B0 is a convolutional neural network

optimized for high performance with fewer parameters. It uses

depthwise separable convolutions and squeeze-and-excitation

(SE) modules, gradually decreasing spatial resolution and

increasing channels. The architecture balances depth, width,

and input resolution, achieving high accuracy at low computational cost [10].

DenseNet201 is a convolutional neural network in which each layer is connected directly to the subsequent layers in a feed-forward fashion, which mitigates gradient degradation and overfitting in deep learning applications. Its architecture reuses the feature maps of earlier layers as inputs at each layer, reduces the number of parameters, and improves performance. DenseNet201, the 201-layer version, employs this compact architecture to produce models that are easy to train and highly efficient [11].

Classification methods

Support Vector Machine (SVM): a well-established traditional approach in machine learning, commonly used for both classification and regression tasks. It works by mapping the data features into a higher-dimensional space in which a boundary, or hyperplane, can be established for classification. The SVM identifies a linear discriminant function that maximizes the margin between the different classes of data. Support vectors, the data points closest to the classification boundary, play a crucial role in defining this boundary. SVM is well known for its accuracy and versatility, making it a popular choice in many applications [12].

Random Forest (RF): This method builds on the decision tree, which is extensively employed for categorizing large datasets and identifying data that share common traits. A decision tree divides the data into smaller subsets iteratively, producing a structured tree of decision nodes and leaf nodes that yields the final classification outcome. A random forest trains many such trees on random subsets of the samples and features and combines their votes [12].

K-Nearest Neighbors (KNN): The operating principle of the KNN classifier is direct and intuitive: it assigns a category to each sample based on the classes of its nearest neighbors. This classification method, known as memory-based classification, requires storing the training samples in memory for reference during analysis [13]. In this paper, the parameter k is set to 9.
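The nearest-neighbor vote with k = 9 can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name and the two-dimensional toy features are hypothetical and stand in for the CNN feature vectors.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=9):
    """Return the majority label among the k training samples closest to `query`."""
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters with hypothetical labels.
toy_X = ([(0.01 * i, 0.0) for i in range(10)]
         + [(5.0 + 0.01 * i, 5.0) for i in range(10)])
toy_y = ["Ajwa"] * 10 + ["Sokari"] * 10

pred_near_origin = knn_predict(toy_X, toy_y, (0.1, 0.1))
pred_near_five = knn_predict(toy_X, toy_y, (5.0, 5.0))
```

The stored training samples are consulted at prediction time, which is why the method is described as memory-based.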

Logistic Regression (LR): a commonly used statistical

method for modeling the probability of a binary outcome

based on one or more explanatory variables. Its primary goal

is to estimate the coefficients of a linear model that relates the

logarithm of the odds (log-odds) to the independent variables

[14].
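For reference, the linear relation between the log-odds and the explanatory variables described above can be written out as follows. The symbols are introduced here for illustration: $p$ is the probability of the positive outcome, $x_1, \dots, x_n$ are the explanatory variables, and $\beta_0, \dots, \beta_n$ are the coefficients to be estimated.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```

With more than two classes, as with the nine date varieties considered here, a multinomial extension of this model is typically used; the paper does not state which variant was applied.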

Dataset

This dataset, referred to as the Saudi Arabian Dataset, consists of 1658 images, each depicting one of nine date fruit varieties native to Saudi Arabia: Ajwa, Galaxy, Medjool, Nabtat Ali, Sokari, Rutab, Shaishe, Sugaey, and Meneifi, as shown in Fig. 1. A controlled environment was constructed to photograph the nine varieties. The imaging setup consists of a mounted DSLR camera (Canon EOS 550D) with the flash enabled and a ring light with a 48-centimeter diameter and 240 LED bulbs set to 100% brightness. The ring light was used to eliminate shadows by surrounding the date with light on all sides; the camera flash provides a strong, sudden light at the center to emphasize the fleshiness or flabbiness of the date [15].

Fig. 1 Samples of date fruit dataset images

Data augmentation

A significant aspect of this study is the use of data

augmentation to enhance model performance. Data

augmentation is a crucial strategy in machine learning that

involves artificially increasing the size and diversity of a dataset

by applying various transformations to the existing data. In this

context, several augmentation techniques were employed,

including:

Rescaling: The process of adjusting the size of

images.

Random zoom: Modifies the image scale to simulate

varying distances.

Flipping: Involves mirroring the images to create

variations.


Width and height shifts: Slightly reposition the images to account for different positions.

Random rotations: Rotate the images at various

angles.
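Some of the transformations listed above can be sketched on a small image stored as a list of rows. This is a simplified illustration with hypothetical helper names, not the pipeline used in the study: rotations are restricted to multiples of 90 degrees, the shift is a horizontal shift with zero fill, and rescaling and zoom are omitted.

```python
import random


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def shift_right(img, dx, fill=0):
    """Shift the image dx pixels to the right, filling the gap with `fill`."""
    width = len(img[0])
    return [[fill] * dx + list(row[:width - dx]) for row in img]


def augment(img, rng):
    """Apply a random flip, a random multiple-of-90-degree rotation, and a small shift."""
    out = [list(row) for row in img]
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):
        out = rotate_90(out)
    return shift_right(out, rng.randrange(2))


toy_img = [[1, 2], [3, 4]]
augmented = augment(toy_img, random.Random(0))
```

Each call produces a new variant of the same image, which is how the dataset can grow without new photographs being taken.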

These techniques are designed to improve the model's ability

to generalize by exposing it to a wider variety of data

representations. By augmenting the datasets in this manner, the

study aims not only to enhance classification accuracy but also

to ensure that the models can effectively recognize and

differentiate between various date varieties. After applying data

augmentation, the dataset comprises 3460 images of Saudi

Arabian date fruit.

After augmentation, the dataset was divided into two

subsets: 80% for training and 20% for testing.

III. EXPERIMENTAL RESULTS AND DISCUSSIONS

Evaluation metrics used

In our study, we used several evaluation metrics to assess the performance of our models. Precision, recall, F1-score, and accuracy were computed from the predicted classes based on the following quantities: the number of false negatives (FN), false positives (FP), true negatives (TN), and true positives (TP). They are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$
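For a multi-class problem such as the nine date varieties, these metrics are computed per class from the confusion matrix and then averaged. The sketch below assumes macro averaging and a matrix whose rows are true classes and whose columns are predicted classes; the paper does not state which averaging was used, and the function name and toy matrix are hypothetical.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, and F1 from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1_scores = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted as class i, but wrong
        fn = sum(cm[i]) - tp                       # class i samples that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Toy three-class confusion matrix, for illustration only.
toy_cm = [[5, 0, 0],
          [0, 4, 1],
          [0, 1, 4]]
toy_scores = macro_metrics(toy_cm)
```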

Results

Our experiments were run from a computer with an Intel(R) Core(TM) i5-6300U CPU and 8 GB of RAM, using Kaggle,

a cloud-based platform that enables users to write and execute

Python code directly in their web browsers. Kaggle is

particularly advantageous for machine learning, data analysis,

and deep learning tasks, as it offers GPU support for accelerated

computation. This environment facilitates efficient

experimentation and model training by providing access to

powerful resources and tools tailored for data science

applications.

The results of our study are summarized in Tables I, II, and III. The results in Table I indicate that the LR classifier achieved the highest performance among the compared algorithms, with a testing accuracy of 97.42%, a recall of 97.42%, an F1-score of 97.42%, and a precision of 97.44%. These results highlight the effectiveness of feature extraction using ResNet50. Table II presents the results after feature extraction using EfficientNetB0, where the best performance was again achieved with the LR classifier, reaching an accuracy of 97.27%. Table III shows the results after feature extraction using DenseNet201, where the LR classifier obtained the highest accuracy of 96.84%.

Logistic Regression outperforms all other models across the

three feature extraction methods (ResNet-50, EfficientNetB0,

DenseNet201). Its strong performance is likely due to the

extracted features being well-structured and linearly separable.

LR remains a simple, efficient choice for this kind of

classification task.

TABLE I. PERFORMANCE METRICS FOR RESNET-50 EXTRACTED FEATURES

Models

Accuracy

Recall

Precision

F1-score

SVM

93.26%

93.33%

93.25%

97.42%

97.44%

97.42%

KNN

85.51%

86.28%

85.61%

89.81%

89.87%

89.79%

TABLE II. PERFORMANCE METRICS FOR EFFICIENTNETB0 EXTRACTED

FEATURES

Models

Accuracy

Recall

Precision

F1-score

SVM

94.26%

94.37%

94.26%

97.27%

97.32%

97.28%

KNN

88.24%

88.63%

88.29%

90.67%

90.70%

90.67%

90.63%

TABLE III. PERFORMANCE METRICS FOR DENSENET201 EXTRACTED FEATURES

Models

Accuracy

Recall

Precision

F1-score

SVM

94.12%

94.13%

94.11%

96.84%

96.91%

96.85%

KNN

88.52%

89.11%

88.52%

93.11%

93.24%

93.52%

Figures 2, 3, and 4 show the confusion matrices for the best-performing methods using features extracted with ResNet50, EfficientNetB0, and DenseNet201, respectively.

Fig. 2 Confusion matrix for LR using Resnet-50

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Fig. 3 Confusion matrix for LR using EfficientNetB0

Fig. 4 Confusion matrix for LR using DenseNet201
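A confusion matrix of the kind shown in Figs. 2 to 4 can be built from true and predicted labels as sketched below. This is an illustrative sketch only: the function names are hypothetical, and rows are assumed to index the true class and columns the predicted class.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count samples per (true class, predicted class) pair.

    Rows correspond to true classes and columns to predicted classes,
    both in the order given by `labels`.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, for each class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which date varieties are confused with each other.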

Our approach is compared with several recent state-of-the-art techniques in Table IV. It achieves the highest accuracy among the compared methods, using a dataset with 9 different types of date fruit.

TABLE IV. COMPARISON WITH STATE-OF-THE-ART METHODS

Ref | Year | Technique | Date type | Best accuracy
[5] | 2021 | Various pre-trained models (MobileNet, Inception, and ResNet) | | MobileNetV1, 82.67%
[9] | 2021 | GoogleNet, ResNet50, DenseNet, and AlexNet | | ResNet50, 97.37%
[14] | 2021 | Stacking model created by combining LR and ANN | | 92.80%
[16] | 2019 | Feature extraction + combination of several hidden layers | | 97.20%
[17] | 2019 | VGG16 | | 96.98%
Our study | | Feature extraction using ResNet50, DenseNet201, EfficientNetB0, and several machine learning algorithms | | Feature extraction using ResNet50 + LR, 97.42%

IV. CONCLUSION

The objective of this study was to develop a classification system for date fruits using features extracted by three pre-trained convolutional neural network models: ResNet50, EfficientNetB0, and DenseNet201. The extracted features were then classified using traditional machine learning algorithms, including Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbors (KNN). This research aims to support and improve agricultural practices related to date fruit classification.

For future work, we plan to apply this approach to other agricultural products, with the aim of improving classification accuracy.

References

[1] S. Sukkasem, W. Jitsakul, and P. Meesad, "Fruit Classification with Deep Transfer Learning using Image Processing," in 2023 7th International Conference on Information Technology (InCIT), Chiang Rai, Thailand: IEEE, Nov. 2023, pp. 464-469. doi: 10.1109/InCIT60207.2023.10413036.

[2] S. Meghwanshi, "ARTIFICIAL INTELLIGENCE IN AGRICULTURE: A REVIEW," Open Access, vol. 06, no. 03.

[3] H. S. Gill and B. S. Khehra, "An integrated approach using CNN-RNN-LSTM for classification of fruit images," Materials Today: Proceedings, vol. 51, pp. 591-595, 2022, doi: 10.1016/j.matpr.2021.06.016.

[4] Ö. Özaltin, "Date Fruit Classification by Using Image Features Based on Machine Learning Algorithms," Research in Agricultural Sciences, vol. 55, no. 1, pp. 26-35, Jan. 2024, doi: 10.5152/AUAF.2024.23171.

[5] Md. A. Khayer, Md. S. Hasan, and A. Sattar, "Arabian Date Classification using CNN Algorithm with Various Pre-Trained Models," in 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India: IEEE, Feb. 2021, pp. 1431-1436. doi: 10.1109/ICICV50876.2021.9388413.

[6] H. Bichri, A. Chergui, and M. Hain, "Image Classification with Transfer Learning Using a Custom Dataset: Comparative Study," Procedia Computer Science, vol. 220, pp. 48-54, 2023, doi: 10.1016/j.procs.2023.03.009.

[7] H. Altaheri, M. Alsulaiman, and G. Muhammad, "Date Fruit Classification for Robotic Harvesting in a Natural Environment Using Deep Learning," IEEE Access, vol. 7, pp. 117115-117133, 2019, doi: 10.1109/ACCESS.2019.2936536.

[8] A. Alsirhani, M. H. Siddiqi, A. M. Mostafa, M. Ezz, and A. A. Mahmoud, "A Novel Classification Model of Date Fruit Dataset Using Deep Transfer Learning," Electronics, vol. 12, no. 3, p. 665, Jan. 2023, doi: 10.3390/electronics12030665.

[9] A. Al-Sabaawi, R. I. Hasan, M. A. Fadhel, O. Al-Shamma, and L. Alzubaidi, "Employment of Pre-trained Deep Learning Models for Date Classification: A Comparative Study," in Intelligent Systems Design and Applications, A. Abraham, V. Piuri, N. Gandhi, P. Siarry, A. Kaklauskas, and A. Madureira, Eds., Advances in Intelligent Systems and Computing, vol. 1351. Cham: Springer International Publishing, 2021, pp. 181-189. doi: 10.1007/978-3-030-71187-0_17.

[10] N. Ahmed, M. Rahman, and F. Ishrak, "Comparative Performance Analysis of Transformer-Based Pre-Trained Models for Detecting Keratoconus Disease."

[11] S. Dümen, E. Kavalcı Yılmaz, K. Adem, and E. Avaroglu, "Performance of vision transformer and swin transformer models for lemon quality classification in fruit juice factories," Eur Food Res Technol, vol. 250, no. 9, pp. 2291-2302, Sep. 2024, doi: 10.1007/s00217-024-04537-5.

[12] T. S. Xian and R. Ngadiran, "Plant Diseases Classification using Machine Learning," J. Phys.: Conf. Ser., vol. 1962, no. 1, p. 012024, Jul. 2021, doi: 10.1088/1742-6596/1962/1/012024.

[13] H. Y. Bayram, H. Bingol, and B. Alatas, "Hybrid Deep Model for Automated Detection of Tomato Leaf Diseases," TS, vol. 39, no. 5, pp. 1781-1787, Nov. 2022, doi: 10.18280/ts.390537.

[14] M. Koklu, R. Kursun, Y. S. Taspinar, and I. Cinar, "Classification of Date Fruits into Genetic Varieties Using Image Analysis," Mathematical Problems in Engineering, vol. 2021, pp. 1-13, Nov. 2021, doi: 10.1155/2021/4793293.

[15] W. Alhamdan and J. M. Howe, "Classification of date fruits in a controlled environment using convolutional neural networks," in Advanced Machine Learning Technologies and Applications, vol. 9, (1) Springer, 2021, pp. 154-163. doi: 10.1007/978-3-030-69717-4_16.

[16] A. Magsi, J. Ahmed Mahar, and S. H. Danwar, "Date Fruit Recognition using Feature Extraction Techniques and Deep Convolutional Neural Network," Indian Journal of Science and Technology, vol. 12, no. 32, pp. 1-12, Aug. 2019, doi: 10.17485/ijst/2019/v12i32/146441.

[17] A. Nasiri, A. Taheri-Garavand, and Y.-D. Zhang, "Image-based deep learning automated sorting of date fruit," Postharvest Biology and Technology, vol. 153, pp. 133-141, Jul. 2019, doi: 10.1016/j.postharvbio.2019.04.003.

Collaborative business process: A formal

verification and validation

1st Hanane Ouaar

Computer Science Department, Biskra University

LINFI Laboratory

Biskra, Algeria

hanane.ouaar@univ-biskra.dz

Abstract—This work addresses the formal verification and validation of business processes (BP). Its goals are the design, specification, and implementation of a simulation application for an automobile assembly line. This case study was selected because it offers several perspectives on a number of company operations, including the administrator's perspective, which covers component settings, account administration, and system configuration. From the technical team's perspective, identifying damage, carrying out on-site maintenance, publishing the currently tracked measures, and editing reports to understand the state of the existing system are all important. The supply chain and the robotic arms are the business processes of the system identified during the analysis phase. UML diagrams are used in the design phase. The specification phase uses the UPPAAL software tool to develop timed automata in order to formally validate the system behavior through CTL, system synchronization, business process simulation, and validation of most of the properties. Suitable software tools are then used to build the simulation system in the implementation phase.

Index Terms—Business processes, car assembly line, robotic arms, supply chain application, UML, UPPAAL, CTL, simulation.

I. INTRODUCTION

Business process management (BPM) [3] is a discipline for managing the life cycle of business processes (BP) [1], from the modeling stage to process enactment and improvement, while accounting for all the stakeholders involved. Both the computer science and business administration communities have recently paid close attention to BPM. BPM is a well-established field that drives business success through effective and efficient business processes. Capability frameworks, which outline and group the capability areas relevant to implementing BPM or to process orientation in businesses, are a standard way to structure BPM. In addition, by facilitating better coordination between technology and human resources, BPM seeks to help businesses become more effective. It improves the visibility of company operations and streamlines procedures [2]. As a result, businesses that embrace BPM techniques and technology get a quick return on investment and improve the efficiency of their current systems [3]. Moreover, BP provide a means of coordinating interactions between workers and organizations in a structured way. However, the dynamic nature of the modern business environment means that some BP should be externalized, i.e., the organization should accept new BP from outside or let local BP move beyond its boundary [1]. The challenge is therefore to provide flexibility and external process support at the same time. Moreover, current BPM suffers from limitations in optimization due to the lack of good monitoring methods, even though the control of internal and external BP is what achieves both the company's business strategy and its global objectives [3].

The assembly line [4] has been selected as the case study of a modernized information system because its features best suit our modeling approach. In addition, because of the dynamic nature and varied properties of its business processes, it lends itself to a synchronized simulation application. The work is organized in four phases. During the analysis phase, the public and private business processes of the automobile assembly line are identified and their constituent parts are described. During the design phase, we use UML models to describe business process management graphically in a simple way: different kinds of UML diagrams (static behavior and dynamic behavior) are applied to the car assembly line to identify and develop its business processes [5]. During the specification phase, we use UPPAAL [6] as the software tool to evaluate and validate our system, modeled as timed automata. CTL (Computation Tree Logic) [8] is used with model checking [7] to formally prove the properties of the modeled system. Lastly, the implementation phase delivers the application as a simulation system of the business processes, with software services that are accessible to all users through various reporting and configuration screens.

II. OVERVIEW

Several paradigms have been used for the formalization and verification of BP models, such as colored Petri Nets, Pi-Calculus, and Timed Automata. Here are some related works that used these paradigms.

The authors of [10] use a Petri net model to describe BP together with their resource consumption, in order to verify BP properties in cloud computing. The aim is to verify the efficiency of the initial resource provisioning between different BP services and the partial elasticity of BP in cloud computing based on the initial allocation of resources. They propose a verification methodology based on a formal model to verify resource consumption properties and select services with low resource consumption, in order to reduce the cost of elasticity of BP resources in cloud computing.

To support business process reengineering (BPR) efforts, [11] proposed a framework based on high-level Petri nets. This framework is used to model and analyze business processes. The use of high-level Petri nets provides advanced analysis techniques and sophisticated software tools.

The authors of [12] developed a model based on Colored Petri Nets (CPN), the Interactive Business Process Fusion (IBPF) net, which is suited to identifying logical vulnerabilities in e-business processes during the design phase. Because the analysis methods for IBPF nets still needed innovation, they used dynamic slicing techniques to analyze the IBPF net as a method for revealing logical vulnerabilities. Their slicing algorithms produce backward slices, partial forward slices, and bidirectional slices, and these three types of slices are then merged to form the final dynamic slice. This technique is more targeted than examining the entire IBPF net: it simplifies the analysis process and prevents state space explosion. The results of this research are valuable for enhancing system reliability, reducing maintenance costs, and providing analysis techniques in the field of e-business security.

The authors of [13] investigate how to leverage Model Learning (ML) algorithms for the automated discovery of deterministic finite state automata (DFAs) from event logs. DFAs can be used as a fundamental building block to support not only the development of process analysis techniques but also the implementation of instruments that support other phases of the Business Process Management (BPM) lifecycle, such as business process design and enactment. The quality of the discovered DFAs is assessed with customized definitions of fitness, precision, and generalization, and a standard notion of DFA simplicity. Finally, they used these metrics to benchmark ML algorithms against real-life and synthetically generated datasets, with the aim of studying their performance and investigating their suitability for the development of BPM tools.

The authors of [14] propose a novel semantic-based e-business contract model, Simple Natural Contract (SimNC), to represent a universal contract created from a Supervised Sentence Contract (SSC), which is entered via a Semantic Input Method (SIM) with strict grammar from a human-understandable natural language contract. The SSC is then analyzed through Machine Natural Language (MNL) to enhance the semantic understanding of the contract by enabling case grammar for cross-language parties. In doing so, SimNC analyzes various deontic components and combines them with the operational aspects of a legal contract to achieve a common and better understanding between hard code and natural language. In addition, they apply SimNC to a Network of Timed Automata (NTA) to support automation, which builds a formal model including temporal constraints and then translates it into an executable SimNC-NTA model. This work aims to provide a bridge between natural language contracts and e-business contracts, making them universal and intelligible.

In [15], Pi-calculus is chosen as the means of modeling and analyzing cross-organizational business processes. On the basis of Pi-calculus, a deadlock verification method for processes is proposed, and the formal descriptions of several typical reduction rules are presented. Finally, a case study is presented, and the result shows that the proposed method can achieve deadlock detection for large-scale and complicated cross-organizational business processes.

III. CASE STUDY: CAR ASSEMBLY LINE

The car assembly line is selected as the case study. This part introduces the system components, defines them, and models their business processes using UML. UPPAAL is then used for the specification and formal verification of the system: the timed automata and their synchronization are elaborated, all possible paths are defined, and CTL is then used to check whether the properties of the timed automata are true or false. Although this study takes a long time, it brings many benefits. This part also defines all the actions and components that are needed, taking into account the possible sources of risk, the robots, and the safety system that protects each stage.

Analysis

A car assembly line is a manufacturing process, often called progressive assembly, in which parts (usually interchangeable parts) are added in sequence as the semi-finished assembly moves from workstation to workstation, until the final assembly is produced. However, these systems are considered critical in terms of time (synchronization), risk (robots, automatic arms), and cost (expensive maintenance). It is therefore necessary to simulate the behavior of the whole system before deploying the real hardware systems.

Car assembly line definition: A car assembly line contains two principal business processes (BP): robotic arms and supply chain [4]:

• Robotic arms: machines that are programmed to execute a specific task or job quickly, efficiently, and extremely accurately. Generally motor-driven, they are most often used for the rapid, consistent performance of heavy and/or highly repetitive procedures over extended periods and are especially valued in the industrial production, manufacturing, machining, and assembly sectors.

• Supply chain: operates on three levels: strategic, tactical, and operational. While the strategic level is generally about improving network resources, such as network design, location, and the number of facilities, tactical decisions deal with the mid-term, including production levels in all factories, assembly policy, inventory levels, and lot sizes.

Motivations: A car assembly line has been chosen as the case study of a modernized information system for the following reasons:

- For the first time, a car assembly application has been built with a view to increasing production.
- Faults can be found quickly, without wasting time.
- Its business processes are easy to distinguish, which leads to a better understanding of how to model them.
- The work is clearly defined, which makes it easy to synchronize business process components according to time.
- Its characteristics are the most suitable for our modeling approach.

Design

This part provides the system modeling using UML diagrams. Fig. 1 shows the use case diagram of a car assembly line, which represents the functionalities available to the different actors: the technician and his relationship with the robotic arms and the supply chain.

Fig. 1. Use case diagram

The roles of the main actors and the use cases associated with each actor are presented in the following table:

TABLE I
ACTOR AND USE CASES DESCRIPTION

Formal specifications

This part studies the system specification by a system of

transitions (timed automaton).

Formal specification software tools: UPPAAL [6] is an integrated tool environment for the modeling, validation, and verification of real-time systems modeled as networks of timed automata, extended with data types (bounded integers, arrays, etc.). It was jointly developed by the universities of Uppsala (Sweden) and Aalborg (Denmark). It supports the analysis of networks of timed automata that communicate through binary synchronization channels and broadcast channels. The automata are extended with integer variables, clock arrays, urgency, and so on. Transitions manipulate two kinds of variables: clocks, which evolve synchronously over time, and bounded discrete variables. A state of an automaton may carry a condition on the clocks, called an invariant, which must hold for as long as the automaton stays in that state. Each transition of an automaton is labeled with:

- A guard, which expresses a condition on the values of the variables (true by default). This condition should generally be compatible with the invariant of the source state of the transition, and it must be satisfied for the transition to be taken.
- A synchronization of the form ! or ? on a channel; the absence of a synchronization indicates an internal action of the automaton.
- Resets of some clocks and updates of some integer variables.
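The ingredients listed above (invariant, guard, synchronization label, clock reset) can be pictured with a small data structure. The sketch below is illustrative only and is not UPPAAL's implementation: the class names, the state names `s1` and `s2`, and the bound of 4 time units are assumptions made for the example, and invariants are assumed to be simple upper bounds so that checking them at the end of a delay is sufficient.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Clocks = Dict[str, float]


def always_true(clocks: Clocks) -> bool:
    """Default guard and invariant: no constraint."""
    return True


@dataclass
class Edge:
    source: str
    target: str
    guard: Callable[[Clocks], bool] = always_true  # true by default
    sync: Optional[str] = None                     # None means internal action
    resets: List[str] = field(default_factory=list)


class TimedAutomaton:
    def __init__(self, initial, invariants, edges, clocks):
        self.location = initial
        self.invariants = invariants  # state name -> predicate over clocks
        self.edges = edges
        self.clocks = {name: 0.0 for name in clocks}

    def _invariant_holds(self, location, clocks):
        return self.invariants.get(location, always_true)(clocks)

    def delay(self, amount):
        """Let time pass; refuse if the current invariant would be violated."""
        advanced = {name: value + amount for name, value in self.clocks.items()}
        if not self._invariant_holds(self.location, advanced):
            return False
        self.clocks = advanced
        return True

    def enabled(self):
        """Edges leaving the current state whose guard is satisfied."""
        return [e for e in self.edges
                if e.source == self.location and e.guard(self.clocks)]

    def fire(self, edge):
        """Take an enabled edge: reset its clocks and move to the target."""
        if not any(e is edge for e in self.enabled()):
            raise ValueError("edge is not enabled")
        updated = dict(self.clocks)
        for name in edge.resets:
            updated[name] = 0.0
        if not self._invariant_holds(edge.target, updated):
            raise ValueError("target invariant would be violated")
        self.clocks = updated
        self.location = edge.target


# Hypothetical automaton: stay in s1 at most 4 time units, then move to s2.
arm = TimedAutomaton(
    initial="s1",
    invariants={"s1": lambda c: c["x"] <= 4},
    edges=[Edge("s1", "s2", guard=lambda c: c["x"] >= 4, resets=["x"])],
    clocks=["x"],
)
```

With this automaton, the invariant forces the system to leave `s1` no later than time 4, while the guard prevents it from leaving earlier, so the edge can only be taken at exactly 4 time units.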

Formal description: The system modeled in this study is a simplified assembly line. It includes the following three synchronized processes:

• Assembly line: this system contains two subsystems, the robotic arms and the supply chain, and ensures the synchronization between them.

• Robotic arm: this system starts from the iron plates that are needed and, in the following steps, assembles the body of the car from these plates.

• Supply chain: this system moves the chain from one step to the next.

Declaration of variables and system assembly line by UPPAAL: The UPPAAL tool is made up of three main parts:

- A graphical editor, where timed processes can be described.
- A graphical simulator, which gives a view of the behavior of the system.
- A checker, which verifies the different properties.

The editor itself is made up of two parts:

Declaration: contains integer variables, clocks, synchronization channels, and constants.

    chan move_bras, pren_ves, tach_ves, rutilise, fin_bras, stop_bras, stop_chain, move_chain, fin_chain, remove_chain, remove_bras;
    clock x;
    bool mrc;

System declaration: contains the processes.

    // Place template instantiations here.
    brasd = bras();
    chemad = chem();
    asemblylin = asemblyline();
    // List one or more processes to be composed into a system.
    system asemblylin, brasd, chemad;

Business Process Actions:

- move_bras: start the movement of the robotic arms.
- pren_ves: pick up the required installation tools.
- tach_ves: install the required tools.
- rutilise: restart the movement of the robotic arms.
- fin_bras: end the movement of the robotic arms.
- stop_bras: stop the movement of the arm for a short time.
- stop_chain: stop the movement of the chain for a short time.
- move_chain: start the movement of the chain.
- fin_chain: end the movement of the chain.
- remove_chain: restart the movement of the chain.
- remove_bras: restart the movement of the arm.
- chn_move: counter of the movements of the supply chain.
- position: counter of the movements of the robotic arm.
- ? : the action is synchronized with another one; the subsystem has to wait for another subsystem to trigger the action.
- ! : the action is triggered by this part of the system.
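The meaning of the ! and ? operators can be sketched as a handshake: a sending edge and a receiving edge on the same channel fire together, and a receiver alone cannot move. The code below is a simplified, untimed illustration rather than UPPAAL's semantics in full; the two small processes and their state names are hypothetical, and channel names are written with underscores.

```python
def successors(state, processes):
    """Compute the synchronized successors of a global state.

    `state` is a tuple with the current state of each process.
    Each process maps a state name to a list of (label, target) edges,
    where label is None (internal action), "c!" (send), or "c?" (receive).
    """
    result = []
    for i, proc in enumerate(processes):
        for label, target in proc.get(state[i], []):
            if label is None:
                # Internal action: this process moves alone.
                nxt = list(state)
                nxt[i] = target
                result.append((None, tuple(nxt)))
            elif label.endswith("!"):
                channel = label[:-1]
                # A sender needs a matching receiver in another process.
                for j, other in enumerate(processes):
                    if j == i:
                        continue
                    for label2, target2 in other.get(state[j], []):
                        if label2 == channel + "?":
                            nxt = list(state)
                            nxt[i] = target
                            nxt[j] = target2
                            result.append((channel, tuple(nxt)))
    return result


# Hypothetical processes: the line tells the arm to start, then waits for it.
line = {
    "start": [("move_bras!", "wait")],
    "wait": [("fin_bras?", "start")],
}
arm_proc = {
    "idle": [("move_bras?", "moving")],
    "moving": [("fin_bras!", "idle")],
}
```

For example, `successors(("start", "idle"), [line, arm_proc])` yields a single joint step on `move_bras`, after which both processes have changed state at once.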

Formal model specification: The assembly line system is made up of three processes that synchronize with each other: the assembly line, the robotic arms, and the supply chain. They are modeled as finite-state automata as follows:

• Business process assembly line: this process models the assembly line system; it contains 10 states and is synchronized with the two other subsystems, the robotic arm and the supply chain (Fig. 2).

Fig. 2. Business process models of assembly line

• Business process robotic arms: this process models the robotic arms system; it contains 6 states and is synchronized with the assembly line subsystem (Fig. 3).

Fig. 3. Business process models of robotic arm

• Business process supply chain: this process models the supply chain system; it contains 6 states and is synchronized with the assembly line subsystem (Fig. 4).

Fig. 4. Business process models of supply chain

Business Process Synchronization: The three processes (assembly line, robot arms, and supply chain) synchronize with each other. For example, in the synchronization between "robot arms" and "supply chain", "stop_bras" sends an action and "delete_bras" receives this action in order to check whether it has been carried out.

Business Process Guards: Guards express conditions on clock variables and other variables that must be met for a transition to be taken. Formally, a guard is a combination of time constraints and constraints on integer variables. For example, in the "chain" subsystem, there is a guard (vair-true) between states e1 and e2.

Business Process Reset operation: Resetting a clock or a variable on a transition reinitializes its value. For example, in the "bras" subsystem, on the transition between s1 and s2, after the 4 minutes (in the "chain" process), the clock is reset to zero in order to measure the time during which the bras valve stays closed before the system restarts.

Verification and validation of formal system modeled: The purpose of verification is to ensure that a program satisfies a set of required properties. Model checking is an automatic formal verification technique, for which it is necessary to formally model the behavior of the system. The temporal logic CTL is used as the property specification language. Once the system is described by a transition system and the required properties are specified in the temporal logic, the model checking algorithm automatically answers the question, "Does the system satisfy the required properties?". We have written the formal specification of the assembly line in the CTL temporal logic, as shown in Fig. 5.

Using the model checking algorithm as a formal verification technique to prove the safety of this formal specification of the assembly line system under the temporal logic CTL, the satisfaction of the requested properties is proved.

Fig. 5. Formal specification of assembly line CTL
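The question answered by the model checker can be illustrated on an explicit, untimed state graph. The sketch below is an assumption-laden illustration, not the UPPAAL algorithm: it ignores clocks, the state graph and property are hypothetical, and it covers only the two query shapes written `A[] p` (p holds in every reachable state) and `E<> p` (some reachable state satisfies p) in UPPAAL's query language.

```python
from collections import deque


def reachable(initial, transitions):
    """Breadth-first exploration of all states reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_always(initial, transitions, prop):
    """A[] prop: the property holds in every reachable state (safety)."""
    return all(prop(s) for s in reachable(initial, transitions))


def check_possibly(initial, transitions, prop):
    """E<> prop: at least one reachable state satisfies the property."""
    return any(prop(s) for s in reachable(initial, transitions))


# Hypothetical global states: (chain state, arm state).
graph = {
    ("moving", "idle"): [("stopped", "idle")],
    ("stopped", "idle"): [("stopped", "working")],
    ("stopped", "working"): [("stopped", "idle"), ("moving", "idle")],
}


def arm_works_only_when_chain_stopped(state):
    chain, arm = state
    return arm != "working" or chain == "stopped"
```

Here `check_always(("moving", "idle"), graph, arm_works_only_when_chain_stopped)` evaluates a safety property of the kind one would expect for an assembly line, namely that the arm never works while the chain is moving.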

In addition, UPPAAL is used as a simulator to validate the behavior of this system, in order to reveal problems such as infinite loops or blocking, and thus to validate the system. Formal methods involve the application of mathematical techniques to the design and implementation of software. A formal specification is expressed in a language whose syntax and semantics are formally defined. Model-based specification uses tools from set theory, functions, and logic to develop abstract models of the system.
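Blocking of the kind mentioned above corresponds to a reachable state with no outgoing transition. The following sketch searches for such a state and returns the path leading to it, similar in spirit to a counterexample trace; it is an untimed simplification with hypothetical state names, not the procedure UPPAAL applies.

```python
from collections import deque


def find_deadlock(initial, transitions):
    """Return the shortest path to a state without successors, or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        nxt_states = transitions.get(state, [])
        if not nxt_states:
            # Rebuild the path from the initial state to the blocked state.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in nxt_states:
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None
```

A result of `None` means that every reachable state can still make a step, which is the situation one hopes to observe for the assembly line model.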

Implementation

This part is dedicated to the implementation of the code using the Java programming language [9] and the development environments used to build our application, which simulates the car assembly line system. Using information from the modeling, it covers the steps of code generation, moving from the semi-formal models (UML) and formal paradigms (CTL) already developed towards programming, according to the specificities of each type of model. It also presents the different scenarios of this simulation system through multiple graphical user interfaces (GUI).

To allow the manager to operate all of the system's settings, Fig. 6 presents the parameter-setting interface. Fig. 7 presents a line chart interface of the factory production rate, and Fig. 8 presents a report interface of the factory production rate.

Fig. 6. Interface of parameter setting

Using formal specification and validation on this system, we observed several benefits before implementation:

- A bug-free system can be designed before it is built in reality.
- If there are errors, the defect can be found in a short time.
- All components and supplies (methods, attributes, conditions, etc.) are selected in a well-defined way.
- The specification takes a long time, but this study enables us to build the system in a short time.
- Services are distributed, tasks are clarified, and interfaces are specified.

Fig. 7. Interface line chart of factory production rate

Fig. 8. Interface report of factory production rate

IV. CONCLUSION

The aim of this work was to simulate and formally specify a critical synchronous system as a collection of business processes. We applied our contributions to a case study of an automobile assembly and manufacturing plant, because it offers a suitable set of internal and external business processes for building a system that ensures quality in vehicle assembly. Three life cycle engineering phases are followed in this work.

First, independently of the programming language, the design phase uses semi-formal models such as UML, which combine structural and behavioral descriptions, to understand the problem and to represent and model objects using the following diagrams: a use case diagram that shows our system's functionality from various perspectives (technician/administrator); a class diagram that illustrates the static structure of our system and aids in the database coding phase; and finally, the sequence diagram and the activity diagram, each of which depicts the dynamics of our business processes, whether private or public.

Second, the specification step uses the UPPAAL tool, which enables the creation, validation, verification, and simulation of synchronized timed finite-state automata. It covers the formal description of the supply chain, robotic arm, and assembly line as synchronized timed automata; the declaration in UPPAAL of the variables, the business process actions, and the assembly line system; the formal verification of the properties of the modeled system using CTL model checking; the analysis of the benefits of formal specification; and the validation of the formal model by simulation.

Third, the implementation phase uses several programming languages to build the simulation application for the public business processes of the car assembly line, and ensures consistency with the previously created diagrams and requirements.

In future work, we intend to enhance this project by adding more features, creating an Ubuntu version, and making the application widely accessible online.

REFERENCES

[1] Weske, M., "Business Process Management: Concepts, Languages, and Architecture", 3rd ed., Springer-Verlag GmbH Germany, pp. 3, 2019.

[2] Hammer, M., "What is business process management?", in vom Brocke, J. and Rosemann, M. (Eds.), Handbook on Business Process Management, Introduction, Methods, and Information Systems, 2nd ed., Springer-Verlag Berlin Heidelberg, Cambridge, USA, pp. 6, 3, 2015.

[3] Rosemann, M., "An Exploration into Future Business Process Management Capabilities in View of Digitalization", Georgi Dimov Kerpedzhiev.

[4] Assembly line website: https://www.inboundlogistics.com/articles/assembly-line/, accessed 22/12/2024.

[5] UML website: https://www.uml.org/, accessed 03/12/2024.

[6] UPPAAL website: https://uppaal.org/, accessed 20/12/2024.

[7] Stefan, S., "Model Checking Concepts", course slides, ENSIIE, 2024.

[8] Massimo, B., Laura, B., Fabio, M., "Full Characterisation of Extended CTL*", Università di Napoli Federico II, 31st International Symposium on Temporal Representation and Reasoning (TIME 2024).

[9] Java website: https://www.java.com/, accessed 22/12/2024.

[10] Mohammedn, N. L., Nabil, H., Ramdane, M., "Resources consumption analysis of business process services in cloud computing using Petri Net", Journal of King Saud University - Computer and Information Sciences, pp. 408.

[11] van der Aalst, W.M.P., van Hee, K.M., "Business process redesign: A Petri-net-based approach", Computers in Industry, Volume 29, Issues 1-2, July 1996, pp. 15.

[12] Wangyang, Y., Jie, F., Lu, L., Xiaojun, Z., and Yumeng, C., "Enhancing security in e-business processes: Utilizing dynamic slicing of Colored Petri Nets for logical vulnerability detection", Future Generation Computer Systems, Volume 158, September 2024, pp. 210.

[13] Simone, A., Francesco, C., Fabrizio, M.M., Andrea, M., Fabio, P., "Process mining meets model learning: Discovering deterministic finite state automata from event logs for business process analysis", Information Systems, Volume 114, March 2023, pp. 1.

[14] Peng, Q., Quanyi, H., Menglin, C., "Towards machine-readable semantic-based E-business contract representations using Network of Timed Automata (NTA)", Future Generation Computer Systems, Volume 158, September 2024, pp. 457.

[15] Xin, Y., Xinghua, B., Chao, Z., "The reduction and deadlock detection of cross-organizational business process based on Pi-calculus", Procedia Engineering, Volume 15, 2011, pp. 3487.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

An Empirical Study on the Effectiveness and

Efficiency of Machine Learning Classifiers for

Liver Disease Prediction

Mohamed Amine NEMMICH 1, Asmaa BOUDALI 2, Noureddine

BOUKHARI 3, Fatima DEBBAT 4

1Department of Computer Science, Mathematics Laboratory, Djillali Liabes University of Sidi Bel Abbes, Sidi

Bel Abbes, Algeria

1amine.nemmich@univ-sba.dz

2Department of Electronics, Laboratory of coding and security of information, Sciences and Technology

University of Oran Mohamed Boudiaf, Oran, Algeria

2asmaa.boudali@univ-usto.dz

3Department of Mathematics, Djillali Liabes University of Sidi Bel Abbes, Sidi Bel

Abbes, Algeria

3noureddine.boukhari@univ-sba.dz

4Department of Computer Science, Mustapha Stambouli University of Mascara,

Mascara, Algeria

4debbat.fatima@univ-mascara.dz

Abstract— Liver disease poses a significant global health burden,

with high mortality rates exacerbated by challenges in early

detection. Machine learning (ML) offers promising avenues for

developing automated diagnostic tools to address this critical

need. While various ML classifiers have been explored for liver

disease prediction, a comprehensive, systematic comparison of a

wide range of modern algorithms, incorporating robust

preprocessing, handling of class imbalance, hyperparameter

tuning with cross-validation, and analysis of computational

efficiency, is essential to guide the selection of models for practical

application. This study systematically evaluates thirteen diverse

ML classification algorithms using the Liver Patient Dataset

(LDPD). The methodology includes data preprocessing with

imputation, encoding, and standardization within a pipeline to

prevent data leakage, handling class imbalance using SMOTE,

splitting data into training and testing sets, and employing

RandomizedSearchCV with Stratified K-Fold cross-validation for

hyperparameter optimization. Performance was assessed using

key metrics including Accuracy, Precision, Recall, Specificity, F1-

Score, and ROC AUC on an independent test set, alongside

training time. Results demonstrate that ensemble and advanced

tree-based methods achieve superior predictive performance.

Hyperparameter tuning further optimized performance, with

Tuned Random Forest achieving the highest ROC AUC (0.9995)

and Specificity (0.9973), and Tuned LightGBM achieving the

highest Recall (0.9996). The study highlights a crucial trade-off:

while tuning yields peak performance, default configurations of

efficient models like LightGBM and XGBoost offer exceptionally

high performance (ROC AUC ≥ 0.9993) combined with

significantly faster training times (≤ 0.41 seconds), providing a

favorable balance for practical application. This research

identifies highly effective and efficient ML models for liver disease

prediction, contributing empirical evidence to support the

development of automated diagnostic aids.

Keywords— Liver Disease Prediction, Machine Learning

Classification, Class Imbalance, Hyperparameter Tuning,

Ensemble Methods.

I. INTRODUCTION

Liver disease represents a significant global health

challenge, contributing to substantial morbidity and mortality

worldwide. As highlighted by recent data, the burden of liver

disease is particularly acute in regions like India, where

264,193 deaths were reported in 2018, corresponding to an age-

adjusted death rate of approximately 23.00 per 100,000

population [1]. The liver, a vital organ responsible for

detoxification and numerous metabolic functions, is susceptible

to damage from various etiologies, including viral infections,

metabolic disorders, excessive alcohol consumption, and

genetic factors [2, 4]. While conditions like cirrhosis and liver

failure represent advanced stages, early detection of liver

damage is often challenging due to its insidious progression and

non-specific initial symptoms [4]. This delayed identification

can severely limit therapeutic options and negatively impact

patient outcomes, underscoring the critical need for timely and

accurate diagnostic tools to facilitate early intervention and

improve prognosis [3].

The growing availability of health data and advancements in

computational capabilities have positioned machine learning

(ML) as a powerful paradigm for enhancing medical diagnosis

and prognosis [4]. Classification techniques, in particular, have

shown promise in developing automated tools for identifying

various diseases based on patient data. In the context of liver

disease, ML algorithms have been explored for tasks such as

classifying liver fibrosis stages, predicting patient survival, and

distinguishing between different liver conditions [4]. However,

the landscape of ML applications in liver disease prediction is

continuously evolving. While numerous studies have

investigated various algorithms, there remains a need for

comprehensive, head-to-head comparisons of a wide array of

modern and diverse ML classifiers on relevant datasets.

Furthermore, the impact of critical steps like systematic data

preprocessing, effective handling of class imbalance, and

rigorous hyperparameter tuning on the performance of these

models for liver disease prediction warrants further

investigation to identify the most robust and reliable

approaches for potential clinical application.

This study aims to address these gaps by conducting a

systematic and comprehensive evaluation of multiple machine

learning classification algorithms for liver disease prediction

using a publicly available dataset. The primary objectives are:

(1) to benchmark the performance of a diverse set of ML

classifiers; (2) to identify the most effective and efficient

models for this prediction task based on a thorough analysis of

various performance metrics, including those crucial in medical

diagnosis such as Recall and Specificity, alongside overall

discrimination ability (ROC AUC) and computational

efficiency (training time). The rationale behind this research is

to provide a data-driven comparison to guide the selection of

suitable ML models for developing automated liver disease

screening or diagnostic support tools. This work contributes to

the field by offering a detailed comparative analysis of

numerous algorithms, demonstrating the practical impact of

different techniques, and highlighting the trade-offs between

model performance and efficiency in the context of liver

disease prediction.

The implemented methodology involves standard data

preprocessing techniques, addressing class imbalance using

SMOTE, splitting the data into training and testing sets,

training and evaluating a broad range of classifiers, and

conducting a staged performance comparison of both the default

and the tuned models.

The remainder of this paper is organized as follows: Section

2 presents a review of the existing literature on machine

learning applications in liver disease classification and

detection. Section 3 provides a detailed explanation of the

dataset, the proposed architecture, the algorithms utilized, and

the preprocessing steps. Section 4 describes the experimental

setup and presents the evaluation results. Section 5 concludes

the paper and outlines potential directions for future work.

II. LITERATURE REVIEW

This section reviews existing research on applying machine

learning classification techniques for liver disease prediction

and diagnosis, focusing on commonly used algorithms,

datasets, and key findings to establish the context for this study.

Machine learning models such as Support Vector Machines

(SVM), Logistic Regression, Naïve Bayes, Decision Trees

(DT), Random Forest, K-Nearest Neighbors (KNN), and

Artificial Neural Networks (ANN), along with various boosting

algorithms, have been widely applied to classify liver diseases

[5]. Comparative studies on datasets like the Andhra Pradesh

(AP), UCLA, UCI, and Indian Liver Patient Dataset (ILPD)

show varied results regarding the best-performing algorithms.

Some studies found KNN, backpropagation networks (a type of

ANN), and SVM to be effective [5], while others highlighted

Decision Trees [7, 8], C4.5 [9, 14], ANN [13], or Bayesian

networks [12] as top performers in specific comparisons or on

particular datasets. The influence of the dataset itself on model

performance has also been noted [5, 6].

Researchers have also explored specific techniques and

algorithms. Studies have compared models like SVM and

backpropagation [11], focused on predicting specific conditions like

fibrosis [8, 10] or fatty liver disease [12], and investigated the

utility of risk factors [17]. Techniques such as feature selection

[6, 15, 16] and data normalization [15] have been incorporated

to improve model performance. While some work has focused

on single algorithms with preprocessing and tuning [16], the

diverse findings across studies using different methodologies

and datasets underscore the complexity of the problem and the

lack of a universally agreed-upon optimal approach.

Despite the extensive research, a key gap in the literature is

the need for comprehensive, systematic comparisons of a wide

range of modern machine learning classifiers evaluated under a

consistent and rigorous methodology. Many studies focus on a

limited set of algorithms or lack detailed consideration of

crucial steps like robust preprocessing, handling class

imbalance (although SMOTE is used in some implementations,

its systematic evaluation across models is needed), and the

impact of these steps across a broad range of models. Furthermore, a thorough analysis that

considers not only predictive performance metrics but also

practical factors like computational efficiency (training time) is

often missing but essential for real-world application.

This study aims to address these gaps by providing a

comprehensive and systematic evaluation of a wide array of

machine learning classifiers. By employing a consistent

methodology, including robust preprocessing pipelines and

SMOTE for imbalance handling, and evaluating models across

a standard set of performance metrics including training time,

this research offers a valuable comparative analysis to identify

effective and efficient models for liver disease prediction,

contributing empirical evidence to the field.

III. RESEARCH METHODOLOGY

This study adopted a systematic machine learning workflow

to develop and evaluate predictive models for the classification

of liver disease. The comprehensive methodology encompasses

data acquisition, rigorous preprocessing, strategies for handling

class imbalance, model training, hyperparameter tuning, and a

comprehensive performance evaluation process. The specific

steps are elaborated in the following subsections.

A. Data Acquisition and Initial Inspection

The initial phase involved the acquisition of the dataset,

identified as the Liver Patient Dataset (LDPD), which contains

patient-specific information and related medical parameters

relevant to liver disease diagnosis. The fundamental

characteristics of the dataset, including its demographic scope,

total number of records, and distribution across liver patient and

non-liver patient categories, as well as gender distribution, are

summarized in Table 1. The dataset comprises ten predictor

variables and one target variable. The predictor variables

encompass demographic information (age, gender) and various

biochemical markers related to liver function (Total Bilirubin,

Direct Bilirubin, Alkaline Phosphatase, Alanine

Aminotransferase (SGPT), Aspartate Aminotransferase

(SGOT), Total Proteins, Albumin, and Albumin-to-Globulin

Ratio). The target variable indicates the diagnosis as either

'Liver Patient' or 'Non Liver Patient', expert-labeled to facilitate

supervised learning. Detailed information regarding each

attribute, including measurement units, value ranges, means,

and standard deviations, is provided in Table 2.

TABLE 1. LDPD DATASET DESCRIPTION.

| Demography | Total records | Liver patients | Not liver patients | Male | Female |
|---|---|---|---|---|---|
| Worldwide liver patients | 30691 | 21917 | 8774 | 21986 | 7803 |

TABLE 2. ATTRIBUTES’ INFORMATION OF DATASET.

| Attribute | Measurement unit | Value range | Mean | Std |
|---|---|---|---|---|
| Age (AG) | Years | 4–90 | 44.107 | 15.981 |
| Gender (GN) | Categorical | 0 or 1 | 0.775 | 0.483 |
| Total bilirubin (TB) | mg/dl | 0.4–75 | 3.370 | 6.256 |
| Direct bilirubin (DB) | mg/dl | 0.1–19.7 | 1.528 | 2.870 |
| Alkaline phosphatase (AP) | U/L | 63–2110 | 289.075 | 238.538 |
| Alanine aminotransferase (ALA) | U/L | 10–2000 | 81.489 | 182.159 |
| Aspartate aminotransferase (ASA) | U/L | 10–4929 | 111.470 | 280.851 |
| Total proteins (TP) | g/dl | 2.7–9.6 | 6.480 | 1.082 |
| Albumin (AL) | g/dl | 0.9–5.5 | 3.130 | 0.792 |
| Albumin and globulin ratio (AGR) | g/dl | 0.3–2.8 | 0.943 | 0.323 |
| Liver disease or not (LD or NLD) | Categorical | 0 or 1 | 0.286 | 0.452 |

Upon loading the data into a structured format, a preliminary

inspection was conducted to ascertain the dataset's overall

structure and identify variable types (numerical and

categorical). Basic descriptive statistics were reviewed to

understand the distribution and central tendencies of the

attributes. To gain deeper insights into data distribution patterns

and the relationships between variables, particularly

concerning the target variable, visual exploratory data analysis

(EDA) techniques were employed, including the generation of

histograms for individual attributes and pair plots to visualize

attribute distributions and their relationships with the liver

disease outcome. A critical assessment was also performed to

identify the presence and extent of missing values across

different features, which is a necessary precursor to data

cleaning. Ensuring data quality by addressing such

redundancies and inconsistencies, including the identification

and potential handling of duplicate instances, is essential for

improving the efficiency and reliability of subsequent

modeling. Initial steps also involved recognizing the need to

convert the categorical 'Gender' feature into a numerical format

suitable for machine learning algorithms, which was performed

through data encoding in a subsequent preprocessing step.

B. Data Preprocessing

Data preprocessing constituted a crucial stage focused on

transforming the raw data into a clean, consistent, and

numerically compatible format for machine learning, while

strictly adhering to principles that prevent data leakage. This

stage involved several key procedures. Missing values,

identified during the initial inspection (the counts of which are

detailed in Table 3), were handled through Imputation.

Specifically, a Median Imputation strategy was applied to

numerical features, replacing missing entries with the median

value calculated solely from the training data subset to avoid

test set influence. For the categorical 'Gender' feature, Mode

Imputation was utilized to fill missing values with the most

frequent category observed in the training subset. Categorical

features, such as 'Gender', were converted into a numerical

representation through One-Hot Encoding, creating binary

indicator variables to ensure no ordinal relationship was

incorrectly imposed. Furthermore, numerical features, which

often exhibit widely varying scales, were subjected to

Standardization (Z-score scaling). This technique transforms

features to have a mean of zero and a standard deviation of one,

placing them on a common scale. The Z-score method was also

used to address the significant outliers observed in certain

attributes, limiting their disproportionate impact. Feature

Scaling is a fundamental step

for algorithms sensitive to feature magnitudes, ensuring that no

single feature dominates the learning process, regardless of its

original unit or range.
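As a worked illustration of the Z-score transformation described above, the following applies it to the maximum total bilirubin value using the mean and standard deviation reported in Table 2. These statistics describe the whole dataset; in the actual workflow they would be estimated from the training subset only, so the numbers are purely illustrative.

```latex
z = \frac{x - \mu}{\sigma}, \qquad
z_{\mathrm{TB}}(75) = \frac{75 - 3.370}{6.256} \approx 11.45
```

Under these assumed statistics, the largest reported total bilirubin value lies about 11.45 standard deviations above the mean after scaling.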

All these preprocessing steps—imputation, encoding, and

scaling—were encapsulated within a Preprocessing Pipeline

using scikit-learn's Pipeline and ColumnTransformer classes.

This design ensures that all fitting of

preprocessing parameters occurs exclusively on the training

data, and these learned parameters are then applied consistently

to transform both the training and independent test sets,

rigorously preventing data leakage.
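The fit-on-train, apply-to-both principle can be sketched in a few lines of plain Python. This is a minimal illustration for a single numeric column and not the scikit-learn pipeline used in the study; the class name `NumericPreprocessor` and the sample values are hypothetical. The population standard deviation is used, on the assumption that this matches the behaviour of a standard Z-score scaler.

```python
from statistics import mean, median, pstdev


class NumericPreprocessor:
    """Median imputation followed by Z-score scaling for one numeric column."""

    def fit(self, train_values):
        # Every statistic is learned from the training values only.
        observed = [v for v in train_values if v is not None]
        self.median_ = median(observed)
        imputed = [self.median_ if v is None else v for v in train_values]
        self.mean_ = mean(imputed)
        self.std_ = pstdev(imputed) or 1.0  # guard against constant columns
        return self

    def transform(self, values):
        # Reuse the stored training statistics; nothing is re-estimated here.
        filled = [self.median_ if v is None else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]


# Hypothetical total bilirubin readings; None marks a missing entry.
train_tb = [0.7, 1.1, None, 0.9, 7.3]
test_tb = [None, 2.0]

prep = NumericPreprocessor().fit(train_tb)
train_scaled = prep.transform(train_tb)
test_scaled = prep.transform(test_tb)
```

A missing test value is filled with the training median and scaled with the training mean and standard deviation, so no information from the test set enters the fitted parameters.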

TABLE 3. NO OF MISSING VALUES IN THE DATASET.

ALA

ASA

AGR

278

648

561

796

739

859

463

494

559

C. Handling Class Imbalance

The dataset utilized in this study exhibited a notable

imbalance in the distribution of the target variable, with a

higher prevalence of the positive class (Liver Patient).

Addressing this inherent class imbalance was a critical step to

mitigate potential model bias towards the majority class. This

was achieved through Oversampling of the minority class.

Specifically, the Synthetic Minority Over-sampling Technique

(SMOTE) was applied to the pre-processed training data.

SMOTE is a synthetic oversampling algorithm that generates

artificial instances of the minority class by interpolating

between existing minority samples and their k-nearest

neighbours in the feature space. This process, applied only to

the pre-processed training data, resulted in a training dataset

with a more balanced class distribution, thereby enabling the

subsequent models to learn the characteristics of the minority

class more effectively. The independent test set was kept in its

original class distribution to ensure performance evaluation

reflected real-world scenarios.
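The interpolation step at the core of SMOTE can be sketched as follows. This is a simplified illustration written for this text, not the library implementation used in the study; the function name `smote_sketch`, the toy points, and the choice of k are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

With k = 2, each corner's nearest neighbours are its two adjacent corners, so every synthetic point falls on an edge of the square. In the workflow described above, such a procedure would be run on the training data only.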

D. Model Training and Evaluation

To establish a baseline performance and identify algorithms

with high potential, a diverse suite of thirteen machine learning

classification models was initially selected and trained using

their default hyperparameters. These models were chosen to

represent a broad spectrum of theoretical approaches to

classification, encompassing Generalized Linear Modeling

(Logistic Regression), Instance-Based Learning (K-Nearest

Neighbors), Decision Tree Learning, various Ensemble

Methods based on Bagging (Random Forest, Extra Trees) and

Boosting (Gradient Boosting Machines, XGBoost, LightGBM,

AdaBoost, CatBoost), a Kernel Method (Support Vector

Machine with an RBF kernel), a Probabilistic Model (Gaussian

Naïve Bayes), and an Artificial Neural Network (Multi-Layer

Perceptron). Each selected model underwent Model Training

by being fitted to the SMOTE-resampled and preprocessed

training data. The diversity in algorithm selection was

intentional, designed to enrich the comparative study by

evaluating models with distinct underlying mechanisms and

potential strengths in capturing different patterns within the

data. Following training, each model's performance was

evaluated on the independent preprocessed test dataset.

E. Hyperparameter Tuning

Following the initial evaluation of models with default

parameters, hyperparameter tuning was performed on a subset

of the most promising models to further optimize their

performance. This process utilized RandomizedSearchCV, a

robust technique for efficiently searching a predefined

hyperparameter space. To ensure a reliable estimate of

performance during tuning and mitigate the risk of overfitting

to a single validation set, Stratified K-Fold cross-validation was

employed with 5 splits (k=5). Stratification ensured that each

fold maintained a representative distribution of the target

classes. The optimization criterion for RandomizedSearchCV

was the ROC AUC score, which is a suitable metric for

evaluating classifier performance on imbalanced datasets by

assessing the model's ability to discriminate between positive

and negative classes across various thresholds. The tuning

process involved fitting the models with various combinations

of hyperparameters sampled from specified distributions and

evaluating them using cross-validation on the resampled

training data. The best set of hyperparameters for each model

was selected based on the highest mean cross-validation ROC

AUC score.
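The logic of a randomized search scored by stratified k-fold cross-validation can be sketched in plain Python. This is an illustrative stand-in for RandomizedSearchCV, not its implementation. The search space, the helper names, and the toy scoring function (which takes the place of fitting a model and computing ROC AUC on the validation fold) are all hypothetical.

```python
import random
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds class by class, so every fold keeps
    roughly the overall class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def random_search(sample_params, cv_score, labels, n_iter=20, k=5, seed=0):
    """Return (best_params, best_mean_score) over n_iter random draws."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        scores = []
        for f in range(k):
            val_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(cv_score(params, train_idx, val_idx))
        avg = mean(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


def sample_params(rng):
    # Hypothetical search space; the study's actual grids are not shown here.
    return {"n_estimators": rng.choice([100, 200, 400]),
            "max_depth": rng.randint(2, 12)}


def toy_cv_score(params, train_idx, val_idx):
    # Stand-in for "fit on train_idx, return ROC AUC on val_idx".
    return 1.0 - abs(params["max_depth"] - 8) / 10.0


labels = [1] * 70 + [0] * 30  # toy labels with a 70/30 class ratio
best_params, best_score = random_search(sample_params, toy_cv_score, labels)
```

Each candidate is scored on all five folds and ranked by its mean score, which mirrors the selection rule described above.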

IV. EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION

The experimental evaluation was conducted on a ThinkPad

L390 laptop equipped with an Intel Core i5-8265U CPU

(1.60 GHz base frequency), 24 GB RAM, and a 256 GB SSD,

running the Windows 10 Pro 64-bit operating system. The

implementation, coding, and visualization were performed

using Python within a Jupyter Notebook environment.

A. Performance Evaluation Metrics

The performance of the developed prediction models was

assessed using a rigorous experimental protocol. The dataset

was initially divided into an 80% training set and a 20% testing

set using stratified random sampling to ensure that the

proportion of target classes was maintained in both subsets. The

confusion matrix served as the fundamental basis for

performance evaluation, providing a detailed breakdown of

classification outcomes: True Positives (TP), True Negatives

(TN), False Positives (FP), and False Negatives (FN). The

confusion matrix components for all evaluated default

algorithms are presented in Table 4.
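A stratified 80/20 split of the kind described above can be sketched as follows. The function name, the random seed, and the toy labels are assumptions made for illustration; the study's own split was presumably produced with a library routine.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with the class ratio preserved in both."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


labels = [1] * 250 + [0] * 100  # hypothetical toy labels
train_idx, test_idx = stratified_split(labels)
```

Sampling each class separately keeps the ratio of positives to negatives the same in the training and test subsets, which is the property the protocol relies on.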

Model performance was quantified using a suite of widely

accepted evaluation metrics derived from the confusion matrix.

These included Accuracy, Precision, Recall (Sensitivity), F1-

Score, Specificity, and the Area Under the Receiver Operating

Characteristic curve (ROC AUC). Table 5 shows the

calculation of each evaluation metric. In the context of medical

diagnosis, the following metrics are particularly important:

- Recall (Sensitivity): The proportion of actual positive

cases (Liver Patients) that were correctly identified. High

Recall is crucial for minimizing false negatives, which is

paramount in medical diagnosis to avoid missing true cases.

- Specificity: The proportion of actual negative cases

(Non Liver Patients) that were correctly identified. High

Specificity is important for minimizing false positives,

preventing healthy individuals from being incorrectly

diagnosed.

- F1-Score: The harmonic mean of Precision and

Recall, providing a balanced measure particularly useful for

imbalanced datasets.

- ROC AUC: An aggregate measure of the model's

ability to discriminate between positive and negative classes

across all possible classification thresholds. A higher AUC

indicates superior discriminatory power; the underlying ROC curve

traces the trade-off between True Positive Rate and False Positive Rate.

- Accuracy: The overall proportion of correctly

classified instances. While a general indicator, it is not the

primary comparison metric due to the potential for misleading

results in the presence of class imbalance.

- Precision: The proportion of instances predicted as

positive that were actually positive.

In addition to these predictive performance metrics, the

training time for each model was recorded to consider

computational efficiency. This allows for an analysis of the

trade-offs between model performance and the resources

required for training. The performance evaluation was

conducted on the independent preprocessed test dataset using

both the default models and the tuned versions of selected

classifiers.
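The text does not state how training time was measured. One common approach, shown here as an assumption, is to wrap the fitting call with a monotonic high-resolution clock; `timed_fit` and `dummy_fit` are hypothetical names.

```python
import time


def timed_fit(fit_fn, *args, **kwargs):
    """Run fit_fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_fit(n):
    # Stand-in for a call such as model.fit(X_train, y_train).
    return sum(i * i for i in range(n))


model, seconds = timed_fit(dummy_fit, 100_000)
```

Timing only the fitting call keeps preprocessing and evaluation out of the reported figure, which makes the numbers comparable across models.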

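TABLE 4. CONFUSION MATRIX.

| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| Logistic Regression | 967 | 145 | 1221 | 1541 |
| K-Nearest Neighbors | 979 | 133 | 357 | 2405 |
| Decision Tree | 1065 | 47 | 1120 | 1642 |
| Random Forest | 1105 | 7 | 8 | 2754 |
| Gradient Boosting Machines | 1050 | 62 | 490 | 2272 |
| XGBoost | 1101 | 11 | 6 | 2756 |
| LightGBM | 1103 | 9 | 10 | 2752 |
| Support Vector Machine | 1046 | 66 | 1181 | 1581 |
| Gaussian Naïve Bayes | 1069 | 43 | 1647 | 1115 |
| AdaBoost Classifier | 944 | 168 | 653 | 2109 |
| Extra Trees Classifier | 1102 | 10 | 10 | 2752 |
| CatBoost Classifier | 1094 | 18 | 25 | 2737 |
| Deep Learning | 1080 | 32 | 627 | 2135 |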

TABLE 5. PERFORMANCE EVALUATION METRICS.

| Metric | Calculation |
|---|---|
| Accuracy | (TP+TN) / (TP+TN+FP+FN) |
| Precision (P) | TP/(TP+FP) |
| Recall (R) | TP/(TP+FN) |
| F1-score | 2×(P×R) / (P+R) |
| Specificity | TN/(TN+FP) |
| ROC curve | TPR (y-axis) vs. FPR (x-axis) |
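The metric definitions can be checked numerically against the reported results. The sketch below reads the four counts in the Logistic Regression row of Table 4 as TN, FP, FN, TP. That column order is an assumption, but it reproduces the Logistic Regression values listed in Table 6. The function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the confusion-matrix metrics used in this study."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Logistic Regression row of Table 4 (assumed column order: TN, FP, FN, TP)
lr = confusion_metrics(tn=967, fp=145, fn=1221, tp=1541)
rounded = {name: round(value, 4) for name, value in lr.items()}
# rounded: accuracy 0.6474, precision 0.914, recall 0.5579,
#          f1 0.6929, specificity 0.8696
```

The same function can be applied to any other row of the confusion matrix table to cross-check the corresponding entries in the performance table.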

B. Default Model Performance

An initial evaluation was conducted by training a diverse set

of thirteen classification models using their default

hyperparameters on the SMOTE-resampled training data and

assessing their performance on the independent test set [18]. In

addition to standard performance metrics, the training time for

each model was recorded to consider computational efficiency.

Table 6 presents the key performance metrics and training

duration for all default models, sorted by their ROC AUC score.

A visual comparison of the performance of the 13 models is

shown in Fig. 1.

For comparing the performance of the different machine

learning models in this study, we primarily focus on ROC AUC

and F1-Score as robust overall indicators of performance on

imbalanced data. Additionally, Recall (Sensitivity) and

Specificity are carefully examined to understand the critical

trade-off between minimizing false negatives and false

positives, which is paramount in a medical diagnostic context.

The results reveal distinct tiers of performance and highlight

the trade-offs between predictive power and computational cost

(training time) at the default settings.

The highest predictive performance, as measured by ROC

AUC and other key metrics, is concentrated among the

ensemble and tree-based models: Random Forest (0.9994 ROC

AUC), Extra Trees Classifier (0.9994 ROC AUC), LightGBM

(0.9994 ROC AUC), XGBoost (0.9993 ROC AUC), and

CatBoost Classifier (0.9985 ROC AUC). These models

consistently achieved Accuracy, Precision, Recall, F1-Score,

and Specificity exceeding 0.98. While their predictive

capabilities at default settings are very similar and

exceptionally high, significant differences emerge in their

training times. LightGBM stands out as particularly efficient,

training in just 0.19 seconds, followed by XGBoost (0.31s),

Extra Trees (0.41s), CatBoost (0.97s), and Random Forest

(0.99s). For practical applications where rapid retraining or

development cycles are important, the speed offered by

LightGBM, XGBoost, and Extra Trees is a notable advantage.

Beyond this top group, Gradient Boosting Machines (0.9588

ROC AUC, 3.96s) and the Deep Learning (MLP) model

(0.9511 ROC AUC, 103.13s) show a considerable drop in ROC

AUC and generally higher training times compared to the

leading tree ensembles. The MLP's training time depends on

hyperparameters like epochs and batch size, but even 50 epochs

resulted in a much longer duration compared to most other

default models.

Simpler models like Decision Tree (0.06s), K-Nearest

Neighbors (0.07s), Logistic Regression (0.13s), and Gaussian

Naïve Bayes (0.01s) exhibit significantly lower training times,

often completing in milliseconds or a fraction of a second.

Gaussian Naïve Bayes is the fastest to train. However, this

efficiency comes at the cost of predictive performance, with

ROC AUC values ranging from 0.7361 to 0.9406. Among these

faster models, KNN achieves the best balance of speed and

performance, with a respectable ROC AUC of 0.9406. The

Support Vector Machine, while theoretically powerful, shows

the longest training time (119.77s) at default settings with the

RBF kernel, coupled with relatively modest performance

metrics compared to the faster top models.

In summary, the default evaluation reveals a clear trade-off

between training time and predictive performance. While the

top ensemble methods demonstrate exceptional classification

accuracy and discriminatory power, models like LightGBM,

XGBoost, and Extra Trees offer a compelling combination of

high performance and computational efficiency. Simpler

models are faster but generally less accurate. This initial

analysis guides the selection of models for the hyperparameter

tuning phase, prioritizing those with high potential based on

their default performance metrics, while also keeping

computational cost in mind for practical considerations.
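One way to make this trade-off explicit is to list the models that are not dominated on both criteria at once, that is, models for which no other model is at least as accurate and at least as fast. The sketch below applies this idea to the ROC AUC and training-time values reported for the default models. The dominance rule and the shortened model names are choices made for this illustration and are not part of the original analysis.

```python
# (ROC AUC, training time in seconds) of the default models, as reported
results = {
    "Random Forest": (0.9994, 0.99), "Extra Trees": (0.9994, 0.41),
    "LightGBM": (0.9994, 0.19), "XGBoost": (0.9993, 0.31),
    "CatBoost": (0.9985, 0.97), "Gradient Boosting": (0.9588, 3.96),
    "MLP": (0.9511, 103.13), "KNN": (0.9406, 0.07),
    "AdaBoost": (0.9005, 1.75), "Decision Tree": (0.8439, 0.06),
    "SVM": (0.8152, 119.77), "Logistic Regression": (0.7644, 0.13),
    "Gaussian NB": (0.7361, 0.01),
}


def dominates(a, b):
    """a dominates b if it is at least as good on both criteria and strictly
    better on at least one (higher AUC, lower training time)."""
    (auc_a, t_a), (auc_b, t_b) = a, b
    return auc_a >= auc_b and t_a <= t_b and (auc_a > auc_b or t_a < t_b)


pareto = sorted(
    name for name, point in results.items()
    if not any(dominates(other, point) for other in results.values())
)
# pareto == ["Decision Tree", "Gaussian NB", "KNN", "LightGBM"]
```

Under this rule LightGBM is the only top-tier model on the frontier, since it matches the highest default ROC AUC at the lowest training time within that group. The three simpler models remain on the frontier only because they train faster.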

The Receiver Operating Characteristic (ROC) curve

provides a visual representation of a classifier's ability to

distinguish between positive and negative classes across

various probability thresholds. The Area Under the ROC Curve

(AUC) quantifies this discriminatory power, with values closer

to 1 indicating better performance. Fig. 2 displays the ROC

curves for all models evaluated at default settings. Notably, a

distinct group of models—Random Forest, Extra Trees

Classifier, LightGBM, XGBoost, and CatBoost Classifier—

exhibits curves tightly positioned near the top-left corner of the

plot, corresponding to exceptionally high AUC values ranging

from 0.9985 to 0.9994. This visually confirms their superior

discriminatory ability, achieving high True Positive Rates

while maintaining low False Positive Rates across different

thresholds.
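The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition is given below; the scores are hypothetical. The quadratic loop is for illustration only, and library implementations compute the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
auc = roc_auc(labels, scores)   # 3 of 4 pairs ranked correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why curves hugging the top-left corner correspond to AUC values close to 1.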

TABLE 6. PERFORMANCE EVALUATION OF ML MODELS.

| Model | Accuracy | Precision | Recall (Sensitivity) | F1-Score | Specificity | ROC AUC | Training Time (s) |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.9961 | 0.9975 | 0.9971 | 0.9973 | 0.9937 | 0.9994 | 0.99 |
| Extra Trees Classifier | 0.9948 | 0.9964 | 0.9964 | 0.9964 | 0.9910 | 0.9994 | 0.41 |
| LightGBM | 0.9951 | 0.9967 | 0.9964 | 0.9966 | 0.9919 | 0.9994 | 0.19 |
| XGBoost | 0.9956 | 0.9960 | 0.9978 | 0.9969 | 0.9901 | 0.9993 | 0.31 |
| CatBoost Classifier | 0.9889 | 0.9935 | 0.9909 | 0.9922 | 0.9838 | 0.9985 | 0.97 |
| Gradient Boosting Machines | 0.8575 | 0.9734 | 0.8226 | 0.8917 | 0.9442 | 0.9588 | 3.96 |
| Deep Learning (MLP) | 0.8299 | 0.9852 | 0.7730 | 0.8663 | 0.9712 | 0.9511 | 103.13 |
| K-Nearest Neighbors | 0.8735 | 0.9476 | 0.8707 | 0.9075 | 0.8804 | 0.9406 | 0.07 |
| AdaBoost Classifier | 0.7881 | 0.9262 | 0.7636 | 0.8371 | 0.8489 | 0.9005 | 1.75 |
| Decision Tree | 0.6988 | 0.9722 | 0.5945 | 0.7378 | 0.9577 | 0.8439 | 0.06 |
| Support Vector Machine | 0.6781 | 0.9599 | 0.5724 | 0.7172 | 0.9406 | 0.8152 | 119.77 |
| Logistic Regression | 0.6474 | 0.9140 | 0.5579 | 0.6929 | 0.8696 | 0.7644 | 0.13 |
| Gaussian Naïve Bayes | 0.5638 | 0.9629 | 0.4037 | 0.5689 | 0.9613 | 0.7361 | 0.01 |

Fig. 1 Visualizing performance comparison (all models)

Fig. 2 ROC curves for all models evaluated.

Conversely, the remaining models show curves

progressively closer to the diagonal random classifier line

(AUC = 0.5), indicating lower discriminatory power. Models

like Gradient Boosting Machines and the Deep Learning MLP

perform moderately well (AUCs around 0.95), positioned

below the top tier but still above random chance. Simpler

models such as Logistic Regression and Gaussian Naïve Bayes

yield curves closest to the diagonal, reflecting their limited

capacity to separate the classes effectively compared to the

more complex ensemble and neural network approaches. This

ROC analysis visually reinforces that ensemble and advanced

tree-based methods provide the strongest discrimination

performance in this liver disease prediction task at default

configurations.

C. Hyperparameter Tuning Results

Based on the promising performance of several models at

their default settings, hyperparameter tuning was performed

using RandomizedSearchCV with 5-fold Stratified K-Fold

cross-validation, optimizing for ROC AUC. This process

allowed for a more thorough exploration of each model's

potential and provided a more statistically validated estimate of

performance through cross-validation. Table 7 summarizes the

best hyperparameters found and their corresponding best cross-

validation ROC AUC scores for the selected models.

The high cross-validation ROC AUC scores achieved by the

tuned models (all above 0.99) indicate that these models are

consistently performing well across different subsets of the

training data, providing statistical confidence in their predictive

capability.
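
The two ingredients of this tuning procedure, stratified fold assignment and random sampling of candidate configurations, can be sketched with the standard library alone. This is not the scikit-learn implementation used in the study; the helper names `stratified_folds` and `sample_params`, the toy labels, and the small search space are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so that class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal indices round-robin
    return folds


def sample_params(space, n_iter, seed=0):
    """Draw n_iter random configurations from a dict of candidate value lists."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


# Hypothetical imbalanced labels: 70 positives, 30 negatives.
labels = [1] * 70 + [0] * 30
folds = stratified_folds(labels, n_splits=5)
print([sum(labels[i] for i in fold) for fold in folds])  # positives per fold

# Illustrative search space (not the grid used in the paper).
space = {"n_estimators": [100, 200, 300], "max_depth": [10, 20, None]}
for params in sample_params(space, n_iter=3):
    print(params)
```

In a full search, each sampled configuration would be trained on four folds and scored by ROC AUC on the fifth, and the mean over the five rotations would be the cross-validation score reported in Table 7.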

D. Final Performance Comparison (Top Default and Tuned Models)

For the final comparison, we selected the top 5 default

models based on their initial ROC AUC from the default

evaluation and included all models that underwent

hyperparameter tuning. These models were then evaluated on

the independent test set. Table 8 presents a comprehensive

comparison of their performance metrics and training times,

sorted by ROC AUC score in descending order. The performance

comparison for the top default and tuned models is visualized

in Fig. 3.
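
The selection rule described above (rank the default models by ROC AUC and keep the five best) can be written in a few lines. The ROC AUC values below are the default-configuration figures reported in the tables of this paper; the dictionary layout and variable names are only illustrative.

```python
# Default-configuration ROC AUC values as reported in this paper's tables.
default_auc = {
    "Random Forest": 0.9994,
    "Extra Trees": 0.9994,
    "LightGBM": 0.9994,
    "XGBoost": 0.9993,
    "CatBoost": 0.9985,
    "Gradient Boosting Machines": 0.9588,
    "Deep Learning (MLP)": 0.9511,
    "K-Nearest Neighbors": 0.9406,
    "AdaBoost": 0.9005,
    "Decision Tree": 0.8439,
    "Support Vector Machine": 0.8152,
    "Logistic Regression": 0.7644,
    "Gaussian Naive Bayes": 0.7361,
}

# Sort by ROC AUC in descending order; ties keep their insertion order.
top5 = sorted(default_auc, key=default_auc.get, reverse=True)[:5]
print(top5)
```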

E. Discussion of Results

The final comparison, sorted by ROC AUC (Table 8),

highlights that both the top performing default models and their

tuned counterparts achieve exceptionally high performance

metrics for liver disease prediction on this dataset. Specifically,

the tuned versions of Random Forest, LightGBM, and

XGBoost, along with the default versions of Random Forest,

Extra Trees, and LightGBM, demonstrate the highest ROC

AUC scores, all at 0.9994 or higher, indicating outstanding

discriminatory power. Tuned Random Forest achieved the

highest ROC AUC on the test set at 0.9995.

Comparing the default and tuned versions reveals the impact

of hyperparameter optimization. While the default ensemble

models already performed very well, tuning resulted in slight

improvements in metrics like Recall, Precision, and F1-Score

for some models, and importantly, led to the highest observed

ROC AUC. For instance, Tuned LightGBM achieved a

remarkable Recall of 0.9996, indicating its ability to correctly

identify almost all positive cases. Tuned Random Forest not

only achieved the highest ROC AUC but also the highest

Specificity at 0.9973.

The use of Stratified K-Fold cross-validation during the

hyperparameter tuning process provides statistical validation

for the performance estimates of the tuned models. The high

and consistent cross-validation scores (Table 7) demonstrate

that these models' performance is not overly sensitive to the

specific data split used for training and validation, increasing

confidence in their robustness.

A crucial consideration for practical application is the trade-

off between performance and computational efficiency. While

tuning generally improved performance and led to the best

overall models by ROC AUC, it significantly increased the

training time compared to using default parameters, as the

reported training times for tuned models include the entire

RandomizedSearchCV process. Default LightGBM, XGBoost,

and Extra Trees Classifier remain highly attractive options due

to their combination of very high performance (ROC AUC of

0.9994 or 0.9993) and significantly faster training times (under

0.5 seconds) compared to their tuned versions (hundreds of

seconds) or other models like Tuned Random Forest (over 1700

seconds). For scenarios where rapid model training or frequent

retraining is required, the default configurations of these

efficient ensemble algorithms present a favorable balance.
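
To make the trade-off concrete, the following sketch relates the ROC AUC gain from tuning to the increase in training time for three models, using the values reported in Table 8. The helper name `tuning_cost` and the data layout are illustrative assumptions.

```python
# (ROC AUC, training time in seconds) taken from Table 8.
runs = {
    "Random Forest": {"default": (0.9994, 0.99), "tuned": (0.9995, 1748.71)},
    "LightGBM": {"default": (0.9994, 0.19), "tuned": (0.9994, 271.96)},
    "XGBoost": {"default": (0.9993, 0.31), "tuned": (0.9994, 223.47)},
}


def tuning_cost(model):
    """Return (ROC AUC gain, training-time multiplier) of tuned versus default."""
    auc_default, time_default = runs[model]["default"]
    auc_tuned, time_tuned = runs[model]["tuned"]
    return round(auc_tuned - auc_default, 4), round(time_tuned / time_default)


for name in runs:
    gain, factor = tuning_cost(name)
    print(f"{name}: AUC gain {gain:+.4f}, about {factor}x the training time")
```

For Random Forest, for example, a gain of 0.0001 in ROC AUC comes with roughly 1766 times the default training time, bearing in mind that the tuned figure includes the whole search.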

The ROC curve analysis for the models included in the final

comparison (Fig. 4) visually reinforces these findings, with the

curves for the top default and tuned models closely clustered

near the top-left corner, demonstrating their superior ability to

distinguish between liver patients and non-liver patients.

TABLE 7. HYPERPARAMETER TUNING RESULTS (OPTIMIZED FOR ROC AUC)

Model | Best Cross-Validation ROC AUC | Best Parameters
XGBoost (Tuned) | 0.99989 | {'subsample': 0.7, 'reg_lambda': 0.01, 'reg_alpha': 0.01, 'n_estimators': 1000, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.05, 'gamma': 0.1, 'colsample_bytree': 0.7}
LightGBM (Tuned) | 0.99991 | {'subsample': 0.8, 'reg_lambda': 0.001, 'reg_alpha': 0.1, 'num_leaves': 31, 'n_estimators': 200, 'min_child_samples': 20, 'max_depth': 10, 'learning_rate': 0.2, 'colsample_bytree': 0.9}
CatBoost Classifier (Tuned) | 0.99984 | {'subsample': 0.9, 'learning_rate': 0.2, 'l2_leaf_reg': 5, 'iterations': 200, 'depth': 10, 'border_count': 32}
Random Forest (Tuned) | 0.99985 | {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
Extra Trees Classifier (Tuned) | 0.99986 | {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
K-Nearest Neighbors (Tuned) | 0.99393 | {'weights': 'distance', 'n_neighbors': 17, 'metric': 'manhattan'}

TABLE 8. FINAL MODEL PERFORMANCE COMPARISON ON TEST SET (TOP 5 DEFAULT AND TUNED MODELS)

Model | Accuracy | Precision | Recall (Sensitivity) | F1-Score | Specificity | ROC AUC | Training Time (s)*
Random Forest (Tuned) | 0.9961 | 0.9989 | 0.9957 | 0.9973 | 0.9973 | 0.9995 | 1748.71
LightGBM (Tuned) | 0.9979 | 0.9975 | 0.9996 | 0.9986 | 0.9937 | 0.9994 | 271.96
XGBoost (Tuned) | 0.9977 | 0.9986 | 0.9982 | 0.9984 | 0.9964 | 0.9994 | 223.47
Random Forest (Default) | 0.9961 | 0.9975 | 0.9971 | 0.9973 | 0.9937 | 0.9994 | 0.99
Extra Trees Classifier (Default) | 0.9948 | 0.9964 | | | 0.9910 | 0.9994 | 0.41
LightGBM (Default) | 0.9951 | 0.9967 | 0.9964 | 0.9966 | 0.9919 | 0.9994 | 0.19
XGBoost (Default) | 0.9956 | 0.9960 | 0.9978 | 0.9969 | 0.9901 | 0.9993 | 0.31
CatBoost Classifier (Tuned) | 0.9941 | 0.9949 | 0.9967 | 0.9958 | 0.9874 | 0.9993 | 509.98
Extra Trees Classifier (Tuned) | 0.9951 | 0.9975 | 0.9957 | 0.9966 | 0.9937 | 0.9993 | 617.29
CatBoost Classifier (Default) | 0.9889 | 0.9935 | 0.9909 | 0.9922 | 0.9838 | 0.9985 | 0.97
KNN (Tuned) | 0.9378 | 0.9857 | 0.9261 | 0.9550 | 0.9667 | 0.9893 | 44.88

* Training Time for Tuned models includes the time taken for the RandomizedSearchCV process.

Fig. 3 Visualizing performance comparison (top 5 default and tuned models)

Fig. 4 ROC curve comparison (top 5 default and tuned models)

In addition to the aggregated performance metrics, analyzing

the confusion matrices provides detailed insight into how each

model performs in correctly classifying positive (Liver Patient)

and negative (Non Liver Patient) instances. Table 4 presented

these components for the default models. For the tuned models,

the confusion matrix components on the independent test set

are presented in Table 9.

TABLE 9. CONFUSION MATRIX COMPONENTS FOR TUNED MODELS ON TEST SET

Model
Random Forest (Tuned) | 1109 | 2750
XGBoost (Tuned) | 1108 | 2575
LightGBM (Tuned) | 1105 | 2761
Extra Trees Classifier (Tuned) | 1105 | 2750
CatBoost Classifier (Tuned) | 1098 | 2753
K-Nearest Neighbors (Tuned) | 1075 | 204 | 2558

Analysis of Table 9 shows that the top-performing tuned

models, particularly LightGBM, XGBoost, Random Forest,

and Extra Trees, exhibit very low numbers of False Positives

(FP) and False Negatives (FN), aligning with their high

Precision, Recall, and Specificity scores presented in Table 8.

For instance, Tuned LightGBM has only 1 False Negative,

highlighting its exceptional ability to identify positive cases.

Tuned Random Forest shows only 3 False Positives, indicating

a very high accuracy in classifying non-patients. Tuned KNN,

while showing improvement over its default counterpart (Table

4), still has a considerably higher number of False Negatives

compared to the tree-based ensemble models, which is reflected

in its lower Recall. These detailed components further support

the findings from the aggregated metrics and are crucial for

understanding the specific types of errors each model makes,

which is vital in a medical diagnostic context.
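
The link between confusion matrix counts and the aggregated metrics can be checked with a short calculation. The sketch below assumes that the two surviving numeric columns of Table 9 are the true negatives and true positives, and combines them with the error counts stated in the text (1 False Negative for Tuned LightGBM, 3 False Positives for Tuned Random Forest). The function names are illustrative.

```python
def recall(tp, fn):
    """Sensitivity: share of actual positives that are identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives that are correctly rejected."""
    return tn / (tn + fp)


# Tuned LightGBM: TP = 2761 (Table 9, assumed column), FN = 1 (stated in the text).
lightgbm_recall = round(recall(tp=2761, fn=1), 4)

# Tuned Random Forest: TN = 1109 (Table 9, assumed column), FP = 3 (stated in the text).
rf_specificity = round(specificity(tn=1109, fp=3), 4)

print(lightgbm_recall)  # 0.9996
print(rf_specificity)   # 0.9973
```

Both values agree with the Recall and Specificity quoted for these models in the discussion of Table 8.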

In summary, the experimental results strongly support the

effectiveness of ensemble and advanced tree-based models,

particularly Random Forest, LightGBM, XGBoost, Extra

Trees, and CatBoost, for liver disease prediction.

Hyperparameter tuning can yield marginal performance

improvements, achieving the highest ROC AUC, but at the cost

of significantly increased training time. The rigorous

methodology, including preprocessing pipelines, SMOTE, and

cross-validation during tuning, enhances the reliability and

validity of these findings.

XXVII. CONCLUSION AND FUTURE WORK

This study conducted a systematic and comprehensive

evaluation of a diverse suite of machine learning classification

algorithms for the prediction of liver disease using the Liver

Disease Patient Dataset (LDPD). The primary objective was to

benchmark the performance of these models, identify those

demonstrating superior predictive capabilities and

computational efficiency, and explore the impact of

hyperparameter tuning.

The experimental results demonstrate that machine learning

classification is highly effective for this task, achieving

exceptionally high performance metrics across several models,

particularly within the ensemble and advanced tree-based

categories. The initial evaluation with default hyperparameters

established a strong baseline, with models like Random Forest,

Extra Trees Classifier, LightGBM, XGBoost, and CatBoost

Classifier achieving ROC AUC scores above 0.99.

Subsequently, hyperparameter tuning using

RandomizedSearchCV with Stratified K-Fold cross-validation

was applied to a subset of promising models. This process,

while computationally more intensive, led to marginal

improvements in performance, achieving the

highest observed metrics. The final comparison, incorporating

both top default and tuned models, revealed that Tuned

Random Forest achieved the highest ROC AUC (0.9995) and

Specificity (0.9973) on the independent test set. Tuned

LightGBM demonstrated the highest Recall (0.9996),

alongside a very high ROC AUC (0.9994). Tuned XGBoost

also exhibited outstanding performance across key metrics,

with a ROC AUC of 0.9994. These results solidify the finding

that ensemble methods, when appropriately tuned, can achieve

near-perfect discrimination and high accuracy in identifying

both positive and negative cases in this dataset.

A crucial insight from this study is the significant trade-off

between model performance and computational efficiency

(training time). While hyperparameter tuning yielded the

highest performance, it drastically increased the training

duration. Conversely, the default configurations of models like

LightGBM, XGBoost, and Extra Trees Classifier provided

exceptionally high performance (ROC AUC ≥ 0.9993) with

significantly faster training times (under 0.5 seconds). This

highlights that for practical applications where rapid model

deployment or frequent retraining is necessary, prioritizing

slightly lower, but still excellent, performance with

significantly faster training from default configurations of

efficient algorithms like LightGBM could be more suitable.

The rigorous methodology employed, including the use of

preprocessing pipelines to prevent data leakage, SMOTE to

address class imbalance, and Stratified K-Fold cross-validation

during tuning for robust performance estimation, enhances the

reliability and validity of these findings. Analysis of the

confusion matrices provided detailed insights into the types of

errors made, confirming the low rates of both false positives

and false negatives among the top models.

Based on this comprehensive evaluation, the tuned versions

of Random Forest, LightGBM, and XGBoost are identified as

the top-performing models. Considering the performance-

efficiency trade-off, the default configurations of LightGBM,

XGBoost, and Extra Trees Classifier are also highly promising

candidates for practical implementation due to their strong

performance combined with rapid training.

For future work, external validation on independent datasets

is a crucial next step before these models can be considered for

clinical application; additionally, exploring deeper model

interpretability using techniques like SHAP and LIME can

provide valuable insights into feature influence, and

investigating the practical challenges of clinical integration is

essential for real-world deployment.

REFERENCES

[1] World Life Expectancy, “Liver disease in India,” (2022, April 14).

[Online]. Available: https://www.worldlifeexpectancy.com/india-liver-disease

[2] D. R. J. P. Sindhuja and R. J. Priyadarsini, “A survey on classification

techniques in data mining for analyzing liver disease disorder,” Int. J.

Comput. Sci. Mobile Comput., vol. 5, no. 5, pp. 483–488, 2016.

[3] G. Shaheamlung, H. Kaur, and M. Kaur, “A survey on machine learning

techniques for the diagnosis of liver disease,” in Proc. 2020 Int. Conf.

Intelligent Eng. Management (ICIEM), 2020, pp. 17–19.

[4] A. Q. Md, S. Kulkarni, C. J. Joshua, T. Vaichole, S. Mohan, and C.

Iwendi, “Enhanced preprocessing approach using ensemble machine

learning algorithms for detecting liver disease,” Biomedicines, vol. 11,

no. 2, p. 581, 2023, doi: 10.3390/biomedicines11020581.

[5] B. V. Ramana, M. S. P. Babu, and N. B. Venkateswarlu, “A critical study

of selected classification algorithms for liver disease diagnosis,” Int. J.

Database Manag. Syst., vol. 3, no. 4, pp. 101–114, 2011.

[6] B. V. Ramana, M. P. Babu, and N. B. Venkateswarlu, “Liver

classification using modified rotation forest,” Int. J. Eng. Res. Dev., vol.

6, no. 4, pp. 17–24, 2012.

[7] Y. Kumar and G. Sahoo, “Prediction of different types of liver diseases

using rule based classification model,” Technol. Health Care, vol. 21, no.

5, pp. 417–432, 2013.

[8] H. Ayeldeen, O. Shaker, G. Ayeldeen, and K. M. Anwar, “Prediction of

liver fibrosis stages by machine learning model: A decision tree

approach,” in Proc. 2015 Third World Conf. Complex Syst. (WCCS),

2015, pp. 23–25.

[9] S. Hashem et al., “Comparison of machine learning approaches for

prediction of advanced liver fibrosis in chronic hepatitis C patients,”

IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 15, no. 3, pp. 861–

868, 2018, doi: 10.1109/TCBB.2017.2712365.

[10] S. Sontakke, J. Lohokare, and R. Dani, “Diagnosis of liver diseases using

machine learning,” in Proc. 2017 Int. Conf. Emerging Trends &

Innovation in ICT (ICEI), 2017, pp. 3–5.

[11] H. Ma, C.-F. Xu, Z. Shen, C.-H. Yu, and Y.-M. Li, “Application of

machine learning techniques for clinical predictive modeling: A cross-

sectional study on nonalcoholic fatty liver disease in China,” Biomed Res.

Int., vol. 2018, Art. ID 4304376, 2018, doi: 10.1155/2018/4304376.

[12] J. Jacob, J. C. Mathew, J. Mathew, and E. Issac, “Diagnosis of liver

disease using machine learning techniques,” Int. Res. J. Eng. Technol.,

vol. 5, no. 5, pp. 412–423, 2018.

[13] D. Sivakumar, M. Varchagall, and S. A. Gusha, “Chronic Liver Disease

Prediction Analysis Based on the Impact of Life Quality Attributes,” Int.

J. Recent Technol. Eng. (IJRTE), vol. 7, no. 6, pp. 2111–2117, 2019.

[14] V. Durai, S. Ramesh, and D. Kalthireddy, “Liver disease prediction using

machine learning,” Int. J. Adv. Res. Ideas Innovation Technol., vol. 5, no.

3, pp. 1584–1588, 2019.

[15] V. J. Gogi, “Prognosis of liver disease: Using machine learning

algorithms,” in Proc. Conf. Recent Innovations Electr., Electron. &

Commun. Eng. (ICRIEECE), 2018, pp. 27–28.

[16] C. Geetha and A. R. Arunachalam, “Evaluation based approaches for

liver disease prediction using machine learning algorithms,” in Proc.

2021 Int. Conf. Comput. Commun. Informatics (ICCCI), 2021, pp. 27–29.

[17] C.-C. Wu et al., “Prediction of fatty liver disease using machine learning

algorithms,” Comput. Methods Programs Biomed., vol. 170, pp. 23–29,

2019, doi: 10.1016/j.cmpb.2018.10.028.

[18] A. Shrivastava, “Liver disease patient dataset,” Kaggle. (n.d.). [Online].

Available: https://www.kaggle.com/datasets/abhi8923shriv/liver-disease-patient-dataset/data

[19] M. Sameer and B. Gupta, “Detection of epileptical seizures based on

alpha band statistical features,” Wireless Pers. Commun., vol. 115, no. 2,

pp. 909–925, 2020, doi: 10.1007/s11277-020-07542-5.

[20] S. Maldonado, J. López, and C. Vairetti, “An alternative SMOTE

oversampling strategy for high-dimensional datasets,” Appl. Soft

Comput., vol. 76, pp. 380–389, 2019, doi: 10.1016/j.asoc.2018.12.021.

Large-Scale Customer Feedback Analysis via a

Kafka Pipeline and Pre-Trained Transformers

Gasbaoui Mohammed el Amin, Benkrama Soumia, Bendjima Mostefa,

Abden Sofiane

Laboratory of TIT

Mathematics and Computer Science Department

Faculty of Exact Sciences

Tahri Mohammed University, Bechar, Algeria

gasbaoui.mohammedelamin@univ-bechar.dz

benkrama.soumia@univ-bechar.dz

bendjima.mostefa@univ-bechar.dz

abdan.soufyane@univ-bechar.dz

Abstract— In today's digital era, businesses receive massive

volumes of customer reviews across multiple platforms,

generating vast amounts of unstructured data. Effectively

analyzing this big data is crucial for understanding customer

sentiment, improving products, and enhancing services. In this

paper, we propose a data pipeline for processing user reviews

using Kafka Streams and pre-trained models from Hugging Face.

The system classifies user reviews into five sentiment categories:

very negative, negative, neutral, positive, and very positive.

Additionally, it categorizes reviews into six predefined aspects:

shipping and delivery, customer service, price and value, quality

and performance, use and design, and others. The classification

models are based on DistilBERT, a smaller and faster variant of

BERT (Bidirectional Encoder Representations from

Transformers) that retains much of its performance while

improving efficiency. The system design is complemented by a Docker

image for running the ZooKeeper service and a REST API for

handling both user requests and prediction results. The proposed

system demonstrates promising results, offering businesses a

valuable tool for evaluating their services and strengthening

customer relationships.

Keywords— Kafka Streams, Big Data, Customer Review

Analysis, Transformers, Deep Learning, Natural Language

Processing.

XXVIII. INTRODUCTION

Data is being generated at an unprecedented pace, with

internet-scale companies producing terabytes of information

each day. Efficient analysis of this data is essential for deriving

meaningful insights [1]. The expansion of the internet has

greatly facilitated the growth of user-generated content,

enabling individuals to share their opinions and participate in

discussions across diverse platforms, including blogs, social

networks, e-commerce websites, and forums. Consequently,

this has led to the generation of a substantial volume of user-

generated data [2]. Leveraging big data analytics enables

businesses to extract actionable insights, identify trends, and

make data-driven decisions, ultimately improving customer

satisfaction and competitive advantage. Big data and deep

learning are closely interconnected, as deep learning models

rely on large-scale data to achieve high accuracy and

performance. The volume, variety, and velocity of big data

offer a valuable resource that deep learning algorithms utilize

to detect patterns, generate predictions, and enhance decision-

making processes [1].

BERT (Bidirectional Encoder Representations from

Transformers) is a groundbreaking natural language processing

model. Unlike traditional models, BERT uses a bidirectional

approach to understand the context of a word based on its

surrounding words. This deep contextual understanding makes

BERT highly effective for tasks like question answering,

sentiment analysis, and language translation [3]. In the study

[2], BERT outperformed the other evaluated approaches, reaching

89% accuracy. By analyzing a large dataset of

Amazon reviews across multiple product categories, this

research offers valuable insights that help both consumers make

informed decisions and businesses enhance product and service

quality.

Kafka is a publish-subscribe messaging system written in

Scala and Java. It is designed to be highly

scalable, durable, and fault-tolerant. While its primary purpose

is to facilitate real-time data streaming for analytics, Kafka is

also widely used for tasks such as monitoring, message replay,

log aggregation, error recovery, and website activity tracking.

It offers simplicity, high throughput, and robust replication

capabilities, making it a reliable solution for handling large-

scale data streams [4]. Apache Kafka consists of several core

components that work together to enable its functionality.

Producers are responsible for sending data to Kafka topics,

while Consumers retrieve this data for processing. Topics are

logical channels that store and categorize data streams, divided

into Partitions for scalability and fault tolerance. Brokers are

servers that handle the storage and management of data in

partitions, and a cluster is formed by multiple brokers working

together. The ZooKeeper service manages metadata,

configurations, and leader election within the Kafka cluster.

Kafka’s Connect API supports integration with external

systems, and the Streams API facilitates real-time data

processing.

Our contribution involves the development of a system

designed to assist businesses in evaluating their services,

products, and customer relationships. The proposed system

architecture is built on Apache Kafka and leverages pre-trained

models from Hugging Face, utilizing the Transformers library.

Specifically, the models are based on fine-tuned DistilBERT,

enabling efficient and accurate analysis of customer feedback.

XXIX. METHODS

The entire system operates within a Docker container

running Confluent's ZooKeeper, a distributed coordination

service used by Kafka to manage the cluster. Kafka Streams

serves as the core data processing engine, orchestrating the flow

of data between the REST API, the Hugging Face model, and

various Kafka topics. Figure 1 illustrates the pipeline for

processing user reviews and performing sentiment analysis and

classification. The workflow for handling user requests consists

of the following steps:

1. The user submits a review via the REST API by

making a request to the user review endpoint.

2. The publisher component within Kafka Streams

publishes the user's request as a message to a Kafka

topic named request user.

3. The consumer/publisher component acts as both a

consumer and a publisher. In this step, it first

consumes the message from the request user topic.

4. The consumed data is then sent to a pre-trained model

hosted on Hugging Face, a platform providing access

to various pre-trained Natural Language Processing

(NLP) models.

5. The Hugging Face model, specifically a fine-tuned

DistilBERT model, performs sentiment analysis and

classifies the customer review.

6. The model's prediction (the aggregated classification

result) is published as a message to another Kafka

topic named result prediction.

7. A consumer within Kafka Streams retrieves the

prediction result from the result prediction topic and

processes it for visualization.

8. The final processed result is sent back to the REST

API, where the result prediction visualization

endpoint presents the sentiment analysis outcome.
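
The eight steps above can be traced with a small in-memory sketch. It does not use Kafka or the Hugging Face models: `queue.Queue` objects stand in for the two topics, the topic names are written with underscores as an assumption, and `classify` is a hypothetical placeholder that returns fixed illustrative labels instead of calling DistilBERT.

```python
import queue

# In-memory stand-ins for the two Kafka topics (names assumed to use underscores).
request_user = queue.Queue()
result_prediction = queue.Queue()


def classify(review):
    """Hypothetical placeholder for the two fine-tuned DistilBERT models."""
    sentiment = "negative" if "late" in review.lower() else "positive"
    return {
        "review": review,
        "sentiment": sentiment,
        "category": "shipping and delivery",
    }


def submit_review(review):
    """Steps 1-2: the REST handler publishes the request to the request topic."""
    request_user.put(review)


def process_one():
    """Steps 3-6: consume a request, run the model, publish the prediction."""
    review = request_user.get()
    result_prediction.put(classify(review))


def fetch_result():
    """Steps 7-8: consume the prediction for the visualization endpoint."""
    return result_prediction.get()


submit_review("The package arrived late.")
process_one()
print(fetch_result())
```

Because the producer and the consumer only share the two queues, either side can be replaced or scaled without changing the other, which is the decoupling that the topic-based design is meant to provide.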

XXX. RESULTS AND DISCUSSIONS

DistilBERT [5] is a transformer-based model that is smaller

and faster than BERT while retaining much of its performance.

The sentiment analysis model [6] is based on distilbert-base-

uncased, meaning it does not differentiate between uppercase

and lowercase words (e.g., "english" and "English" are treated

the same). The model classifies reviews into five categories:

very negative, negative, neutral, positive, and very

positive.

Fig. 1 The data pipeline for processing user reviews based on Kafka and

pre-trained HuggingFace models.

The pre-trained model was fine-tuned using synthetic

data, which is artificially generated data designed to replicate

real-world data while preserving similar statistical properties.

Synthetic data is created through algorithms, simulations, or

generative models, such as GANs (Generative

Adversarial Networks) or other deep learning techniques, rather

than being collected from real-world observations.

The model was fine-tuned for five epochs, achieving a

train_acc_off_by_one of approximately 0.95 on the

validation dataset [6]. This metric provides a more flexible

evaluation criterion by allowing predictions that are off by one

class to still be considered correct. For instance, in a

classification task with five sentiment labels (e.g., a scale from

1 to 5), if the true label is 3, predictions of 2, 3, or 4 would be

counted as correct, while predictions of 1 or 5 would be

considered incorrect.
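
A minimal sketch of this off-by-one criterion is given below, next to exact accuracy for comparison. The function names and the toy labels are illustrative and do not reproduce the evaluation code of the model in [6].

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def acc_off_by_one(y_true, y_pred):
    """Share of predictions within one class of the true label."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# True label 3 (neutral) paired with every possible prediction on the 1-5 scale.
y_true = [3, 3, 3, 3, 3]
y_pred = [1, 2, 3, 4, 5]

print(accuracy(y_true, y_pred))        # 0.2
print(acc_off_by_one(y_true, y_pred))  # 0.6
```

Predictions of 2, 3, and 4 count as correct under the relaxed metric, so it is always at least as high as exact accuracy.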

The second model used in this study is the Customer-

Reviews-Classification model [7], a fine-tuned DistilBERT

model specifically designed for document classification. It

categorizes customer feedback into six predefined classes:

shipping and delivery, customer service, price and value,

quality and performance, use and design, and other.

The model was trained on a synthetic dataset for seven

epochs using a learning rate of 3e-5, a batch size of 16, and

Gradient Accumulation Steps of 2. Gradient accumulation

allows the model to accumulate gradients over two batches

before updating its weights, which helps optimize memory

usage in GPU-constrained environments. A weight decay of

0.015 was applied as a regularization technique to prevent

overfitting by gradually reducing the magnitude of model

weights. Additionally, a warm-up ratio of 0.1 was used to

gradually increase the learning rate during the first 10% of

training steps, preventing instability at the start of training. The

model achieved 0.947 accuracy, 0.948 precision, recall, and F1-

score on a custom dataset [7].
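
The interaction of these training settings can be illustrated numerically. The sketch below derives the effective batch size from the reported batch size and gradient accumulation steps, and models the warm-up as a linear ramp. The linear shape, the constant rate after warm-up, and the total of 1000 steps are assumptions, since the source does not state them.

```python
BATCH_SIZE = 16          # reported batch size
GRAD_ACCUM_STEPS = 2     # reported gradient accumulation steps
BASE_LR = 3e-5           # reported learning rate
WARMUP_RATIO = 0.1       # reported warm-up ratio

# Gradients from two batches are summed before each weight update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS


def warmup_lr(step, total_steps):
    """Learning rate at a given update step.

    Assumes a linear ramp to BASE_LR over the first WARMUP_RATIO of the steps
    and a constant rate afterwards (the schedule after warm-up is not specified).
    """
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if warmup_steps and step < warmup_steps:
        return BASE_LR * (step + 1) / warmup_steps
    return BASE_LR


TOTAL_STEPS = 1000  # hypothetical number of update steps
print(effective_batch)
for step in (0, 49, 99, 500):
    print(step, warmup_lr(step, TOTAL_STEPS))
```

With these settings each update sees 32 examples while only 16 need to be held in GPU memory at a time.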

Table 1 presents some sample reviews alongside their

predicted sentiment and category classifications. While the

model accurately predicts sentiment and assigns category labels

in most cases, certain limitations arise when handling

overlapping categories or documents containing multiple

relevant labels. Since the model [7] is designed for single-label

classification, it can assign only one category per document.

Consequently, if a review exhibits characteristics of multiple

categories (e.g., both “quality and performance” and “price and value”),

the model may fail to capture all relevant aspects, leading to

potential misclassification.

TABLE I

SOME SAMPLE REVIEWS ALONGSIDE THEIR PREDICTED SENTIMENT AND CATEGORY CLASSIFICATION.

Reviews | Sentiment analysis prediction | Review classification prediction
The package arrived two days earlier than expected, and everything was securely packed. Really impressed with the fast and reliable shipping! | Positive | Shipping and delivery
My order was supposed to arrive in three days, but it took two weeks! No updates, no tracking info, and customer service was unhelpful. | Very Negative | Shipping and delivery
The support team was amazing! They responded quickly, resolved my issue, and even followed up to make sure I was satisfied. Best service ever! | Very Positive | Customer service
Not worth the money! I expected better quality for the price I paid. Feels cheap and overpriced. | Negative | Price and value
This product exceeded my expectations! It’s durable, works perfectly, and feels premium. I highly recommend it! | Very Positive | Quality and performance
Very disappointed overall. I had high expectations, but the whole experience was frustrating from start to finish. | Very Negative | Other

Regarding sentiment analysis, which is rated on a scale from

1 (very negative) to 5 (very positive), model [6] validation is

performed using the train_acc_off_by_one metric. However, a

limitation exists in predicting neutral sentiment (score of 3), as

scores of 2 and 4 are also considered correct predictions.

Leveraging Kafka’s distributed architecture enables the system

to efficiently process large volumes of user reviews. The

microservices-based design enhances flexibility, scalability,

and ease of maintenance. Additionally, Docker ensures a

consistent execution environment and simplifies deployment

across various platforms. Kafka’s publish-subscribe model

decouples data ingestion from processing, reducing latency and

improving system responsiveness. However, despite Kafka’s

optimization for streaming, system performance may be

affected by inference latency introduced by the computational

complexity of pre-trained deep learning models.

XXXI. CONCLUSION

In this study, we propose a sentiment analysis and review

classification system designed to help businesses assess the

quality of their services, products, and customer interactions.

The system leverages Kafka Streams and a pretrained

transformer model from Hugging Face, specifically a fine-

tuned DistilBERT model. We detail the hyperparameter tuning,

the dataset used, and the performance evaluation of each model

across sentiment categories ranging from very negative to very

positive. Additionally, the system classifies reviews into key

aspects such as shipping and delivery, customer service, and

more. Furthermore, we present a comprehensive analysis of the

data pipeline and system workflow, starting from the REST

API that captures user requests, processing through Kafka

Streams, and ultimately generating prediction results. The

system leverages a Docker image integrated with the Confluent

Platform, which enhances Kafka with advanced features for

accelerated application development and connectivity.

REFERENCES

[1] G. Mohammed el Amin, B. Soumia, B. Mostefa, A. Sofiane, and K.

Ikram, ‘Distributed Training Based on Horizontal Scaling for Food

Image Classification’, in 2024 4th International Conference on

Embedded & Distributed Systems (EDiS), Nov. 2024, pp. 320–324. doi:

10.1109/EDiS63605.2024.10783205.

[2] H. Ali, E. Hashmi, S. Yayilgan Yildirim, and S. Shaikh, ‘Analyzing

Amazon Products Sentiment: A Comparative Study of Machine and

Deep Learning, and Transformer-Based Techniques’, Electronics, vol.

13, no. 7, p. 1305, Mar. 2024, doi: 10.3390/electronics13071305.

[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘BERT: Pre-training

of Deep Bidirectional Transformers for Language Understanding’, in

Proceedings of the 2019 Conference of the North American Chapter of

the Association for Computational Linguistics: Human Language

Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran,

and T. Solorio, Eds., Minneapolis, Minnesota: Association for

Computational Linguistics, Jun. 2019, pp. 4171–4186. doi:

10.18653/v1/N19-1423.

[4] S. T and S. N. K, ‘A study on Modern Messaging Systems- Kafka,

RabbitMQ and NATS Streaming’, Dec. 08, 2019, arXiv:

arXiv:1912.03715. doi: 10.48550/arXiv.1912.03715.

[5] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, ‘DistilBERT, a distilled

version of BERT: smaller, faster, cheaper and lighter’, Mar. 01, 2020,

arXiv: arXiv:1910.01108. doi: 10.48550/arXiv.1910.01108.

[6] ‘tabularisai/robust-sentiment-analysis · Hugging Face’. Accessed: Mar.

29, 2025. [Online]. Available: https://huggingface.co/tabularisai/robust-

sentiment-analysis

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

[7] ‘dnzblgn/Customer-Reviews-Classification · Hugging Face’. Accessed:

Mar. 29, 2025. [Online]. Available:

https://huggingface.co/dnzblgn/Customer-Reviews-Classification


Comparative Analysis of Mortality Prediction Models at

the University Hospital Center of Oran, Algeria

Mohammed Nadjib Osmani 1, Djamila Benhaddouche 2, Nawal Sad Houari 3

1Mathematics and computer science department, 2Mathematics and computer science department, 3Living

and environment department

1mohammednadjib.osmani@univ-usto.dz

2djamila.benhaddouche@univ-usto.dz

3nawal.sadhouari@univ-usto.dz

1University of Science and Technology Mohamed Boudiaf, Oran, Algeria

2University of Science and Technology Mohamed Boudiaf, Oran, Algeria

3University of Science and Technology Mohamed Boudiaf, Oran, Algeria

Abstract— Predicting mortality is an important field of study that

aids in making wise healthcare decisions and offers insightful

information about population health. Using demographic and

hospital-service data from the University Hospital Center of Oran

(CHUO), Algeria, this study employs machine learning (ML) models to forecast the final causes of death. The dataset comprises 12,604 records and includes sex, city of residence, hospital service, and the initial, intermediate, and final causes of death. To find trends and forecast the causes of death in eight distinct groups, six machine learning models were trained and assessed: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), Multilayer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost). XGBoost achieved an accuracy and specificity of 84.05%, with a precision of 42.73%, a recall of 25.53%, and an F1 score of 28.33%. It outperformed the other evaluated models in accuracy and F1 score, which suggests that it captures intricate relationships in the data better than the alternatives. The study demonstrates how

machine learning techniques can be used to examine a variety of

variables and find significant patterns in mortality trends. This

work enhances predictive analytics in healthcare by utilizing local

data and sophisticated algorithms, providing useful instruments

for directing public health initiatives. The results highlight how

machine learning can improve healthcare outcomes and solve

issues connected to mortality in Algeria.

Keywords— Mortality, Machine Learning, classification,

Prediction, Healthcare.

I. INTRODUCTION

The health of a population and the efficiency of its healthcare

system are both significantly influenced by mortality rates. The

mortality rate in Algeria raises serious issues related to public

health. Algeria's crude death rate, which has been rather steady

in previous years, was recorded as 4.329 deaths per 1,000

inhabitants in 2022 [1]. Although it has decreased from prior

years, infant mortality is still a concern, with a rate of 17.365

deaths per 1,000 live births in 2025 [2]. There are issues with

maternal mortality as well; in 2020, there were 78 fatalities for

every 100,000 live births [3]. Furthermore, noncommunicable

illnesses are responsible for almost 74% of all fatalities in

Algeria, highlighting the necessity of strong predictive

technologies for efficient health outcome management [4].

In this regard, machine learning (ML) presents revolutionary

possibilities by facilitating precise mortality forecasts derived

from intricate datasets. To find important predictors of

mortality, these models can analyze a variety of variables,

including demographics, medical diagnoses, and healthcare

consumption patterns [5], [6]. In hospital settings, where

prompt identification of high-risk patients can direct therapies

and resource allocation, the use of machine learning for

mortality prediction is especially beneficial [7].

The goal of this work is to employ machine learning models

trained on data from the University Hospital Center of Oran

(CHUO), Algeria, to predict the ultimate causes of death for

eight different classes. The dataset contains information on

initial, intermediate, and final causes of death, sex, city of

residence, and hospital services. To increase predictive

accuracy and enhance our comprehension of mortality patterns,

this study uses models such as LR, RF, SVM, NB, MLP and

XGBoost. The main objectives of this work are:

• Developing Predictive Models: Using data from the Oran University Hospital Center, the ML models LR, RF, SVM, NB, MLP, and XGBoost are trained and assessed to predict the final cause of death across eight classes.

• Improving Mortality Prediction: Using clinical and demographic characteristics, including sex, city of residence, hospital services, and causes of death at different stages, to increase the accuracy and reliability of mortality prediction.

• Improving Healthcare Analytics: Demonstrating how ML can revolutionize healthcare analytics in Algeria by utilizing local datasets to create customized solutions. Such solutions can direct public health initiatives meant to lower avoidable deaths by identifying trends and risk factors linked to various causes of death.

The rest of this paper is organized as follows: Section 2 provides the

background and discusses the fundamentals of mortality

prediction, along with the associated challenges. Related works

are reviewed in Section 3. The proposed approach for mortality


prediction at CHUO is detailed in Section 4. Section 5 presents

the results and discussion. Finally, Section 6 offers a conclusion

and outlines directions for future work.

II. BACKGROUND

With the growing availability of electronic health records

(EHRs) and sophisticated computational tools, machine

learning (ML) techniques for mortality prediction have become

a crucial topic of healthcare study. This section examines the

main ideas surrounding mortality prediction and how they

apply to the current investigation.

A. Importance of Mortality Prediction in Healthcare

A key component of modern healthcare systems is mortality

prediction, which helps physicians evaluate patient risks,

distribute resources efficiently, and create individualized

treatments. For instance, prompt identification of high-risk

patients in intensive care units can greatly enhance outcomes

by enabling early interventions. According to studies, ML

models perform more accurately than conventional scoring

systems like SAPS III and APACHE IV; some of them even

reach an area under the curve (AUC) of 92.9% [8]. These

developments highlight how ML can revolutionize clinical

decision-making.

B. Machine Learning Models in Mortality Prediction

ML offers a robust framework for analysing complex and

heterogeneous datasets characteristic of healthcare

environments. Models such as RF, SVM, XGBoost, and neural

networks have been applied successfully to predict mortality

across various contexts:

• All-Cause Mortality: Research employing datasets such as MIMIC-III [9] has shown that by integrating factors including vital signs, test findings, and demographic data [6], [10], feature-rich ML models can attain excellent predictive accuracy for all-cause mortality.

• Disease-Specific Mortality: ML has been used to predict mortality in specific conditions, such as pancreatitis, sepsis, and stroke, frequently outperforming conventional techniques [8], [10].

• Chronic Conditions: ML models have been employed to accurately predict both long-term and short-term mortality in patients with chronic and complex illnesses by utilizing readily available characteristics and healthcare resource utilization data [5].

C. Challenges in Mortality Prediction

ML models in mortality prediction encounter several

challenges:

• Data Quality: Many studies emphasize the significance of high-quality datasets for training dependable models. Missing values or noisy data can substantially degrade model efficacy [11], [12].

• Model Responsiveness: Some studies have revealed shortcomings in the capacity of ML models to detect swiftly worsening health problems or severe injuries, highlighting the need for further improvement [12].

• Generalizability: Ensuring that models perform robustly across varied patient populations is a primary concern. Tailoring models to specific patient populations or medical scenarios may mitigate this issue [8], [12].

III. RELATED WORKS

The cause of death is a critical outcome in clinical research;

nevertheless, access to cause-of-death data is still restricted.

Several studies have been performed to classify mortality status

and ascertain particular causes of death.

Kim et al. [13] create and validate a machine-learning model to

forecast the cause of death based on a patient's most recent

medical examination. The model employed a stacking

ensemble approach to classify all-cause mortality and eight

predominant causes of death in South Korea, as well as other

causes. Clinical data from national claims (n=174,747) and

electronic health records (n=729,065) were utilized for model

building and validation, with external validation conducted on

data from three US claims databases (n=994,518, 995,372,

407,604). The model exhibited superior performance, attaining

an AUROC of 0.9511 for predicting cause of death within 60

days, and 0.8887 for external validation. Significantly, 11.32%

of fatalities in the Medicare Supplemental database were

ascribed to malignant neoplastic illness. Lee et al. [10] utilized

the MIMIC-III dataset to forecast all-cause in-hospital

mortality through sophisticated feature engineering. Essential

variables, encompassing vital signs, laboratory results, and

demographic data, were employed to train the models. Of the

models evaluated, RF had the superior performance, achieving

an AUROC of 0.94. The research underscored the essential

role of feature engineering and the application of SHAP values

[14] in elucidating how specific features influence a model's

predictions, hence underlining their importance in developing

robust models that might improve clinical decision-making. In

[6], the authors provide the IMPACT framework, which utilizes

explainable artificial intelligence (XAI) methodologies to

elucidate a state-of-the-art tree ensemble model for forecasting

all-cause mortality. The framework is utilized on the NHANES

dataset [15], which includes 47,261 samples and 151

characteristics, to examine mortality across 1-, 3-, 5-, and 10-

year follow-up intervals. The findings indicate that IMPACT

surpasses conventional linear models and neural networks in

terms of accuracy. The approach identifies neglected risk

variables, interaction effects, and correlations between

laboratory characteristics and mortality, indicating possible

modifications to existing reference intervals. The research

formulates interpretable mortality risk ratings, guaranteeing

generalizability by temporal and external validation with the

UK Biobank dataset, so rendering these scores available to both

healthcare professionals and the public.

Fig. 1 The proposed approach overview.

Shahidi et al. [16] utilized ML algorithms to forecast mortality among people in continuing care in Alberta, along with their comorbidities. LR and several ML algorithms were employed to assess the 60-day mortality risk, demonstrating superior predictive performance. The authors emphasized the need to include demographic and clinical characteristics for predicting short-term mortality. Guillamet

et al. [5] investigate the utilization of ML for forecasting death

in patients with chronic and intricate illnesses, with the

objective of improving resource allocation and decision-

making in healthcare. A classification system was employed to

forecast both long-term mortality (over four years) and early

death (within six months) utilizing available factors and

healthcare resource utilization. The XGBoost model attained

an 87% accuracy in predicting long-term mortality, but the

Gradient Boosting (GRBoost) model exhibited a lower efficacy

for early mortality, with an accuracy of 83%. A variety of

evaluation criteria, such as recall, accuracy, F1-score, and

AUC, were employed to evaluate the models' performance.

Nistal-Nuño [20] compared gradient-boosted decision trees and

logistic regression models for predicting 12-hour mortality in

ICU patients using 1-hour resolution physiological data (eight

parameters over 5 hours) from the MIMIC-III database. The

gradient-boosted model achieved an AUROC of 0.89 versus 0.806 for logistic

regression, along with higher accuracy (0.814 vs. 0.782),

diagnostic odds ratio (17.823 vs. 9.254), and improved metrics

including Cohen’s kappa, F-measure, and Matthews correlation

coefficient. These results highlight the gradient-boosted model's enhanced

ability to handle unbalanced datasets for mortality prediction,

likely due to its capacity to model complex interactions in ICU

data. García-Gallo et al. [21] developed a 1-year mortality

prediction model for sepsis patients using clinical data from the

first 24 hours of 5,650 MIMIC-III admissions (70% training,

30% validation). A Stochastic Gradient Boosting algorithm,

combined with LASSO for variable selection, achieved an

AUROC of 0.8039, outperforming traditional scores like SAPS

II, SOFA, and OASIS. The results highlight the superiority of

machine learning approaches for long-term mortality

prediction in sepsis care. Iwase et al. [22] leveraged random

forest machine learning to predict ICU mortality and stay

duration with high precision using admission data from 12,747

patients at Chiba University Hospital. The RF model achieved

exceptional performance, notably an AUC of 0.945 for

mortality and 0.881–0.889 for stay length, outperforming

conventional methods. Lactate dehydrogenase was pinpointed

as the most influential variable, aiding both outcome prediction

and patient clustering based on mortality risk.

These works collectively underscore significant progress in

ML for mortality prediction, especially in the application of

varied datasets, enhancement of interpretability, and

consideration of cause-specific outcomes. This work

introduces a dataset derived from the electronic health records

(EHRs) of mortality data from CHUO, Algeria, which has been

meticulously collected, cleaned, and structured. To the best of

our knowledge, no existing study in the literature has

employed a dataset that differentiates between several phases

of causes, such as initial and intermediate causes of death, to

determine the final cause of death.

IV. PROPOSED APPROACH

The proposed approach covers four essential steps (see

Fig. 1): (1) Data Preprocessing includes cleaning data, handling

missing values, encoding categorical variables, removing

duplicates and inconsistencies, and aggregating similar

conditions to minimize the number of classes; (2) Data

Splitting divides the dataset into training and testing subsets;

(3) Feature Selection emphasizes the identification of pertinent

features from the dataset; and (4) Classification involves

training various classification models and evaluating their

performance to determine the most effective one for the task.

A. CHUO Mortality Dataset

The dataset used in this work was collected from the

administrative records of the admissions office of the CHUO,

Algeria. These records were submitted by general practitioners

and comprise unprocessed data on 13,091 patients who died

during hospitalization between March

2018 and February 2025. The dataset has 11 variables that

contain demographic, geographical, and medical information

regarding the deceased patients. Table 1 presents a summary

of the raw dataset's composition and principal characteristics.

TABLE I

CHUO RAW DATASET’S STRUCTURE AND ATTRIBUTES

Attribute           Description                                Occurrences/Details
Identif             Unique identifier for each patient
Month               Month of death (1 to 12)
Year                Year of death (2018 to 2025)
Sexe                Gender of the patient                      Male: 7,627; Female: 5,465
Age                 Age of the patient (1 day to 99 years)     [0-18]: 3,310; [19-25]: 243; [26-50]: 1,968; [51-75]: 4,954; >75: 2,343
city_res            Wilaya (province) of residence             Patients from 136 different cities
Wilaya_dec          Wilaya where death occurred
Service             Hospital service where the patient died    Data from 75 different hospital services
Cause_death_init    Initial cause of death                     2,539 occurrences
Cause_death_iterm   Intermediate cause of death                1,450 occurrences
Cause_death_final   Final cause of death                       330 occurrences

The dataset includes variables such as gender and age

distribution. The majority of fatalities were observed in

patients aged 51 to 75 years (4,954 cases), followed by those

over 75 years (2,343 occurrences). Patients were sourced from

a diverse array of 136 cities, with supplementary data regarding

the wilaya of the death's occurrence. The dataset delineates the

evolution of mortality causes across three phases: initial,

intermediate, and final, offering a comprehensive causal chain

for mortality analysis. Data encompasses 75 distinct healthcare

services where deaths have been reported.

This dataset offers a rich foundation for analyzing mortality

trends and training ML algorithms to forecast final causes of

death. This work aims to contribute important insights into

mortality prediction in Algerian hospitals by utilizing its unique

features and comprehensive information.

B. Data Preprocessing

Effective data preprocessing is a crucial step in ensuring the

quality and reliability of any dataset analysis. Below, we detail

the key steps undertaken to preprocess the mortality dataset

from CHUO, Algeria:

1) Data Cleaning: The first step in preprocessing was to clean

the dataset by addressing inconsistencies and errors:

• Duplicate and Inconsistent Values: We identified and removed duplicate entries and inconsistent data points. Additionally, rare causes of death were excluded to focus on the most relevant patterns.

• Error Correction: Data entry errors were corrected to improve accuracy.

After the cleaning process, 487 records were removed,

leaving 12,604 valid records for analysis.

2) Handling Missing Values: Missing data can significantly

impact the quality of analysis. To address this issue:

• Identification of Missing Values: The attribute "Age" was found to have 274 missing values.

• Imputation Technique: These missing values were replaced with the mean age, ensuring that no records were excluded while maintaining statistical integrity.

3) Categorical Encoding: The dataset contained several

categorical variables (gender, city, hospital services, and causes

of death). We applied one-hot encoding to transform

categorical variables into numerical representations, making

them suitable for computational analysis.
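The imputation and encoding steps described above can be illustrated with a minimal, library-free sketch. The toy records, the field names `age`, `sex`, and `service`, and the helper functions are illustrative assumptions; they are not the CHUO data or the authors' code.

```python
from statistics import mean

# Toy records (hypothetical values); None marks a missing age.
records = [
    {"age": 70, "sex": "M", "service": "Cardiology"},
    {"age": None, "sex": "F", "service": "ICU"},
    {"age": 50, "sex": "F", "service": "Cardiology"},
]

def impute_mean(rows, field):
    """Replace missing values of `field` with the mean of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def one_hot(rows, field):
    """Expand a categorical `field` into one 0/1 column per category."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            row[f"{field}={c}"] = 1 if r[field] == c else 0
        encoded.append(row)
    return encoded

rows = impute_mean(records, "age")
for col in ("sex", "service"):
    rows = one_hot(rows, col)
```

With 75 hospital services and many distinct cause-of-death labels, one-hot encoding produces a wide, sparse feature matrix, which is one reason the aggregation of similar causes matters.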

4) Class Aggregation: Similar causes of death were aggregated

into broader categories. This reduced the number of distinct

classes for final causes of death to 8 categories (see Table 2).

This aggregation simplified the classification process while

maintaining the clinical relevance of the information.

TABLE II

THE EIGHT CLASSES OF FINAL CAUSES OF DEATH

Class   Final causes of death                                Total occurrence
        Cardiac Respiratory Arrest                           9,531
        Acute Respiratory Failure                            1,915
        Shock State - Septic - Cardiogenic - Hypovolemic     680
        Multi-organ Failure                                  161
        Neurological Failure                                 148
        Heart Failure
        Hemorrhagic - Embolic Causes
        Renal - Electrolyte Failure
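The class aggregation step can be sketched as a lookup from raw labels to broad classes, followed by removal of rare classes. The raw label strings, the shortened class name "Shock State", and the `min_count` threshold are hypothetical; the paper does not list the original 330 labels or the cut-off used for rare causes.

```python
from collections import Counter

# Hypothetical raw labels mapped to broad classes named in the paper.
AGGREGATION = {
    "septic shock": "Shock State",
    "cardiogenic shock": "Shock State",
    "hypovolemic shock": "Shock State",
    "cardiorespiratory arrest": "Cardiac Respiratory Arrest",
    "acute respiratory failure": "Acute Respiratory Failure",
}

def aggregate(labels, mapping, min_count=2):
    """Map raw labels to broad classes, then drop classes rarer than min_count."""
    mapped = [mapping.get(lbl.strip().lower()) for lbl in labels]
    counts = Counter(m for m in mapped if m is not None)
    return [m for m in mapped if m is not None and counts[m] >= min_count]

raw = ["Septic shock", "cardiogenic shock", "Cardiorespiratory arrest",
       "cardiorespiratory arrest", "acute respiratory failure", "unknown"]
classes = aggregate(raw, AGGREGATION)
```

In this toy run the unmapped label and the class that occurs only once are discarded, mirroring the exclusion of rare causes described in the data cleaning step.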

C. Data Splitting

To train our models, the data was divided into two sets: 80%

for the training set and 20% for the testing set. To address the

issue of imbalanced classes, we ensured that the class

distribution was proportionally identical in both the training

and testing sets. This stratification guarantees fair

representation of each class (see Fig. 2).
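A stratified 80/20 split of this kind can be sketched without any library as follows. The function name, the seed, and the toy label distribution are assumptions made for illustration. In scikit-learn, which the paper reports using, the same effect is usually obtained with `train_test_split(..., stratify=y)`; whether the authors used that call is not stated.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # shuffle within each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy labels with a 75/20/5 imbalance (illustrative, not the CHUO counts).
labels = ["A"] * 75 + ["B"] * 20 + ["C"] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

Stratification keeps the class proportions equal in both subsets, but it does not reduce the imbalance itself; the minority classes remain rare in the training data.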


Fig. 2 Data splitting to training and testing set.

D. Feature Selection

The dataset has 12,604 samples (rows) and 11 characteristics

(columns). Every row signifies a deceased patient. The

attributes Identif, Month, Year, city_res, and Wilaya_dec were

omitted from the model training phase due to their lack of

relevance for prediction. The attribute Cause_death_final was

selected as the target variable for prediction.

E. Classification

In this work, several machine learning classification models

were trained to predict the final causes of death. The models

utilized include LR, RF, SVM, Naive Bayes, MLP and

XGBoost. The Adam optimizer was used to train the MLP, and all

models used the same split of 80% for training and 20%

for testing.

V. RESULTS & DISCUSSION

The algorithms were developed and tested on a PC with an

Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz, 16 GB RAM,

and a 6 GB NVIDIA GTX 1660 Super graphics card. The

Python libraries scikit-learn, Keras, and TensorFlow were

utilized for development.

A. Results

As shown in Fig. 3, the performance of each model is

assessed using standard classification metrics. By comparing

these metrics across all models, the best-performing algorithm

for predicting final causes of death is identified.
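As an illustration of how such metrics can be derived for a multi-class problem, the sketch below computes accuracy and macro-averaged precision, recall, F1, and specificity from a confusion matrix. Macro averaging is an assumption here, since the paper does not state the averaging scheme, and the 3-class matrix is a toy example rather than the study's results.

```python
def macro_metrics(cm):
    """Compute accuracy and macro-averaged metrics from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and predicted class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = {"precision": [], "recall": [], "f1": [], "specificity": []}
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        for name, value in (("precision", precision), ("recall", recall),
                            ("f1", f1), ("specificity", specificity)):
            per_class[name].append(value)
    # Macro average: unweighted mean over classes, so rare classes count equally.
    result = {name: sum(vals) / k for name, vals in per_class.items()}
    result["accuracy"] = sum(cm[c][c] for c in range(k)) / total
    return result

# Toy imbalanced confusion matrix (rows: true class, columns: predicted class).
toy_cm = [[8, 2, 0],
          [3, 1, 0],
          [1, 0, 0]]
scores = macro_metrics(toy_cm)
```

In the toy matrix, accuracy is 0.60 while macro recall is only 0.35, because the two minority classes are mostly misclassified. Under macro averaging, the averaged F1 also need not equal the harmonic mean of the averaged precision and recall: for XGBoost, the harmonic mean of 42.73% and 25.53% would be about 31.96%, whereas 28.33% is reported, which is consistent with per-class averaging but does not prove it.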

Fig. 3 Performance metrics for Each Model Classifier.

For instance, the confusion matrix of the 8 classes for

the XGBoost classifier is shown in Fig. 4.


Fig. 4 XGBoost model’s confusion matrix of the predicted classes.

B. Discussion

The results in Table 3 show a significant variance in performance across the machine learning algorithms used to predict causes of death from patient demographics, initial causes of death, and intermediate causes of death. All models showed low precision (21.47%-44.98%) and recall (19.22%-31.83%). Five of the six models reached moderate accuracy (79.13%-84.05%), while LR performed the worst (35.94% accuracy), probably because its linear assumptions fail to capture complex relationships in non-linear, imbalanced data. XGBoost was the best-performing model, with 84.05% accuracy and a 28.33% F1-score. The gap between low positive-class metrics and high specificity points to systemic limitations: (1) class imbalance most likely skewed predictions toward majority classes, increasing accuracy but reducing sensitivity to rarer causes of death; (2) the limited set of demographic and hospital-service features, as opposed to the rich clinical biomarkers or comorbidities typically used in high-performing models, constrained discriminative power and reduced accuracy in predicting precise causes of death.
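The effect of class imbalance on accuracy can be made concrete with a short calculation based on the counts in Table 2. It estimates what a trivial classifier that always predicts the largest class would score. The use of full-dataset counts (rather than the test subset) and of macro-averaged recall are simplifying assumptions.

```python
# Counts reported in the paper: total records and the largest class
# (Cardiac Respiratory Arrest). Three small class counts are not legible
# in the source, so only these two figures are used.
TOTAL_RECORDS = 12_604
MAJORITY_COUNT = 9_531
N_CLASSES = 8

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = MAJORITY_COUNT / TOTAL_RECORDS

# Macro-averaged recall of that classifier: 1.0 on one class, 0.0 on the others.
baseline_macro_recall = 1 / N_CLASSES

print(f"majority-class accuracy: {baseline_accuracy:.2%}")
print(f"majority-class macro recall: {baseline_macro_recall:.2%}")
```

Under these assumptions the trivial baseline reaches roughly 75.6% accuracy, so the reported accuracies of 79% to 84% represent a modest gain over always predicting the majority class, which fits the interpretation that accuracy is inflated by imbalance.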

TABLE III

PERFORMANCE METRICS FOR EACH MODEL CLASSIFIER (%)

Models    Accuracy   Precision   Recall   F1 Score   Specificity
LR        35.94      21.47       31.83    17.15      35.94
RF        83.38      35.03       23.55    25.28      83.38
SVM       82.98      44.98       24.47    26.70      82.98
NB        80.64      24.93       23.71    23.73      80.64
MLP       79.13      22.11       19.22    19.07      79.13
XGBoost   84.05      42.73       25.53    28.33      84.05

VI. CONCLUSION AND FUTURE WORKS

This study analyzes demographic, hospital-service, and

causes of death data from CHUO, Algeria, to demonstrate how

ML models can be used to forecast the causes of mortality at

different stages. Although class imbalance and limited scope

of features caused problems for all models in terms of precision

and recall, XGBoost showed the highest accuracy (84.05%)

among the tested models. These results highlight the necessity

of more comprehensive datasets, including features such as comorbidities and

clinical biomarkers, in order to enhance prediction

performance. Despite these drawbacks, this work emphasizes

how important it is to use ML and local data to guide public

health initiatives and improve mortality prediction in Algeria.

Class imbalance and feature restrictions are the main causes

of the low precision, recall, and F1-scores shown in all models.

To address these challenges, several strategies can be pursued:

• Class Imbalance Mitigation: Models tend to prefer majority classes over minority class predictions because of the dataset's unequal distribution of death causes. Rebalancing class representation might be aided by adding data from more hospitals to the dataset. Additionally, it has been demonstrated that using sophisticated synthetic oversampling methods, like ADASYN, can increase F1-scores by 18–22% in comparable medical datasets [17], [18].

• Feature Augmentation: Enhancing the dataset with richer features, such as temporal symptom patterns or social determinants of health, could significantly improve predictive accuracy. Clinical models that incorporate such detailed data have consistently demonstrated better performance in mortality prediction tasks [10], [18].

• Hybrid Approaches: Combining resampling techniques with threshold tuning has proven effective in addressing imbalanced datasets. Such hybrid methods can improve precision by 35–48% in mortality prediction models with similar challenges [19], [18].
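As a simple illustration of the resampling idea, the sketch below performs plain random oversampling: minority-class samples are duplicated until every class matches the largest one. This is a deliberately simpler stand-in for ADASYN, which generates synthetic points instead of duplicates; the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))   # resample with replacement
            out_y.append(cls)
    return out_x, out_y

# Toy training data (illustrative): four majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["arrest", "arrest", "arrest", "arrest", "shock"]
X_bal, y_bal = random_oversample(X, y)
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the real class distribution and the reported metrics stay representative.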

REFERENCES

[1] "Algeria - Death Rate, Crude - 2025 Data 2026 Forecast 1960-2022 Historical". Accessed: Apr. 2, 2025. [Online]. Available: https://tradingeconomics.com/algeria/death-rate-crude-per-1-000-people-wb-data.html

[2] "Algeria Infant Mortality Rate 1950-2025". Accessed: Apr. 2, 2025. [Online]. Available: https://www.macrotrends.net/global-metrics/countries/DZA/algeria/infant-mortality-rate

[3] "Algeria Maternal Mortality Rate 2000-2025". Accessed: Apr. 2, 2025. [Online]. Available: https://www.macrotrends.net/global-metrics/countries/DZA/algeria/maternal-mortality-rate

[4] "Algeria". Accessed: Apr. 2, 2025. [Online]. Available: https://data.who.int/countries/012

[5] G. Hernández Guillamet et al., "Machine Learning Model for Predicting Mortality Risk in Patients With Complex Chronic Conditions: Retrospective Analysis", Online J. Public Health Inform., vol. 15, p. e52782, Dec. 2023, doi: 10.2196/52782.

[6] W. Qiu, H. Chen, A. B. Dincer, S. Lundberg, M. Kaeberlein, and S.-I. Lee, "Interpretable machine learning prediction of all-cause mortality", Commun. Med., vol. 2, no. 1, p. 125, Oct. 2022, doi: 10.1038/s43856-022-00180-x.

[7] A. Krasowski, J. Krois, A. Kuhlmey, H. Meyer-Lueckel, and F. Schwendicke, "Predicting mortality in the very old: a machine learning analysis on claims data", Sci. Rep., vol. 12, no. 1, p. 17464, Oct. 2022, doi: 10.1038/s41598-022-21373-3.

[8] O. Olang et al., "Artificial Intelligence-Based Models for Prediction of Mortality in ICU Patients: A Scoping Review", J. Intensive Care Med., p. 8850666241277134, Aug. 2024, doi: 10.1177/08850666241277134.

[9] S. Wang, M. B. A. McDermott, G. Chauhan, M. C. Hughes, T. Naumann, and M. Ghassemi, "MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III", in Proceedings of the ACM Conference on Health, Inference, and Learning, Apr. 2020, pp. 222–235. doi: 10.1145/3368555.3384469.

[10] H. Lee and P. Tsoi, "Feature-Enhanced Machine Learning for All-Cause Mortality Prediction in Healthcare Data", Mar. 27, 2025, arXiv: arXiv:2503.21241. doi: 10.48550/arXiv.2503.21241.

[11] T. Wang, "Precision and Prediction: Leveraging Artificial Intelligence in Mortality Modeling for Actuaries", 2024.

[12] T. S. Pias et al., "Low responsiveness of machine learning models to critical or deteriorating health conditions", Commun. Med., vol. 5, no. 1, p. 62, Mar. 2025, doi: 10.1038/s43856-025-00775-0.

[13] C. Kim, S. C. You, J. M. Reps, J. Y. Cheong, and R. W. Park, "Machine-learning model to predict the cause of death using a stacking ensemble method for observational data", J. Am. Med. Inform. Assoc., vol. 28, no. 6, pp. 1098–1107, Jun. 2021, doi: 10.1093/jamia/ocaa277.

[14] S. Lundberg and S.-I. Lee, "A Unified Approach to Interpreting Model Predictions", Nov. 25, 2017, arXiv: arXiv:1705.07874. doi: 10.48550/arXiv.1705.07874.

[15] "NHANES Questionnaires, Datasets, and Related Documentation". Accessed: Apr. 6, 2025. [Online]. Available: https://wwwn.cdc.gov/nchs/nhanes/Default.aspx

[16] F. Shahidi, E. Rennert-May, A. G. D’Souza, A. Crocker, P. Faris, and J. Leal, "Machine learning risk estimation and prediction of death in continuing care facilities using administrative data", Sci. Rep., vol. 13, no. 1, p. 17708, Oct. 2023, doi: 10.1038/s41598-023-43943-9.

[17] R. S. Abdulsadig and E. Rodriguez-Villegas, "A comparative study in class imbalance mitigation when working with physiological signals", Front. Digit. Health, vol. 6, p. 1377165, Mar. 2024, doi: 10.3389/fdgth.2024.1377165.

[18] L. Dube and T. Verster, "Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models", Data Sci. Finance Econ., vol. 3, no. 4, pp. 354–379, 2023, doi: 10.3934/DSFE.2023021.

[19] A. Gupta and S. Gupta, "Enhanced Classification of Imbalanced Medical Datasets using Hybrid Data-Level, Cost-Sensitive and Ensemble Methods", Int. Res. J. Multidiscip. Technovation, pp. 58–76, Apr. 2024, doi: 10.54392/irjmt2435.

[20] Nistal-Nuño, "Artificial intelligence forecasting mortality at an intensive care unit and comparison to a logistic regression system", Einstein (São Paulo), vol. 19, p. eAO6283, Sep. 2021, doi: 10.31744/einstein_journal/2021AO6283.

[21] J. E. García-Gallo, N. J. Fonseca-Ruiz, L. A. Celi, and J. F. Duitama-Muñoz, "A machine learning-based model for 1-year mortality prediction in patients admitted to an Intensive Care Unit with a diagnosis of sepsis", Medicina Intensiva, vol. 44, no. 3, pp. 160–170, Apr. 2020, doi: 10.1016/j.medin.2018.07.016.

[22] S. Iwase et al., "Prediction algorithm for ICU mortality and length of stay using machine learning", Sci. Rep., vol. 12, no. 1, p. 12912, Jul. 2022, doi: 10.1038/s41598-022-17091-5.


Data Visualization Tools in Mental Health

Informatics

Imene DAHANE*1, Abdelkrim Mebarki*2, S.S BENHARRATS*3

1SIMPA Laboratory, University of Science and Technology of Oran – Mohamed Boudiaf, Algeria.

2University of Science and Technology of Oran - Mohamed Boudiaf, Algeria.

3University of Oran 1 - Ahmed Ben Bella, Sidi Chami Psychiatric Hospital.

1imene.dahane@univ-usto.dz

2abdelkrim.mebarki@univ-usto.dz

3benharrats.sarra@univ-oran1.dz

Abstract— Data visualization is a powerful means of enhancing the

comprehension and communication of complex information. By

transforming data into graphical or pictorial formats, it becomes

more accessible and easier to interpret. In the context of mental

health, leveraging data from electronic health records (EHRs),

visualization tools enable clinicians to better understand patient

demographics, psychiatric histories, treatment outcomes, and co-

occurring conditions. This paper examines the role of data

visualization in clinical outcome analysis, patient monitoring, and

the identification of healthcare trends. Techniques such as real-

time dashboards, predictive models, and visual risk assessments

contribute to improved clinical efficiency and increased patient

engagement. Moreover, visualizations serve as effective tools for

conveying medical information to non-specialist audiences,

fostering transparency, and encouraging healthier behaviors. To

be effective, data visualization must align with the nature of the

data and the needs of the target audience. This includes selecting

the most appropriate tools and visual formats to ensure clarity and

visual appeal. Ultimately, well-designed visualizations support

informed decision-making and contribute to better patient

outcomes.

Keywords— visualization tools, mental health, Patient-centered

care, data visualization, decision-making.

XXXVIII. INTRODUCTION

In our increasingly data-driven world, it is more important than

ever to have accessible ways to view and understand data. The

demand for data literacy continues to grow across sectors,

including healthcare. Clinicians, just like professionals in

business and technology, need efficient tools to explore,

interpret, and act on complex information. That is where data

visualization becomes essential. By transforming raw data into

graphical representations—whether through dashboards,

timelines, or network diagrams—visualization tools help make

data more intuitive and actionable for decision-makers [1].

Data visualization is a powerful tool for enhancing the

understanding and communication of complex data [2]. It refers

to the process of generating graphic displays to represent data

points or statistical summaries. These visual outputs can range

from scatterplots showing individual data points to histograms

summarizing variable distributions [3]. While many industries

have embraced visualization to enhance communication and

decision-making, the mental health sector still faces challenges

in adapting these tools to clinical contexts.

Data visualization helps identify relevant information and

quickly understand massive behavioural data. One way to

improve user participation and continuity of care is through

digital feedback technology that visualizes data [4].

XXXIX. RESEARCH AND FINDINGS

In our study, we aim to conduct a comprehensive comparison

of visualization tools used in the field of mental health. Our

selection criteria are based on the extent to which these tools

address core clinical requirements, including the ability to track

patient data longitudinally, integrate heterogeneous data types

(such as numerical, textual, and categorical information), and

support usability and interpretability for mental health

professionals. By focusing on these criteria, we seek to evaluate

not only the technical capabilities of each tool but also their

relevance and applicability in real-world clinical settings.

XL. VISUALIZATION TOOLS REVIEWED FOR MENTAL

HEALTH

In recent studies, visualization tools have played a

significant role in the advancement of mental health research

and decision-making. The design of interactive information

visualization [5] on traditional desktop PCs is challenging due

to the limited display size, especially for linked-view

visualizations or multiple coordinated visualizations [6].

Figure 1 : JSNA toolkit structure [7]

The analytic framework proposed in [7] uses data science approaches to improve public mental health policies, applying visual analytics techniques for data exploration and following Kohlhammer's visual analytics process. The data in that study encompass demographic, socio-economic, and geospatial information. The framework targets population-level applications rather than individual clinical interventions, with a clear focus on integrating heterogeneous data from multiple sources.

Figure 2 : Mental diseases with their associated entities [8]

Meanwhile, the study in [8] analyzes mental health research through knowledge extraction and network visualization, focusing on knowledge extracted from the digital mental health literature. PKDE4J is used for entity extraction, CiteSpace for co-citation networks, and Gephi for co-occurrence networks. The study works with scientific texts and extracted entities such as diseases, symptoms, and technologies. It has no direct clinical application but offers significant bibliometric insights. Temporal visualization is addressed within CiteSpace's network analysis.

Similarly, the work in [9] implements an AI-driven interactive healthcare advisor delivered as a chatbot, which visualizes medical data through a conversational interface. This system incorporates biological signals and mental health data to provide patients with self-assessment tools and advice, and it handles heterogeneous data by combining biological and psychological information.

Figure 3 : Initial Snapshot Diagram Version 1 [10]

Lastly, the study in [10] on approaches for the visualization of health data proposes various diagrams, such as snapshot and pathway diagrams, to support clinical decision-making. These visualizations focus on structured medical data while incorporating temporal elements to represent patient pathways.

Overall, these studies illustrate the growing role of data science

and visualization in both enhancing mental health research and

improving clinical applications through innovative tools and

techniques. Reference [3] highlights the value of using

visualizations in clinical care as a clear and effective way to

communicate personal health data to both patients and

clinicians, underscoring the benefit of continued co-design with

all parties. The study [11] conducted a two-day field study with

twelve participants from a university community. This

preliminary evaluation involved user interaction with a robot

displaying mood data visualizations, and participant feedback

was collected through interviews, written prompts, and

qualitative analysis to refine the visualization templates.

Figure 4 : The mental health data visualization template on a social robot [11]

XLI. COMPARATIVE STUDY

TABLE XII

COMPARISON OF VISUALIZATION TOOLS IN MENTAL HEALTH FIELD

Method    | Type of tools              | Data visualized             | Data heterogeneity | Ease of use for clinicians | Temporal tracking
----------|----------------------------|-----------------------------|--------------------|----------------------------|-------------------------
2019 [7]  | JSNA, visual analytics     | Public factors              | Yes                | Medium                     | Possible
2022 [8]  | PKDE4J, Gephi, CiteSpace   | Entities and networks       | Yes                | Medium                     | Yes, with temporal graph
2020 [9]  | Chatbot (Kakao)            | Biological signals          | Yes                | High (direct interface)    | Limited
2018 [10] | Cognitive diagram (manual) | History, clinical reasoning | Yes                | Limited                    | Yes, with path diagram
2022 [11] | Social robot               | Mood                        | Yes                | High (direct interface)    | Yes

XLII. FUTURE WORK

Current studies often treat data visualization as a secondary

feature, with limited attention to psychiatrists' specific needs.

Future research should prioritize the development of a

dedicated visualization pipeline that supports clinicians

throughout the diagnostic and care process. This includes:

• Designing tools that can integrate heterogeneous clinical data (e.g., scores, text notes, physiological signals).

• Supporting temporal visualization of patient trajectories (e.g., symptom evolution, treatment timelines).

• Creating intuitive and actionable interfaces aligned with clinical workflows.
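As a rough illustration of the first two points, the sketch below merges heterogeneous observations (scores, text notes, signals) into one time-ordered timeline per patient and extracts a numeric series that a line chart could display. The `Observation` record, its field names, and the sample values are hypothetical and are not taken from any of the reviewed tools.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


# Hypothetical record type: one observation of any kind for one patient.
@dataclass(frozen=True)
class Observation:
    patient_id: str
    day: date
    kind: str                  # "score", "note", or "signal" (illustrative labels)
    value: Union[float, str]


def build_timeline(observations, patient_id):
    """Return one patient's observations sorted by date, then by kind."""
    own = [o for o in observations if o.patient_id == patient_id]
    return sorted(own, key=lambda o: (o.day, o.kind))


def score_series(timeline, kind="score"):
    """Extract (date, value) pairs of one kind, ready for a line chart."""
    return [(o.day, o.value) for o in timeline if o.kind == kind]


# Made-up sample data mixing numeric, textual, and signal observations.
records = [
    Observation("p1", date(2025, 3, 10), "score", 14.0),
    Observation("p1", date(2025, 1, 5), "score", 21.0),
    Observation("p1", date(2025, 1, 5), "note", "sleep disturbed"),
    Observation("p2", date(2025, 2, 1), "signal", 72.0),
    Observation("p1", date(2025, 2, 12), "signal", 68.0),
]

timeline = build_timeline(records, "p1")
series = score_series(timeline)
```

Keeping every data type in a single dated structure is one possible design. It lets a dashboard place a symptom score, a clinician's note, and a physiological reading on the same time axis.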

There is also a lack of qualitative research on how mental

health professionals interact with visual tools in real-life

settings. Future studies should involve clinicians directly

through interviews, co-design workshops, or observational

studies to gather in-depth insights into their preferences,

challenges, and unmet needs.

These insights will be essential to produce clinically relevant

design recommendations and guide the development of

visualization systems that truly enhance decision-making and

care quality in mental health.

XLIII. CONCLUSION

Visualization tools play an increasingly important role in

mental health research and clinical support. The reviewed

studies underscore the variety of visualization methodologies

utilized for mental health data, while also exposing a notable

deficiency: the majority of tools are not explicitly tailored for

psychiatrists. Although some facilitate temporal representation

or manage diverse data, few are entirely congruent with the

clinical decision-making process.

As a result, there is a pressing need for the development of

visualization tools that are not only technically robust but also

clinically intuitive, temporally aware, and context-sensitive.

Bridging this gap will require deeper collaboration with mental

health professionals to ensure that future tools support real-

world diagnostic reasoning, patient monitoring, and therapeutic

planning.

REFERENCES

[1] “Logiciels de Business Intelligence et d’Analytique - Tableau

FR.” Accessed: Apr. 09, 2025. [Online]. Available:

https://www.tableau.com/

[2] S. Chang, L. Gray, N. Alon, and J. Torous, “Patient and Clinician

Experiences with Sharing Data Visualizations Integrated into

Mental Health Treatment,” Soc. Sci., vol. 12, no. 12, Art. no. 12,

Dec. 2023, doi: 10.3390/socsci12120648.

[3] A. Unwin, “Why Is Data Visualization Important? What Is

Important in Data Visualization?,” Harv. Data Sci. Rev., vol. 2,

no. 1, Jan. 2020, doi: 10.1162/99608f92.8ae4d525.

[4] Y. Koh, C. Lee, Y. Ku, and U. Lee, “Data Visualization for

Mental Health Monitoring in Smart Home Environment: A Case

Study,” in 2023 Workshop on Visual Analytics in Healthcare

(VAHC), Oct. 2023, pp. 53–55. doi:

10.1109/VAHC60858.2023.00017.

[5] P. Reipschlager, T. Flemisch, and R. Dachselt, “Personal

Augmented Reality for Information Visualization on Large

Interactive Displays,” IEEE Trans. Vis. Comput. Graph., vol. 27,

no. 2, pp. 1182–1192, Feb. 2021, doi:

10.1109/TVCG.2020.3030460.

[6] R. Liu et al., “Interactive Extended Reality Techniques in

Information Visualization,” IEEE Trans. Hum.-Mach. Syst., vol.

52, no. 6, pp. 1338–1351, Dec. 2022, doi:

10.1109/THMS.2022.3211317.

[7] C. Silva, M. Saraee, and M. Saraee, “Data Science in Public

Mental Health: A New Analytic Framework,” in 2019 IEEE

Symposium on Computers and Communications (ISCC), Jun.

2019, pp. 1123–1128. doi: 10.1109/ISCC47284.2019.8969723.

[8] T. Timakum, Q. Xie, and M. Song, “Analysis of E-mental health

research: mapping the relationship between information

technology and mental healthcare,” BMC Psychiatry, vol. 22, no.

1, p. 57, Jan. 2022, doi: 10.1186/s12888-022-03713-9.

[9] T.-H. Hwang, J. Lee, S.-M. Hyun, and K. Lee, “Implementation

of interactive healthcare advisor model using chatbot and

visualization,” in 2020 International Conference on Information

and Communication Technology Convergence (ICTC), Oct.

2020, pp. 452–455. doi: 10.1109/ICTC49870.2020.9289621.

[10] V. Sharma, A. Stranieri, S. Firmin, H. Mays, and F. Burstein,

“Approaches for the visualization of health information,” in

Proceedings of the Australasian Computer Science Week

Multiconference, in ACSW ’18. New York, NY, USA:

Association for Computing Machinery, Jan. 2018, pp. 1–9. doi:

10.1145/3167918.3167958.

[11] R. Karim, Y. Zhang, P. Alves-Oliveira, E. A. Björling, and M.

Cakmak, “Community-Based Data Visualization for Mental

Well-being with a Social Robot,” in 2022 17th ACM/IEEE

International Conference on Human-Robot Interaction (HRI),

Mar. 2022, pp. 839–843. doi: 10.1109/HRI53351.2022.9889415.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 17-18, 2025, Oran, Algeria.

Greedy-based approach to Reduce Congestion

Areas in IoV

BELHADJ Aissa#1, KIES Ali#2, MOSTEFA Fatima Zahra#3, MEKKAKIA

MAAZA Zoulikha#4

#Computer Science Department, University of Sciences and Technology Mohamed

Boudiaf, Oran


1aissa.belhadj@univ-usto.dz

2ali.kies@univ-usto.dz

3Fati.mostefa@gmail.com

4zoulikh@hotmail.com

Abstract— Traffic congestion, caused by factors such as accidents, construction work in certain areas, bad weather conditions, and special events, has been increasing steadily since 1950. Intelligent transportation systems have applications in many fields, most notably traffic management. This paper proposes a plan for deploying roadside units (RSUs) that reduces congestion and maintains smooth traffic flow with continuous connectivity, using a greedy algorithm and a threshold-greedy algorithm. The results show an effective congestion coverage rate and increased connectivity.

Keywords— connectivity probability – congestion – RSUs –

deployment – greedy algorithm

XLIV. INTRODUCTION

According to statistics reported by the Global Traffic Scorecard [1], traffic congestion has reached extreme levels, disrupting traffic flow and causing numerous problems, including late arrivals at work, for travel, and at medical appointments. Several factors contribute to road congestion, including traffic accidents, weather conditions, special events, and bottlenecks, as indicated in Fig. 1. To address this challenge, intelligent transportation systems seek solutions by updating traffic management applications and improving the deployment of Internet of Vehicles (IoV) infrastructure, such as sensors and roadside units. This allows continuous monitoring of traffic conditions through data collection and analysis. With the proliferation of artificial intelligence tools, this information serves as input to algorithms that suggest possible solutions to the congestion problem [2].

The contributions of this paper include:

- Reducing the cost of deploying roadside units by using both vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication.

- Covering areas exposed to congestion with negligible lead time.

- Maintaining reliable and continuous communication by increasing the probability of connectivity.

The rest of the paper is organized as follows. Section 2 gives a short overview of roadside unit placement in the literature. Section 3 describes the problem and the system model, and formulates the roadside unit placement problem. Section 4 presents and discusses the results, and Section 5 summarizes the work.

Fig. 1. Traffic congestion causes

XLV. RELATED WORK

Identifying an appropriate solution for RSU deployment poses a significant challenge for experts in this domain. In [3], the authors aimed to enhance delay-sensitive applications by considering two communication types (V2I and V2V) and employing genetic algorithms to determine optimal placements.

The authors in [4] model the problem as a 0-1 knapsack problem, using centrality as value and cost as weight, and focus on intersections as candidate locations while disregarding remote areas. In [5], the objective is to minimize the number of RSUs needed for message dissemination within a bounded delay, selecting intersections as candidate positions based on traffic flow density, communication range, and length.

The authors of [6] observed that RSUs are effective in minimizing delay when they are placed on road segments with low traffic flow to maximize coverage. The authors in [7] offer a geometric method based on Constrained Delaunay Triangulation (CDT) that relies exclusively on Vehicle-to-Infrastructure (V2I) communication, with Roadside Units (RSUs) positioned in various locations. To solve the D1RD problem, [8] proposes an optimal algorithm, OptDynLim, which places RSUs at points with high profit density. To maximize the coverage area, the authors in [9] use a hybrid genetic algorithm with local search replacement.

A multi-objective roadside unit deployment model is presented in both [10] and [11]. It maximizes coverage, using overlapping area as a factor, and reduces the expected transmission delay under limited cost at intersections, using V2V and V2I communication, with NSGA-II and a memetic algorithm as the respective optimization methods. For safety applications with a delay threshold, the authors in [12] divide the map into regions of interest characterized by an accident rate factor. In [13], a genetic algorithm is used, relying on road segments with a large number of bus stops, high commercial activity, and total accidents as factors, in order to reduce the reception delay of safety messages sent by vehicles.

For real-time applications that require minimum latency and high data transmission speeds, the authors of [14] apply mmWave beam antenna technology and define the problem as maximum coverage under a limited budget, aiming to optimize profit within the budget constraint and to ensure maximum coverage of high-accident areas. In [15], the authors use V2V and V2I communications and select road segments with low connectivity as optimal locations for deploying a fixed number of RSUs.

XLVI. SYSTEM MODEL

A. Problem Description

The congestion problem arises from the factors mentioned in the introduction. It occurs at different locations on road sections and at intersections, and it causes numerous problems: increased air pollution, reduced production efficiency due to delays and waiting in queues, increased fuel consumption, the risk of rear-end collisions, longer emergency response times, driver anxiety, longer delays in public transportation, and so on. To reduce congestion in the vehicular network, we modeled the system as shown in Fig. 2. Roadside units are deployed at intersections close to locations with high congestion rates, provided that the distance is suitable for vehicle-to-infrastructure communication. This reduces notification time while ensuring reliable communication in the network.

B. Problem Formulation

To cover the largest number of accidents and increase the connectivity probability at the same time, we chose an objective function with two sub-objectives, as shown in Equation (1).

Fig. 2. System model for traffic congestion coverage

$F = \max (F_1 + F_2)$    (1)

The first objective is to maximize the network connectivity of the system. It is formulated in Equations (2) and (3):

$F_1 = \sum_{i=1}^{S} cn_i$    (2)

$cn_i = w_i$ if $x_i = 0$, and $cn_i = 1$ if $x_i = 1$    (3)

where $S$ is the number of intersections and $cn_i$ is the connectivity probability at intersection $i$. If there is an RSU at intersection $i$, the probability becomes 1; otherwise it keeps its initial value $w_i$. For details on how the connectivity probability is calculated, see [15].

The second objective is to maximize accident coverage at locations with low connectivity, together with the congestion coverage rate. For each candidate site, we sum the congestion coverage ratio of the site and of its neighbors, under the condition that the distance between the candidate site and each neighbor does not exceed $2R$, where $R$ is the communication range of RSUs and vehicles. Equations (4) to (6) give the formulation:

$F_2 = \sum_{i=1}^{S} Cng_i \, x_i$    (4)

where $Cng_i$ is the congestion rate covered by the RSU placed at site $i$, with

$\sum_{i=1}^{S} x_i = K$    (5)

$x_i \in \{0, 1\}$    (6)

$K$ is the constant number of roadside units that can be deployed, as expressed by the constraint in Equation (5). Equation (6) states whether a roadside unit is deployed at position $i$.
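To make the formulation concrete, the following sketch evaluates the objective for a given 0/1 placement vector. It is a minimal illustration under stated assumptions: F1 is taken as the plain sum of the connectivity values, F2 as the sum of the congestion covered by the placed RSUs, and the covered congestion of a site as its own ratio plus that of every site within 2R. The dictionary keys (`pos`, `w`, `congestion`) and the sample values are hypothetical, and any normalization used in the paper is not reproduced.

```python
import math


def congestion_covered(sites, i, comm_range):
    """Congestion ratio covered by an RSU at site i: the site itself plus
    every other site no farther than 2R away (R = comm_range)."""
    xi, yi = sites[i]["pos"]
    total = 0.0
    for s in sites:
        xj, yj = s["pos"]
        if math.hypot(xi - xj, yi - yj) <= 2 * comm_range:
            total += s["congestion"]
    return total


def objective(sites, placement, comm_range):
    """Return F1 + F2 for a 0/1 placement vector (one entry per site)."""
    # F1 (assumed form): connectivity is 1 where an RSU is placed,
    # otherwise the initial connectivity probability w_i.
    f1 = sum(1.0 if x == 1 else s["w"] for s, x in zip(sites, placement))
    # F2 (assumed form): congestion covered by the placed RSUs only.
    f2 = sum(congestion_covered(sites, i, comm_range)
             for i, x in enumerate(placement) if x == 1)
    return f1 + f2


# Hypothetical intersections: position in meters, initial connectivity w,
# and local congestion ratio.
sites = [
    {"pos": (0.0, 0.0), "w": 0.2, "congestion": 0.5},
    {"pos": (100.0, 0.0), "w": 0.6, "congestion": 0.3},
    {"pos": (1000.0, 0.0), "w": 0.9, "congestion": 0.1},
]

score = objective(sites, [1, 0, 0], comm_range=100.0)
```

With one RSU at the first site, the second site lies within 2R and contributes to the covered congestion, while the third site is too far away and only contributes its initial connectivity to F1.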

Table 1. Objective function score with 15 RSUs

Threshold | Greedy | Threshold-greedy
----------|--------|-----------------
0.2       | 0.585  | 0.3185
0.25      | 0.585  | 0.336
0.3       | 0.585  | 0.337
0.35      | 0.585  | 0.351
0.4       | 0.585  | 0.3915

XLVII. RESULTS AND DISCUSSION

To evaluate our RSU placement model, we used two greedy algorithms. The first applies a connectivity probability threshold: we divide the intersections of the study area into two sets, then sort the set whose connectivity is below the threshold by congestion coverage rate, from high to low.

The second algorithm sorts all the sites, without a threshold, by coverage rate and then computes the objective function.
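The two selection procedures can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the keys `id`, `w` (connectivity probability), and `coverage` (congestion coverage rate) are hypothetical names, and the text does not say what happens when fewer than k sites fall below the threshold, so the sketch simply returns fewer sites in that case.

```python
def greedy(sites, k):
    """Rank all sites by congestion coverage rate and keep the top k."""
    ranked = sorted(sites, key=lambda s: s["coverage"], reverse=True)
    return [s["id"] for s in ranked[:k]]


def threshold_greedy(sites, k, threshold):
    """Keep only sites whose connectivity is below the threshold, then
    rank them by congestion coverage rate and keep the top k."""
    low = [s for s in sites if s["w"] < threshold]
    ranked = sorted(low, key=lambda s: s["coverage"], reverse=True)
    return [s["id"] for s in ranked[:k]]


# Hypothetical candidate intersections.
sites = [
    {"id": "A", "w": 0.10, "coverage": 0.40},
    {"id": "B", "w": 0.50, "coverage": 0.90},
    {"id": "C", "w": 0.25, "coverage": 0.70},
    {"id": "D", "w": 0.80, "coverage": 0.60},
]

picked_all = greedy(sites, k=2)
picked_low = threshold_greedy(sites, k=2, threshold=0.3)
```

In this toy instance the plain greedy variant selects the two sites with the highest coverage, whereas the threshold variant can only choose among the low-connectivity sites.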

The results in Table 1 show that the greedy algorithm outperforms the threshold-greedy algorithm: its objective function score is 0.585 for every threshold tested, whereas the threshold-greedy score ranges from 0.3185 to 0.3915. This suggests that, in this setting, ranking all candidate sites explores the search space more effectively than restricting the ranking to low-connectivity sites.

XLVIII. CONCLUSION

In this paper, we model the deployment of roadside units to cover congestion-prone areas indirectly and in negligible time, while maintaining good network connectivity. The results show that the greedy algorithm reaches a better objective function score than the threshold-greedy algorithm.

REFERENCES

[1] INRIX, “Scorecard,” INRIX. Accessed: Apr. 25, 2025. [Online]. Available: https://inrix.com/scorecard/

[2] G. S. Olusanya, M. O. Eze, O. Ebiesuwa, and C. Okunbor, “Smart Transportation System for Solving Urban Traffic Congestion,” RCES, vol. 7, no. 3, pp. 55–59, Sep. 2020, doi: 10.18280/rces.070302.

[3] S. Mehar, S. M. Senouci, A. Kies, and M. M. Zoulikha, “An Optimized Roadside Units (RSU) placement for delay-sensitive applications in vehicular networks,” in 2015 12th Annual IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, USA: IEEE, Jan. 2015, pp. 121–127. doi: 10.1109/CCNC.2015.7157957.

[4] Z. Wang, J. Zheng, Y. Wu, and N. Mitton, “A centrality-based RSU deployment approach for vehicular ad hoc networks,” in 2017 IEEE International Conference on Communications (ICC), Paris, France: IEEE, May 2017, pp. 1–5. doi: 10.1109/ICC.2017.7996986.

[5] C. Liu, H. Huang, H. Du, and X. Jia, “Optimal RSUs placement with delay bounded message dissemination in vehicular networks,” J Comb Optim, vol. 33, no. 4, pp. 1276–1299, May 2017, doi: 10.1007/s10878-016-0034-8.

[6] S. Jain, V. K. Jain, and S. Mishra, “Probabilistic model for minimizing delay in Vehicular Networks,” in 2023 International Conference on Communication, Circuits, and Systems (IC3S), Bhubaneswar, India: IEEE, May 2023, pp. 1–5. doi: 10.1109/IC3S57698.2023.10169653.

[7] C. Ghorai and I. Banerjee, “A constrained Delaunay Triangulation based RSUs deployment strategy to cover a convex region with obstacles for maximizing communications probability between V2I,” Vehicular Communications, vol. 13, pp. 89–103, Jul. 2018, doi: 10.1016/j.vehcom.2018.07.002.

[8] Z. Gao, D. Chen, S. Cai, and H.-C. Wu, “OptDynLim: An Optimal Algorithm for the One-Dimensional RSU Deployment Problem With Nonuniform Profit Density,” IEEE Trans. Ind. Inf., vol. 15, no. 2, pp. 1052–1061, Feb. 2019, doi: 10.1109/TII.2018.2841056.

[9] D. Ghosh, H. Katehara, O. Rawlley, S. Gupta, N. Arulselvan, and V. Chamola, “Artificial Intelligence-Empowered Optimal Roadside Unit (RSU) Deployment Mechanism for Internet of Vehicles (IoV),” in 2022 IEEE 23rd International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), Belfast, United Kingdom: IEEE, Jun. 2022, pp. 495–500. doi: 10.1109/WoWMoM54355.2022.00077.

[10] L. Yu, Z. Zhang, J. Li, J. Ma, and Y. Wang, “A Multi-Objective Roadside Unit Deployment Model for an Urban Vehicular Ad Hoc Network,” IJGI, vol. 12, no. 7, p. 262, Jul. 2023, doi: 10.3390/ijgi12070262.

[11] S. Anbalagan et al., “Machine-Learning-Based Efficient and Secure RSU Placement Mechanism for Software-Defined-IoV,” IEEE Internet Things J., vol. 8, no. 18, pp. 13950–13957, Sep. 2021, doi: 10.1109/JIOT.2021.3069642.

[12] A. Jalooli, M. Song, and X. Xu, “Delay Efficient Disconnected RSU Placement Algorithm for VANET Safety Applications,” in 2017 IEEE Wireless Communications and Networking Conference (WCNC), San Francisco, CA, USA: IEEE, Mar. 2017, pp. 1–6. doi: 10.1109/WCNC.2017.7925603.

[13] M. Sankaranarayanan, M. Chelliah, and S. Mathew, “A Feasible RSU Deployment Planner Using Fusion Algorithm,” Wireless Pers Commun, vol. 116, no. 3, pp. 1849–1866, Feb. 2021, doi: 10.1007/s11277-020-07768-

[14] M. Laha and R. Datta, “A Budgeted Maximum Coverage based mmWave Enabled 5G RSUs Placement in Urban Vehicular Networks,” in 2021 International Conference on COMmunication Systems & NETworkS (COMSNETS), Bangalore, India: IEEE, Jan. 2021, pp. 387–395. doi: 10.1109/COMSNETS51098.2021.9352851.

[15] A. Kies, K. Belbachir, Z. Mekkakia Maaza, and C. Duvallet, “Optimal RoadSide Units Distribution Approach in Vehicular Ad hoc Network,” IJEEI, vol. 10, no. 1, pp. 123–132, Mar. 2022, doi: 10.52549/ijeei.v10i1.3116.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Heterogeneous Graph Neural Networks for

Product Recommendation on Transactional Retail

Data

Imad Eddine Khiloun *1, Karima Belmabrouk 2, Latifa Dekhici 2,3,

Christoph Bergmeir 4

1S.I.M.P.A laboratory, Department of Computer Science, University of Science

and Technology of Oran Mohamed Boudiaf

2Department of Computer Science, University of Science and Technology of Oran

Mohamed Boudiaf

3LDREI laboratory, Department of electrical engineering, Higher School of

Electrical and Energy Engineering of Oran

4DaSCI, Department of Computer Science and AI, University of Granada, Spain


imadeddine.khiloun@univ-usto.dz

karima.belmabrouk@univ-usto.dz

latifa.dekhici@univ-usto.dz

bergmeir@ugr.es

Abstract— Personalized product recommendation is crucial for

enhancing user experience and driving sales in e-commerce, yet

effectively leveraging sparse, implicit feedback from transactional

data remains challenging. This paper investigates the application

of heterogeneous Graph Neural Networks (GNNs) for product

recommendation on the widely used Online Retail dataset. We

frame the task as link prediction between customers and products,

constructing a heterogeneous graph from processed transactional

records. We employ a GNN model based on GraphSAGE, adapted

for heterogeneity, using learnable embeddings derived solely from

the interaction structure. Our experiments demonstrate the

model's effectiveness, achieving a ROC AUC of 0.853 and an F1-

Score of 0.750 (with 0.937 Recall) on the held-out test set using an

optimized configuration (negative sampling ratio 1.0,

classification threshold -0.5). We analyze the significant impact of

the negative sampling ratio during training on the final precision-

recall trade-off, highlighting the importance of aligning training

parameters with desired recommendation goals. Our findings

confirm the viability of heterogeneous GNNs for modeling implicit

feedback and providing effective recommendations in a retail

context. This paper is organized as follows: Section I introduces

the problem and our approach. Section II reviews related work in

recommendation systems and graph-based methods. Section III

details our methodology, including dataset description, data

preparation, graph construction, and the GNN model

architecture. Section IV describes the experimental setup and

evaluation metrics. Section V presents the quantitative results,

including the analysis of parameters like negative sampling.

Section VI discusses the findings and limitations. Finally, Section

VII concludes the paper.

Keywords— Recommendation Systems, Graph Neural Networks

(GNNs), Link Prediction, Implicit Feedback, Heterogeneous

Graphs, Online Retail, E-commerce, GraphSAGE, Node

Embeddings.

XLIX. INTRODUCTION

Personalized recommendation systems are integral to

modern online platforms, enhancing user engagement and

facilitating navigation through vast product catalogs [1]. A

primary challenge lies in leveraging sparse, implicit user

feedback (e.g., purchase histories, clicks), which only indirectly

signals preferences [2]. Effectively modeling the underlying

complex user-product interaction patterns from such data is

crucial for generating relevant suggestions but remains non-

trivial.

Traditional methods like Matrix Factorization, while

foundational [3], can struggle with the extreme sparsity often

present in real-world implicit datasets and may not fully capture

higher-order collaborative signals embedded within the

interaction structure [4]. To overcome these limitations, Graph

Neural Networks (GNNs) have gained significant traction as a

powerful tool for recommendation systems [5]–[6]. GNNs

naturally operate on graph-structured data, allowing them to

explicitly model user-product interactions and learn rich node

representations by aggregating information from their network

neighborhoods. This capability is particularly promising for

uncovering intricate patterns in implicit feedback data.
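The idea of neighborhood aggregation can be illustrated with a deliberately simplified sketch. It performs one round of mean aggregation on a small user-product graph in plain Python. Unlike a trained GNN layer, it has no learnable weights and no non-linearity, and the node names and two-dimensional vectors are made up for illustration.

```python
def mean_aggregate(embeddings, neighbors):
    """One round of mean neighborhood aggregation.

    embeddings: dict mapping node -> list of floats (current representation)
    neighbors:  dict mapping node -> list of adjacent nodes
    Each node's new vector is the average of its own vector and the mean
    of its neighbors' vectors. Isolated nodes keep their vector.
    """
    updated = {}
    for node, vec in embeddings.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            updated[node] = list(vec)
            continue
        dim = len(vec)
        nbr_mean = [sum(embeddings[n][d] for n in nbrs) / len(nbrs)
                    for d in range(dim)]
        updated[node] = [(vec[d] + nbr_mean[d]) / 2 for d in range(dim)]
    return updated


# Toy bipartite graph: users u1, u2 and products p1, p2.
embeddings = {"u1": [1.0, 0.0], "u2": [0.0, 1.0],
              "p1": [1.0, 1.0], "p2": [0.0, 0.0]}
neighbors = {"u1": ["p1"], "u2": ["p1", "p2"],
             "p1": ["u1", "u2"], "p2": ["u2"]}

after_one_round = mean_aggregate(embeddings, neighbors)
```

After one round, each user vector already mixes in information from the products that user bought. Repeating the step propagates signals from users who bought the same products, which is the higher-order collaborative signal discussed above.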

Representing user-product interactions within a

heterogeneous graph framework, distinguishing between user

nodes and product nodes, further allows GNNs to learn type-

specific representations and interaction patterns [7]. This work

explores the application of heterogeneous GNNs for

recommendation based on implicit feedback, framed

specifically as a link prediction task: predicting the likelihood

of a future interaction between a user and a product.

Our contributions are:


• A methodology for processing raw, implicit transactional data into a structured heterogeneous graph suitable for GNN modeling.

• The implementation and evaluation of a heterogeneous GNN model, utilizing the GraphSAGE architecture [8] adapted via PyTorch Geometric's to_hetero capabilities [9].

• An empirical analysis of the model's effectiveness in predicting user-product interactions using standard evaluation metrics.
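As a rough sketch of what the first contribution involves, the code below turns transaction records into a customer-product edge list with integer node indices. The record layout `(customer_id, product_id, quantity)` and the filtering rule (dropping rows without a customer identifier or with a non-positive quantity) are assumptions made for illustration; the preprocessing actually applied to the dataset may differ.

```python
def build_bipartite_graph(transactions):
    """Turn (customer_id, product_id, quantity) records into a graph.

    Returns (customer_index, product_index, edges), where the two index
    dicts map raw identifiers to consecutive integers and edges is a
    sorted list of unique (customer_idx, product_idx) 'buy' links.
    """
    customer_index, product_index, edges = {}, {}, set()
    for customer_id, product_id, quantity in transactions:
        # Assumed cleaning rule: skip rows with no identified customer
        # or a non-positive quantity (for example, returns).
        if customer_id is None or quantity <= 0:
            continue
        c = customer_index.setdefault(customer_id, len(customer_index))
        p = product_index.setdefault(product_id, len(product_index))
        edges.add((c, p))  # repeated purchases collapse into one edge
    return customer_index, product_index, sorted(edges)


# Hypothetical transaction rows.
transactions = [
    ("c1", "A", 2),
    ("c1", "B", 1),
    ("c2", "A", 5),
    (None, "C", 1),    # no customer identifier: dropped
    ("c2", "B", -1),   # non-positive quantity: dropped
    ("c1", "A", 3),    # repeat purchase: same edge as the first row
]

customer_index, product_index, edges = build_bipartite_graph(transactions)
```

The two index dictionaries give each node type its own identifier space, which is the separation between customer nodes and product nodes that a heterogeneous graph relies on.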

The remainder of this paper details our methodology,

experimental setup, results, and discussion, followed by

concluding remarks.

L. RELATED WORK

Recommendation systems have progressed significantly

beyond early collaborative filtering. While foundational Matrix

Factorization (MF) techniques [10] are well-understood, they

often grapple with the high sparsity typical of implicit feedback

datasets, sometimes failing to capture complex higher-order

user-product relations [2]–[4]. Methods adapting MF for

implicit data, such as Weighted MF [10], address the lack of

negative signals but fundamentally model pairwise interactions

[4].

Graph Neural Networks (GNNs) represent a more recent

advancement, directly leveraging the user-product interaction

graph [5]. State-of-the-art models like LightGCN [4] learn user

and product embeddings via simplified graph convolutions,

proving highly effective for recommendation by capturing

collaborative signals embedded in the graph structure,

explicitly addressing limitations of earlier methods. This

underscores the power of neighborhood aggregation in GNNs

for implicit feedback scenarios.

Furthermore, the inherent heterogeneity of user-product

interaction graphs (distinguishing between user and product

node types) motivates the use of Heterogeneous GNNs

(HGNNs) [11]. Models tailored for heterogeneous graphs, such

as HAN [7], or general GNN frameworks adapted using tools

like PyTorch Geometric [9], allow for type-specific message

passing, potentially yielding more refined user and product

representations compared to treating the graph homogeneously.

Our research utilizes these GNN principles. We employ a

heterogeneous adaptation of the inductive GraphSAGE

architecture [8], implemented using PyTorch Geometric [9], to

learn node embeddings directly from the implicit feedback

graph structure. A key focus is the practical preprocessing of

noisy transactional data to enable GNN application and the use

of neighbor sampling for efficient training on large-scale

graphs.

III. METHODOLOGY

This section details our approach, from formulating the

recommendation problem to constructing the graph and

defining the GNN model architecture.

A. Problem Formulation

We formulate the product recommendation task as a link

prediction problem on a heterogeneous graph. Let G = (V, E)

represent the interaction graph, where V = V_customer ∪

V_product is the set of nodes comprising unique customers

(V_customer) and unique products (V_product). E represents

the set of observed 'buy' interactions, forming edges between

customers and products they have purchased. The goal is to

train a model that, given the graph G, can predict the probability

P(edge(u, v) = 1 | G) of a link (a 'buy' interaction) existing

between an arbitrary customer u ∈ V_customer and product v

∈ V_product, particularly for pairs (u, v) not present in the

observed edges E.

B. Dataset Description

We utilize the standard Online Retail dataset, originally

sourced from the UCI Machine Learning Repository [12] and

widely available on platforms like Kaggle. This dataset consists

of 541,909 transactional records generated by a UK-based non-

store online retail company between December 1st, 2010, and

December 9th, 2011. The company primarily sells unique all-

occasion gifts, with a significant wholesale customer base.

Each record in the dataset represents a line item within an

invoice and contains the following key attributes:

InvoiceNo: A 6-digit integral number uniquely assigned to

each transaction. If this code starts with the letter 'C', it indicates

a cancellation.

StockCode: A 5-digit integral number uniquely assigned to

each distinct product.

Description: The product name (textual).

Quantity: The quantity of each product per transaction

(numeric).

InvoiceDate: The day and time when each transaction was

generated (timestamp).

UnitPrice: The price per unit of the product in sterling

(numeric).

CustomerID: A 5-digit integral number uniquely assigned to

each customer. Notably, this field contains a significant number

of missing values.

Country: The name of the country where each customer

resides (textual).

The presence of missing CustomerIDs and the inclusion of

cancellation records are important characteristics that

necessitate preprocessing before the data can be effectively

used for graph-based modeling, as detailed in the next

subsection.

C. Data Preparation

The initial step in our methodology involved preparing the

raw Online Retail dataset for graph construction. This required

addressing several data quality aspects and filtering records not

suitable for our link prediction task:

1) Handling Missing Customer IDs: A significant

characteristic of this dataset is the presence of transactions

without an associated CustomerID. As our graph model

requires distinct customer nodes for interaction modeling, all

rows where the CustomerID was missing were removed. This

reduced the dataset from the original 541,909 rows to 406,829

records containing identifiable customer information.

2) Filtering Cancellations: The dataset includes records

corresponding to cancelled orders, marked by an InvoiceNo

starting with 'C'. Since our goal is to model and predict

successful purchases, these cancellation records were filtered

out, further refining the dataset to 397,924 rows representing

valid, non-cancelled transaction line items.

It is important to note that even after these filtering steps, the

number of rows (397,924) is substantially larger than the

number of unique invoices present in the data (18,536 unique

non-cancelled invoices). This is because the dataset is

structured at the line-item level, meaning each row corresponds

to a specific product (StockCode) purchased within a single

invoice (InvoiceNo). Therefore, a single invoice containing

multiple different products will contribute multiple rows to the

dataset.
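The two filtering steps, and the gap between row count and invoice count, can be sketched on a few toy rows. This is a minimal illustration in plain Python, not the authors' preprocessing code: the sample rows and the helper name `clean_rows` are hypothetical, and only the column names come from the dataset description.

```python
# Toy line items standing in for the Online Retail data (values are illustrative).
rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "CustomerID": "17850"},
    {"InvoiceNo": "536365", "StockCode": "71053", "CustomerID": "17850"},
    {"InvoiceNo": "C536379", "StockCode": "85123A", "CustomerID": "14527"},
    {"InvoiceNo": "536380", "StockCode": "22961", "CustomerID": None},
]


def clean_rows(rows):
    """Keep rows that have a CustomerID and do not belong to a cancellation."""
    with_customer = [r for r in rows if r["CustomerID"] not in (None, "")]
    return [r for r in with_customer if not r["InvoiceNo"].startswith("C")]


cleaned = clean_rows(rows)
n_rows = len(cleaned)                                # line items that survive
n_invoices = len({r["InvoiceNo"] for r in cleaned})  # distinct invoices among them
```

Here two line items survive but they share one invoice, which mirrors why the cleaned dataset has far more rows than invoices.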

Fig. 1 provides an overview of the data characteristics after

these initial preparation steps, illustrating the number of unique

values present in each relevant column of the cleaned 397,924-

row dataset. This highlights the cardinality of key identifiers

like StockCode and CustomerID, which informs the subsequent

graph construction phase.

Fig. 1 Unique values per column, generated from the cleaned

DataFrame containing 397,924 rows

D. Graph Construction

With the cleaned dataset containing 397,924 valid

transaction line items, we proceeded to construct the

heterogeneous graph representation required by our GNN

model:

1) Node ID Mapping:

We first established unique, consecutive integer identifiers

for all distinct customers and products present in the cleaned

data. Each unique CustomerID was mapped to an index in the

range [0, ..., N_c-1], and each unique StockCode was mapped

to an index in the range [0, ..., N_p-1]. N_c and N_p represent

the total number of unique customers and products,

respectively, forming the nodes of our graph. These mappings

ensure consistent indexing for embedding lookups and graph

operations. This resulted in N_c = 4,339 customer nodes and

N_p = 3,665 product nodes.

2) Edge List Creation and Deduplication:

Using the mapped IDs, we generated an edge list

representing the 'buy' interactions. For each row in the cleaned

dataset, we created a potential edge from the corresponding

mapped customer ID to the mapped product ID. As our

objective is link prediction (determining if an interaction

exists), and the data contains multiple entries for the same

customer-product pair (due to repeat purchases), we performed

a critical deduplication step. We retained only the unique

(customer_id, product_id) pairs from the generated list. This

ensures that the final graph used for modeling is a simple graph,

where at most one directed edge exists from any customer u to

any product v, signifying that at least one purchase occurred.

This process yielded N_e = 266,802 unique directed 'buy'

interaction edges.

3) Heterogeneous Graph Object:

The identified nodes and the deduplicated edges were then

organized into a HeteroData object using the PyTorch

Geometric library [9]. This object explicitly defines the graph's

structure with two node types: "customer" (size N_c) and

"product" (size N_p). The N_e unique 'buy' edges were stored

under the specific edge type ("customer", "buy", "product")

using PyTorch Geometric's edge_index format (a tensor of

shape [2, N_e]).

4) Adding Reverse Edges:

To enable bidirectional information flow essential for

effective message passing in many GNN architectures, we

utilized the T.ToUndirected() transform from PyTorch

Geometric. Applied to our HeteroData object, this transform

automatically created and added the corresponding reverse

edges for the ("customer", "buy", "product") type. These

reverse edges were assigned the type ("product", "rev_buy",

"customer") and added to the HeteroData object, completing

the graph structure for GNN input.
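The construction steps above can be sketched without any graph library. The following is a minimal plain-Python illustration under stated assumptions: the toy purchases and the helper name `build_index` are hypothetical, and the nested lists only mimic the [2, N_e] layout of an edge_index tensor rather than using PyTorch Geometric itself.

```python
# Toy (CustomerID, StockCode) purchases taken from cleaned line items (illustrative).
purchases = [
    ("17850", "85123A"),
    ("17850", "71053"),
    ("13047", "85123A"),
    ("17850", "85123A"),  # repeat purchase of the same product
]


def build_index(values):
    """Map each distinct value to a consecutive integer, in order of first appearance."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index


customer_index = build_index(c for c, _ in purchases)
product_index = build_index(p for _, p in purchases)

# Deduplicate (customer, product) pairs while keeping a stable order.
unique_pairs = list(dict.fromkeys(
    (customer_index[c], product_index[p]) for c, p in purchases
))

# edge_index layout: row 0 holds source (customer) ids, row 1 holds target (product) ids.
buy_edge_index = [[u for u, _ in unique_pairs], [v for _, v in unique_pairs]]

# Reverse edges swap the two rows, giving product -> customer edges.
rev_buy_edge_index = [buy_edge_index[1], buy_edge_index[0]]
```

The repeat purchase collapses into a single edge, so four line items yield three 'buy' edges and three matching reverse edges.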

This graph construction process resulted in a heterogeneous

graph capturing the unique customer-product purchase

relationships, ready for the GNN model detailed next. Table I

summarizes the graph characteristics.

TABLE I

GRAPH CHARACTERISTICS

| Property | Value |
|---|---|
| Customer Nodes (N_c) | 4,339 |
| Product Nodes (N_p) | 3,665 |
| 'Buy' Edges (N_e) | 266,802 |
| 'Rev-Buy' Edges | 266,802 |
| Graph Density | 1.68% |
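The reported density is consistent with dividing the number of 'buy' edges by the number of possible customer-product pairs. The paper does not state the formula it used, so this bipartite definition is an assumption; the node and edge counts are the ones reported above.

```python
n_customers = 4_339    # N_c
n_products = 3_665     # N_p
n_buy_edges = 266_802  # N_e

# In a bipartite graph every customer could, in principle, link to every product.
possible_pairs = n_customers * n_products
density = n_buy_edges / possible_pairs
density_percent = round(density * 100, 2)
```

Under this definition roughly 98% of all customer-product pairs are unobserved, which is the sparsity the link prediction model has to cope with.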

E. Model Architecture

Our proposed model for predicting customer-product

interactions leverages a heterogeneous Graph Neural Network

structure, designed to learn meaningful representations from

the interaction graph. The architecture consists of embedding

layers, a GNN encoder, and a link classifier.

GNNs are a class of deep learning models designed to

operate directly on graph data [5]. They learn node

representations (embeddings) by iteratively aggregating

information from a node's local neighborhood through a

process often called message passing [14]. This allows GNNs

to capture both node features (if available) and, crucially for

our task, the complex topological structure of the interaction

graph.

1) Input Node Embeddings:

Given the absence of explicit node features in the Online

Retail dataset, we employ learnable embedding layers to

represent each node type. We initialize two

torch.nn.Embedding layers: one for customers (customer_emb)

with dimension (N_c, hidden_channels), and one for products

(product_emb) with dimension (N_p, hidden_channels). N_c

and N_p are the number of unique customers and products

identified during graph construction, and hidden_channels is a

hyperparameter defining the dimensionality of the embeddings.

• Justification:

Learnable embeddings are essential when rich node

attributes are unavailable, allowing the model to learn

representations directly from the graph structure and interaction

patterns [2]–[3]. The dimensionality, hidden_channels, was set

to 64. This is a common choice in GNN research, striking a

balance between model capacity (allowing for sufficiently

complex representations) and computational cost/risk of

overfitting associated with very high dimensions.

2) Heterogeneous GNN Encoder:

We employ a GNN encoder to propagate information across

the graph and refine the initial embeddings based on

neighborhood context.

The core architecture is based on GraphSAGE [8],

implemented using SAGEConv layers.

GraphSAGE is an influential inductive GNN framework

designed to efficiently generate embeddings for nodes in large

graphs. Instead of using the full graph Laplacian (like some

earlier GCN variants), GraphSAGE learns aggregation

functions (e.g., mean, max-pooling) that gather information

from a fixed-size, uniformly sampled set of a node's local

neighbors. This aggregated neighborhood information is then

combined with the node's own current representation and

passed through transformation layers to generate the node's

embedding for the next layer [8]. This sampling approach

enhances scalability and inductive capability.
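A single mean-aggregator update can be sketched in a few lines. This is a minimal plain-Python illustration, not the SAGEConv implementation: the function names, the toy embeddings, and the identity weight matrices are hypothetical. It uses the sum of two linear maps (one for the node, one for the neighbor mean), which is equivalent to applying one weight matrix to the concatenation of the two vectors.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sage_mean_update(h, neighbors, node, w_self, w_neigh, num_samples, rng):
    """One mean-aggregator update for `node`, followed by ReLU."""
    pool = neighbors[node]
    # Sample a fixed-size neighbor set (or all neighbors if there are fewer).
    sampled = rng.sample(pool, min(num_samples, len(pool)))
    dim = len(h[node])
    if sampled:
        mean = [sum(h[n][i] for n in sampled) / len(sampled) for i in range(dim)]
    else:
        mean = [0.0] * dim
    # Combine the node's own representation with the aggregated neighborhood.
    combined = [a + b for a, b in zip(matvec(w_self, h[node]), matvec(w_neigh, mean))]
    return [max(0.0, value) for value in combined]


h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sage_mean_update(h, neighbors, 0, identity, identity,
                           num_samples=2, rng=random.Random(0))
```

With identity weights, node 0 ends up with its own vector plus the mean of its two neighbors, which makes the aggregation step easy to check by hand.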

GraphSAGE is well-suited for this task due to its inductive

capabilities and its effectiveness in generating representations

by sampling and aggregating neighbor information. We utilize

a two-layer GNN structure.

• Justification:

A 2-layer GNN allows each node's final embedding to

capture information from its 2-hop neighborhood. This

typically provides a good balance between capturing sufficient

local structure (direct interactions and interactions of

neighbors) and computational complexity [4]. Deeper GNNs

can sometimes suffer from over-smoothing, where node

embeddings become indistinguishable [13]. Two layers

represent a standard and often effective depth for many link

prediction tasks.

A ReLU activation function is applied after the first

SAGEConv layer to introduce non-linearity, enabling the

model to learn more complex functions.

This base GNN is adapted for heterogeneity using PyTorch

Geometric's to_hetero wrapper [9], informed by the graph's

metadata. This crucial step allows the model to learn distinct

message-passing parameters for (customer, buy, product) and

(product, rev_buy, customer) edge types, respecting the

different roles of nodes and interaction directions.

3) Link Prediction Classifier:

To generate a prediction score for a potential link between a

customer u and a product v, we use their final embeddings (z_u,

z_v) generated by the GNN encoder. We employ a dot product

mechanism: score(u, v) = z_u · z_v.

• Justification:

The dot product is a simple, efficient, and commonly used

similarity function in latent factor models and embedding-

based recommendation [3]. It effectively measures the

alignment or compatibility between the learned customer and

product vectors in the shared latent space, with higher values

indicating stronger predicted affinity. Its simplicity also helps

reduce the number of additional parameters in the model.
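The scoring step amounts to an inner product of the two final embeddings, with a sigmoid applied only when a probability is needed. A minimal sketch, assuming three-dimensional toy embeddings (the real model uses hidden_channels = 64) and hypothetical function names:

```python
import math


def link_logit(z_u, z_v):
    """Dot product of a customer embedding and a product embedding."""
    return sum(a * b for a, b in zip(z_u, z_v))


def link_probability(z_u, z_v):
    """Sigmoid of the logit, interpreted as the probability of a 'buy' link."""
    return 1.0 / (1.0 + math.exp(-link_logit(z_u, z_v)))


z_customer = [0.5, -1.0, 2.0]
z_product = [1.0, 0.5, 0.25]
logit = link_logit(z_customer, z_product)
probability = link_probability(z_customer, z_product)
```

The classifier itself has no parameters; everything it knows comes from the embeddings produced by the encoder.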

The overall model integrates these components,

processing graph batches to generate link prediction logits.

IV. EXPERIMENTS

This section outlines the experimental methodology

employed to train our heterogeneous GNN model and evaluate

its performance on the task of predicting customer-product

purchase interactions using the graph constructed in the

previous section.

A. Experimental Setup

1) Data Splitting:

The interaction edges ("customer", "buy", "product") were

partitioned into training (80%), validation (10%), and test

(10%) sets using T.RandomLinkSplit [9]. Negative sampling

was enabled to generate non-interacting pairs for training

supervision. The primary experimental setup, presented first,

utilized a negative sampling ratio of 2.0 (two negative samples

per positive training edge). For comparative analysis (presented

in Section V), setups using ratios of 1.0 and 4.0 were also

trained and evaluated. The splitting procedure ensured

appropriate handling of reverse edges and included negative

samples in the validation and test sets for evaluation.
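The idea behind the split can be illustrated with a simplified stand-in. The sketch below is not RandomLinkSplit and ignores details such as reverse-edge handling and message-passing versus supervision edges; the function names and the toy graph are hypothetical. Enumerating all non-edges is only practical for a toy graph; at the scale of the real graph, rejection sampling would be the more realistic choice.

```python
import random


def split_edges(edges, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle positive edges and cut them into train/validation/test lists."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[n_val + n_test:]
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    return train, val, test


def sample_negatives(known_edges, n_customers, n_products, n_samples, seed=0):
    """Draw distinct (customer, product) pairs that are not observed edges."""
    rng = random.Random(seed)
    known = set(known_edges)
    candidates = [(u, v) for u in range(n_customers) for v in range(n_products)
                  if (u, v) not in known]
    return rng.sample(candidates, n_samples)


# Toy graph: 5 customers, 6 products, 10 observed 'buy' edges.
edges = [(u, (u + k) % 6) for u in range(5) for k in range(2)]
train_edges, val_edges, test_edges = split_edges(edges)

neg_ratio = 2.0  # negatives per positive training edge
train_negatives = sample_negatives(edges, 5, 6, int(neg_ratio * len(train_edges)))
```

Changing `neg_ratio` to 1.0 or 4.0 changes only how many non-edges accompany each positive edge, which is the quantity varied in the comparative experiments.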

2) Mini-Batching and Neighbor Sampling:

Training and evaluation were performed using mini-batches

generated via LinkNeighborLoader [9] to handle the graph size

efficiently.

• Neighbor Sampling:

For each target supervision edge in a batch, a 2-hop

computation subgraph was sampled using num_neighbors=[20,

10]. This configuration samples up to 20 neighbors for the

source/destination nodes (1-hop) and up to 10 neighbors for

each of those nodes (2-hop), providing local context for the

GNN while maintaining computational feasibility [8].

• Batch Size:

A batch_size of 128 supervision edges was used for training,

balancing gradient estimation quality and memory usage.

Validation and testing used a larger batch size for efficiency.

• Data Shuffling:

The training loader employed shuffling (shuffle=True) for

each epoch, while validation and test loaders did not

(shuffle=False) to ensure consistent evaluation.
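The effect of a fan-out list such as num_neighbors=[20, 10] can be sketched with a small sampler. This is a simplified plain-Python illustration, not LinkNeighborLoader: the function name and toy graph are hypothetical, and nodes are written as (type, id) tuples to keep customers and products apart. Under this scheme each seed node pulls in at most 20 + 20 × 10 = 220 sampled neighbors.

```python
import random


def sample_computation_nodes(adjacency, seeds, num_neighbors, seed=0):
    """Collect nodes reached by sampling up to num_neighbors[i] neighbors at hop i."""
    rng = random.Random(seed)
    frontier = list(seeds)
    reached = set(seeds)
    for fanout in num_neighbors:
        next_frontier = []
        for node in frontier:
            pool = adjacency.get(node, [])
            next_frontier.extend(rng.sample(pool, min(fanout, len(pool))))
        reached.update(next_frontier)
        frontier = list(dict.fromkeys(next_frontier))  # drop duplicates, keep order
    return reached


# Toy bipartite graph: two customers who each bought the same five products.
adjacency = {
    ("customer", 0): [("product", j) for j in range(5)],
    ("customer", 1): [("product", j) for j in range(5)],
}
for j in range(5):
    adjacency[("product", j)] = [("customer", 0), ("customer", 1)]

nodes = sample_computation_nodes(adjacency, [("customer", 0)], num_neighbors=[2, 1])
products = [n for n in nodes if n[0] == "product"]
```

The fan-out caps the subgraph size regardless of how popular a product is, which is what keeps mini-batch cost bounded on high-degree nodes.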

3) Model Training Details:

• Model Configuration:

The heterogeneous GNN model detailed in the previous

section was used, configured with a hidden embedding

dimension (hidden_channels) of 64.

• Optimizer:

The Adam optimizer was employed with a learning rate of

lr=0.001. Adam is a standard choice for deep learning models,

known for its adaptive learning rates and generally robust

performance.

• Loss Function:

We utilized the binary cross-entropy loss with logits. This

loss is appropriate for binary link prediction tasks where the

model outputs raw logits, providing numerical stability by

combining the sigmoid activation and loss calculation.

• Training Duration:

The model was trained for 50 epochs. This duration was

determined based on preliminary observations where both the

training loss and key validation metrics showed significant

stabilization and diminishing returns beyond this point,

suggesting reasonable convergence for comparative purposes.
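The numerical-stability argument for the loss can be made concrete. The sketch below implements the standard stable form of binary cross-entropy on logits for a single example and compares it with the naive sigmoid-then-log version; the function names are hypothetical and this is an illustration of the formula, not the library routine used in the experiments.

```python
import math


def bce_with_logits(logit, label):
    """Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, label):
    """Sigmoid first, then log; breaks down once the sigmoid saturates to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


loss_at_zero = bce_with_logits(0.0, 1.0)      # equals log(2)
loss_extreme = bce_with_logits(1000.0, 0.0)   # stays finite

try:
    naive_bce(1000.0, 0.0)
    naive_failed = False
except ValueError:
    # The sigmoid rounds to exactly 1.0, so log(1 - p) is log(0).
    naive_failed = True
```

For moderate logits the two versions agree; the stable form matters for confidently wrong predictions, where the naive version cannot be evaluated.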

B. Evaluation Metrics

To assess the performance of our GNN model on the link

prediction task, we employed several standard evaluation

metrics, calculated on the validation and test sets:

Confusion Matrix: A table summarizing the counts of True

Positives (TP), True Negatives (TN), False Positives (FP), and

False Negatives (FN).

Area Under the Receiver Operating Characteristic Curve

(AUC / ROC AUC): This metric evaluates the model's ability

to correctly rank positive instances higher than negative

instances, irrespective of a specific classification threshold. The

ROC curve plots the True Positive Rate (Recall) against the False Positive

Rate at various thresholds. An AUC score of 1.0 represents

perfect ranking, while 0.5 signifies random chance

performance. AUC is calculated using the raw prediction scores

(logits) from the model.

Accuracy: The proportion of all predictions (both positive

and negative links) that the model classified correctly. It is

calculated as (True Positives + True Negatives) / (Total

Predictions).

Precision: The proportion of predicted positive links that

were actually correct. It measures the relevance of the positive

predictions and is calculated as True Positives / (True Positives

+ False Positives).

Recall (True Positive Rate): The proportion of actual

positive links that the model correctly identified. It measures

the model's ability to find all relevant instances and is

calculated as True Positives / (True Positives + False

Negatives).

F1-Score: The harmonic mean of Precision and Recall,

providing a single score that balances both metrics. It is

calculated as 2 * (Precision * Recall) / (Precision + Recall).

Threshold Dependency: All of the metrics above except AUC (the Confusion Matrix, Accuracy, Precision, Recall, and F1-Score) require converting the model's continuous output scores (logits) into discrete binary predictions (0 or 1). This conversion depends on a chosen classification threshold. Based on preliminary analysis balancing precision and recall (detailed in Section V), a threshold of -0.5 on the logits was selected for reporting these metrics. AUC is calculated independently of this threshold using the raw scores.
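How the threshold enters these metrics can be shown on a handful of toy scores. The sketch below is a plain-Python illustration with hypothetical function names and made-up logits; it assumes a link is predicted when the logit is strictly greater than the threshold, a detail the text does not specify. AUC is computed as the fraction of positive-negative pairs ranked correctly, which never consults the threshold.

```python
def confusion_counts(logits, labels, threshold):
    """Return (TP, TN, FP, FN), predicting a link when logit > threshold (assumed rule)."""
    tp = tn = fp = fn = 0
    for logit, label in zip(logits, labels):
        predicted = 1 if logit > threshold else 0
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(logits, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(logits, labels) if y == 1]
    negatives = [s for s, y in zip(logits, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


logits = [2.0, -0.2, -1.0, 0.3, -0.7, -2.0]  # toy model outputs
labels = [1, 1, 1, 0, 0, 0]

at_zero = classification_metrics(*confusion_counts(logits, labels, 0.0))
at_minus_half = classification_metrics(*confusion_counts(logits, labels, -0.5))
auc = roc_auc(logits, labels)
```

On these toy scores, lowering the threshold from 0.0 to -0.5 recovers one more true link and raises recall, while the AUC value is the same for both settings.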

V. RESULTS

This section presents the evaluation results of our

heterogeneous GNN model trained using the experimental

setup described previously. We first report the performance of

the baseline configuration and then analyze the impact of

varying key hyperparameters.

A. Baseline Model Performance

Our baseline model was configured using the parameters

detailed in the previous section: Adam optimizer (lr=0.001), a

negative sampling ratio of 2.0 during training split generation,

and neighbor sampling parameters num_neighbors=[20, 10].

The model was trained for 50 epochs. Performance was

evaluated on the held-out validation set.

Initial evaluation compared classification thresholds of 0.0

and -0.5 on the validation set. While threshold=0.0 yielded

higher precision (0.8257), threshold=-0.5 achieved

substantially higher Recall (0.7268 vs 0.4784) and a better

overall F1-Score (0.7085 vs 0.6058). Given the goal of

identifying potential purchase interactions, the improved Recall

and F1-Score at threshold -0.5 led us to select it as the primary

threshold for reporting classification metrics in subsequent

analyses.
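Two quick checks help interpret this choice. Because the threshold is applied to logits, -0.5 corresponds to a probability cutoff of about 0.38 rather than 0.5, and the reported F1-Scores can be reproduced (up to rounding of the published precision and recall) from the harmonic-mean formula. The helper names below are hypothetical; the precision and recall values are the ones quoted in the paragraph above, with precision at -0.5 taken from Table II.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Probability cutoff implied by a logit threshold of -0.5.
probability_cutoff = sigmoid(-0.5)

# F1 recomputed from the rounded precision/recall values reported for each threshold.
f1_at_zero = f1_score(0.8257, 0.4784)
f1_at_minus_half = f1_score(0.6912, 0.7268)
```

Small differences in the fourth decimal place are expected, since the published precision and recall are themselves rounded.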

The quantitative results for the link prediction task at both classification thresholds are presented in Table II. The corresponding confusion matrix (threshold = -0.5) is visualized in Fig. 2.

TABLE II

BASELINE MODEL PERFORMANCE ON VALIDATION SET

| Metric | Value (threshold=0.0) | Value (threshold=-0.5) |
|---|---|---|
| ROC AUC | 0.8591 | 0.8592 |
| Accuracy | 0.7925 | 0.8007 |
| Precision | 0.8257 | 0.6912 |
| Recall | 0.4784 | 0.7268 |
| F1-Score | 0.6058 | 0.7085 |

Fig. 2 Confusion Matrix for the baseline model on the validation set, using a

classification threshold of -0.5

Interpretation of Baseline: The baseline model achieves a

strong ROC AUC score of 0.8592, demonstrating good ranking

ability. With the chosen operating threshold of -0.5, the model

balances precision and recall reasonably well, achieving an F1-

Score of 0.7085 and correctly identifying approximately 73%

of the actual positive links (Recall = 0.7268). Fig. 2 details the

specific counts of true/false positives and negatives at this

threshold.

B. Impact of Negative Sampling Ratio

To assess the sensitivity of the model to the proportion of

negative examples encountered during training, we trained two

additional models, identical to the baseline except for the

negative sampling ratio used during the RandomLinkSplit

stage: one with a ratio of 1.0 (equal positive and negative

training supervision edges) and one with a ratio of 4.0 (four

times as many negative edges). Performance was evaluated on

the validation set using the same -0.5 threshold.

The results are compared in Table III.

TABLE III

PERFORMANCE COMPARISON FOR DIFFERENT NEGATIVE SAMPLING RATIOS (VALIDATION SET, THRESHOLD = -0.5)

| Neg. Sampling Ratio | ROC AUC | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 1.0 | 0.8563 | 0.6968 | 0.6332 | 0.9352 | 0.7551 |
| 2.0 | 0.8592 | 0.8007 | 0.6912 | 0.7268 | 0.7085 |
| 4.0 | 0.8518 | 0.8351 | 0.8293 | 0.2213 | 0.3494 |

• Analysis:

The negative sampling ratio significantly influenced the

model's predictive behavior, particularly the precision-recall

balance, even when evaluated at the fixed -0.5 threshold (Table

III).

Using a ratio of 1.0 resulted in the highest Recall (0.9352)

and F1-Score (0.7551). This suggests that with fewer negative

examples during training, the model learned a representation

space or decision boundary that, when combined with the -0.5

threshold, was highly effective at identifying positive instances,

albeit at the cost of lower Precision (0.6332) and Accuracy

(0.6968).

Increasing the ratio to 2.0 (baseline) improved Precision and

Accuracy compared to 1.0, but substantially reduced Recall and

F1-Score.

Further increasing the ratio to 4.0 drastically shifted the

balance towards Precision (0.8293) and Accuracy (0.8351) but

resulted in very poor Recall (0.2213) and the lowest F1-Score.

This indicates that training with a large excess of negative

samples pushed the model towards being overly conservative

in predicting positive links at this specific evaluation threshold.

Interestingly, the ROC AUC score remained relatively stable

across the ratios (0.85-0.86), peaking slightly at 2.0. This

suggests the model's overall ranking ability wasn't dramatically

altered, but the calibration of its scores (affecting performance

at a fixed threshold) was highly sensitive to the negative

sampling ratio.
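One idealized way to see why the ratio can change calibration without changing ranking is the textbook prior-shift argument: for a classifier trained with cross-entropy to its optimum, using r negatives per positive adds a constant -log(r) to every logit. A constant offset preserves the ordering of scores (and hence AUC) but moves them relative to a fixed threshold. The sketch below only computes that theoretical offset; it is not a measurement from these experiments, and a finite model trained for 50 epochs need not follow it exactly. The function name is hypothetical.

```python
import math


def prior_shift_offset(neg_ratio):
    """Logit offset implied by training with `neg_ratio` negatives per positive."""
    return -math.log(neg_ratio)


offsets = {ratio: prior_shift_offset(ratio) for ratio in (1.0, 2.0, 4.0)}
```

Under this idealization, moving from a ratio of 1.0 to 4.0 lowers every logit by about 1.39, so fewer pairs clear a fixed -0.5 threshold, which is in line with the drop in recall seen in Table III.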

Based on these validation results, the model trained with a

negative sampling ratio of 1.0 yielded the best F1-Score and

Recall at our chosen operating threshold of -0.5. While the ratio

of 2.0 served as our initial baseline configuration, we select the

model trained with the 1.0 ratio as the preferred configuration

due to its superior performance on these key metrics. Therefore,

the final evaluation on the held-out test set will utilize the model

trained with a negative sampling ratio of 1.0.

C. Test Set Performance

Following the analysis on the validation set, the model

configuration trained with a negative sampling ratio of 1.0 was

selected as the preferred model due to its superior F1-Score and

Recall. To obtain an unbiased estimate of its generalization

performance, this final model was evaluated on the held-out test

set, using the same classification threshold of -0.5 determined

during validation.

The performance metrics achieved on the test set are

presented in Table IV.

TABLE IV

FINAL MODEL PERFORMANCE ON TEST SET (NEG. RATIO = 1.0, THRESHOLD = -0.5)

| Metric | ROC AUC | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Value | 0.8531 | 0.6876 | 0.6251 | 0.9372 | 0.7500 |

• Analysis:

The results on the unseen test set confirm the effectiveness

of the selected model configuration. The performance is largely

consistent with the validation results, achieving a high ROC

AUC of 0.8531 and an F1-Score of 0.7500. Notably, the model

maintained its excellent Recall of 0.9372, successfully

identifying the vast majority of positive interactions within the

test data at the chosen threshold. This consistency between

validation and test performance indicates good generalization

and suggests that the model selection process did not

significantly overfit to the validation data.

VI. DISCUSSION

Our experiments demonstrate the viability of employing a

heterogeneous Graph Neural Network, specifically an adapted

GraphSAGE model, for link prediction on the implicit feedback

data characteristic of the Online Retail dataset. The model

achieved strong performance, particularly in identifying

potential customer-product interactions, as evidenced by the

high ROC AUC and Recall scores obtained on the test set.

A key finding relates to the significant impact of the negative

sampling ratio used during training setup. While the overall

ranking ability (AUC) remained relatively stable, the ratio

drastically influenced the precision-recall trade-off at our

chosen operating threshold (-0.5). Training with fewer negative

samples (ratio 1.0) resulted in a model optimized for high

Recall and F1-score, suggesting it learned representations

particularly effective at identifying positive links when

evaluated at a lower decision threshold. Conversely, increasing

the negative samples (ratio 4.0) pushed the model towards

higher precision but severely hampered its ability to recall

actual positive interactions. This highlights the sensitivity of

threshold-dependent metrics to this training parameter and

underscores the importance of tuning it based on the specific

application's goals (e.g., prioritizing recall in recommendation

discovery). Our selection of the 1.0 ratio model reflects a

prioritization of finding most relevant items (high Recall/F1).

The successful application of learnable embeddings without

explicit node features confirms that GNNs can effectively

leverage purely structural information from interaction graphs

for recommendation tasks. The 2-layer GraphSAGE

architecture provided sufficient neighborhood context,

achieving good results without introducing the potential

complexities or over-smoothing issues of deeper models. The

use of a simple dot product classifier proved effective in

translating learned embeddings into meaningful link prediction

scores.

Limitations of this study include the reliance solely on

interaction data; incorporating product or customer features

could potentially enhance performance. The analysis focused

on a static snapshot of the data, neglecting temporal dynamics

which could influence purchase behavior. The cold-start

problem for new users or products was not explicitly addressed

in this link prediction framework.

Despite these limitations, our findings reinforce the potential

of heterogeneous GNNs for implicit feedback

recommendation, particularly demonstrating how training

configurations like negative sampling can be tuned to align

model behavior with desired operational characteristics like

high recall.

VII. CONCLUSION

In this paper, we addressed the task of product

recommendation from implicit feedback using the standard

Online Retail dataset. We presented a methodology

encompassing data preprocessing tailored for transactional

records, construction of a heterogeneous user-item graph, and

the application of a GraphSAGE-based Graph Neural Network

adapted for heterogeneity. By utilizing learnable embeddings

derived solely from the graph structure and framing the

problem as link prediction, our approach demonstrated

considerable effectiveness.

Our experiments yielded strong performance, notably

achieving a ROC AUC score of 0.8531 and an F1-score of

0.7500 on the held-out test set with our selected configuration

(negative sampling ratio 1.0, threshold -0.5). We particularly

highlighted the significant impact of the negative sampling

ratio during training setup on the final precision-recall balance,

emphasizing the need to align this parameter with specific

recommendation goals. These results confirm the viability and

potential of applying heterogeneous GNNs to effectively model

sparse, implicit interaction data common in real-world retail

scenarios.

REFERENCES

[1] Zhang, S., Yao, L., Sun, A., & Tay, Y., “Deep learning based recommender system: A survey and new perspectives,” ACM Computing Surveys (CSUR), 52(1), 1-38, 2019.

[2] Rendle, S., Krichene, W., Zhang, L., & Anderson, J., “Neural collaborative filtering vs. matrix factorization revisited,” Proceedings of the 14th ACM Conference on Recommender Systems, 2020.

[3] Koren, Y., Bell, R., & Volinsky, C., “Matrix factorization techniques for recommender systems,” Computer, 42(8), 30-37, 2009.

[4] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., & Wang, M., “LightGCN: Simplifying and powering graph convolution network for recommendation,” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.

[5] Wu, S., Sun, Y., Zhang, W., Xie, X., & Cui, B., “Graph neural networks in recommender systems: a survey,” ACM Computing Surveys (CSUR), 55(5), 1-37, 2022.

[6] Gao, C., Li, Y., et al., “Graph Neural Networks for Recommender Systems: Challenges, Methods, and Directions,” arXiv preprint arXiv:2207.12204, 2022.

[7] Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., & Yu, P. S., “Heterogeneous graph attention network,” Proceedings of The World Wide Web Conference, 2019.

[8] Hamilton, W. L., Ying, R., & Leskovec, J., “Inductive representation learning on large graphs,” Advances in Neural Information Processing Systems, 30, 2017.

[9] Fey, M., & Lenssen, J. E., “Fast graph representation learning with PyTorch Geometric,” arXiv preprint arXiv:1903.02428, 2019.

[10] Hu, Y., Koren, Y., & Volinsky, C., “Collaborative filtering for implicit feedback datasets,” Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.

[11] Yang, C., Xiao, Y., Zhang, Y., Sun, Y., & Han, J., “Heterogeneous graph representation learning: A survey,” arXiv preprint arXiv:2011.09861, 2020.

[12] Chen, D., Sain, S. L., & Guo, K., “Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining,” Journal of Database Marketing & Customer Strategy Management, 19(3), 197-208, 2012.

[13] Li, Q., Han, Z., & Wu, X. M., “Deeper insights into graph convolutional networks for semi-supervised learning,” Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.

[14] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E., “Neural message passing for quantum chemistry,” International Conference on Machine Learning, PMLR, 2017.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Using Chatbot in E-commerce to Improve Profit:

Artificial Intelligence in practice

Houda EL BOUHISSI, Essaid FERHAT, Naima BOUAGAL

University of Bejaia, Faculty of Exact Sciences, LIMED Laboratory, 06000, Bejaia, Algeria

Abstract— Chatbots based on artificial intelligence have recently opened up substantial business potential. Intelligent machines such as chatbots are increasingly present in everyday life and can replace humans in certain tasks, bringing greater benefits and higher business profits. This article presents a new approach to building an online sales chatbot that provides customer support and increases sales. The chatbot has a modular architecture and uses machine learning algorithms to understand natural language. Preliminary results are promising compared with other state-of-the-art approaches.

Keywords— Artificial intelligence, chatbots, conversational

agents, Machine Learning, NLP.

I. INTRODUCTION

The rise of online commerce over the past two

decades has had a significant impact on society and the

way business is conducted on a global scale. As well as

transforming the retail industry, it has had many

positive effects on both businesses and consumers on a

personal level [1]. The ability to shop online has had a

monumental impact on consumers around the world.

In recent years, users have become more reliant on e-commerce than ever before, and outlets such as Amazon are overtaking established industry giants such as Walmart [2].

Customer needs and expectations are growing by the day. To win customers, businesses must satisfy them and reduce the time and effort they spend obtaining a product, especially in the context of e-commerce.

There are many ways to satisfy customers and increase profits, and chatbots are one of them. A chatbot can bring a remarkable benefit to e-commerce: not only can it chat with customers, but it can also take and process their orders as a human agent would [3].

In the 1990s, researchers started to develop

conversational agents, or chatbots, that could

understand and respond to human speech in natural

language. These early chatbots were restricted in their

functional scope, but they laid the groundwork for the

development of more sophisticated chatbots in the

following decades [4].

Chatbots represent a major technological advance,

profoundly transforming interactions between humans

and digital systems. Their evolution, powered by

constant advances in artificial intelligence, natural

language processing and machine learning, means that

they are becoming increasingly sophisticated tools,

capable of responding to the varied needs of users in an

efficient and personalized way.

One of the key strengths of chatbots lies in their

ability to improve operational efficiency across a wide

range of sectors. Whether it's in customer service,

where they offer fast and accessible 24/7 assistance, or

in healthcare, where they support patient diagnosis and

monitoring, chatbots provide innovative and reliable

solutions. They also contribute to education by offering

virtual tutors, and to e-commerce by improving the

shopping experience through intelligent

recommendations and instant responses.

What's more, their integration with artificial

intelligence makes them particularly flexible and

adaptable. Thanks to natural language processing,

chatbots understand linguistic nuances and provide

tailored responses, making interactions more human.

Their ability to learn from data and past interactions

enhances their accuracy and relevance over time. As a

result, they not only reduce costs for businesses, but

also improve user satisfaction and experience.

However, the use of chatbots is not without its

challenges. Among them, data confidentiality and

security are major concerns. Users often share sensitive

information with chatbots, so strict standards are

needed to ensure that this data is protected. In addition,

ethics in their design and use are essential to avoid bias,

inappropriate responses or manipulation.

Research conducted in the United States indicates

that most individuals (62%) prefer using chatbots for

communication with businesses [5].

In practice, several well-known chatbots exist. For instance, Sephora uses a chatbot to recommend products, while H&M's chatbot suggests outfits that match the user's style. The most widely used chatbot platforms include WhatsApp, Facebook Messenger, Alexa, Google Assistant, and Twitter [6].

Chatbots play a key role in online commerce. They improve the shopping experience and increase customer loyalty in several ways:

• Product recommendations: chatbots suggest items based on user preferences or purchase history.

• Navigation assistance: they guide users in their search for specific products or services.

• Personalized promotions: based on the data collected, chatbots can suggest customized offers.

• Order tracking and returns management: they allow customers to track the status of their purchases or organize a return.

Chatbots are much more than just a technological

tool: they represent an evolution in the way

we interact with digital technology. Their ongoing

development is opening promising prospects in many

areas, with huge potential for improving productivity,

simplifying complex processes and making services

more accessible.

With responsible design and judicious integration,

chatbots have the power to transform our societies by

making technologies smarter, more efficient and more

human.

With the advent of artificial intelligence, AI-based

chatbots have emerged. These chatbots use dialogue

systems to enable natural language conversations with

users using speech, text or both. AI-based chatbots are

more effective and relevant, as they can simulate

human face-to-face communication [7].

The aim of this paper is twofold. First, it introduces chatbot technology, its types, and its impact on e-commerce. Second, it proposes a new approach to improve e-commerce benefits using a new AI-based chatbot.

We focus on the development of natural language

conversations, a core feature of AI chatbots, to

facilitate a more flexible exchange of information

between humans and chatbots.

The rest of the paper is organized as follows: In Sec.

2, we introduce chatbot technology, its types, and its

impact on e-commerce. Sec. 3 overviews the main

related work in this area. Sec. 4 presents the proposed

approach in detail. Finally, Sec. 5 concludes the paper

and briefly discusses future work.

II. PRELIMINARIES

The term "chatbot" combines the words "chat" (instant messaging) and "bot" (automatic or semi-automatic software agent). It refers to a program capable of interacting with human users in the form of textual conversation [8].

An AI chatbot is a program designed to simulate a

conversation with a human. It can analyze queries

entered by users and provide automatic responses, all

without human intervention [9].

These bots are designed to enhance their

understanding and responses through the application of

advanced AI techniques, such as natural language

processing (NLP), machine learning (ML), and deep

learning (DL), which facilitate continuous improvement

in their capabilities.

These technologies enable the chatbot to engage in coherent

discourse, manage intricate requests, offer bespoke

recommendations, and perpetually enhance its cognitive

capabilities.

The chatbot typically appears as a small messaging window on the Web or in mobile applications. Also known as a conversational agent, a chatbot is used to respond to user queries about a product or service.

In general, chatbots function as the primary interface for customer service. They provide prompt responses to customers, resolve basic issues, and help companies manage substantial volumes of requests.

Furthermore, they serve to enhance customer

experience by offering a fast, personalized service. The

integration of chatbots into digital marketing has also

enabled brands to interact with their customers in a

more interactive and engaging way.

Generally, we distinguish three types of chatbot [10]:

• Simple chatbot: it has limited capacity and is programmed to respond to specific questions according to predefined scenarios, following a linear conversation tree.

• Intelligent chatbot: it uses AI to understand natural language and learns to address more intricate inquiries by aggregating and analyzing data from prior interactions.

• Hybrid chatbot: it combines the strengths of artificial intelligence and human intervention. If the AI tool fails to respond to the user's request, the responsibility is transferred to a human agent.
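
As an illustration, the hand-over logic of a hybrid chatbot can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the scripted answers, the function `toy_ai_model` (a stand-in for an intelligent chatbot that returns an answer with a confidence score), and the threshold of 0.6 are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical scripted answers for the simple (rule-based) tier.
SCRIPTED_ANSWERS = {
    "return policy": "You can request a return from your order page.",
    "payment": "We accept card payments and cash on delivery.",
}


def simple_bot(message: str) -> Optional[str]:
    """Simple chatbot: answer only when a scripted key phrase is present."""
    text = message.lower()
    for phrase, answer in SCRIPTED_ANSWERS.items():
        if phrase in text:
            return answer
    return None


def toy_ai_model(message: str) -> Tuple[str, float]:
    """Stand-in for an intelligent chatbot: returns (answer, confidence)."""
    if "track" in message.lower():
        return "Your order is on its way.", 0.9
    return "Sorry, I am not sure I understood.", 0.2


def hybrid_reply(
    message: str,
    ai_model: Callable[[str], Tuple[str, float]] = toy_ai_model,
    threshold: float = 0.6,
) -> Tuple[str, str]:
    """Hybrid chatbot: try the bot first, hand over to a human if it fails."""
    scripted = simple_bot(message)
    if scripted is not None:
        return "bot", scripted
    answer, confidence = ai_model(message)
    if confidence >= threshold:
        return "bot", answer
    # The AI could not answer reliably: transfer responsibility to a human.
    return "human", "Transferring you to a human agent."


if __name__ == "__main__":
    for msg in ["What is your return policy?", "Can I track my parcel?", "My invoice is wrong"]:
        print(msg, "->", hybrid_reply(msg))
```

The first element of the returned pair indicates who answers; in a deployed system the "human" branch would open a ticket or a live chat rather than return a fixed sentence.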

Our research focuses on AI-based chatbots. An AI chatbot involves several components [11]:

• NLP: the chatbot employs NLP techniques to analyze user queries and extract their meaning.

• Information processing: the chatbot matches the query with predefined intents to ascertain the user's objective.

• Response generation: depending on the intent detected, the chatbot derives an adapted response from a knowledge base or a language generation engine.

• Conversational interaction: the chatbot maintains a conversational context by keeping track of previous exchanges.

• Learning: conversations are recorded to continuously train the underlying AI model and improve its performance.

The architecture of an AI chatbot is designed to

efficiently handle many simultaneous conversations

with users, with the bot able to be integrated into a

multitude of channels, including websites, mobile

applications, social networks, and SMS. AI chatbots are

ideal for a wide range of applications requiring

intelligent conversational support for the user.

Within the context of customer service, for instance,

these bots can respond to a high volume of frequently

asked questions on a 24-hour basis, executing routine

operations on demand, such as updating addresses or

tracking orders, and qualifying complex requests to

direct customers to the appropriate personnel. This

enables teams to be relieved of repetitive tasks,

allowing them to focus on dedicated cases that

necessitate human expertise.

In the context of sales and marketing, the chatbot

plays a pivotal role in stimulating and facilitating the

purchasing process. It does so by recommending

suitable products at opportune moments, assisting

customers in their information searches, proposing

highly personalized promotions, and helping them

complete their orders. The chatbot's intuitive

conversation capabilities enable it to discern latent

customer needs and offer the most relevant products or

services.

In the context of industry technology, the AI chatbot

has emerged as a pivotal tool for employees,

functioning as a daily ally. It is capable of

autonomously resolving numerous technical issues,

providing step-by-step guidance to users through

internal procedures, and addressing a wide range of

queries pertaining to the company, employee benefits,

leave application processes, and related topics. This

capacity to alleviate the workload of IT teams is a

significant benefit of the chatbot.

In the healthcare domain, the AI chatbot plays a

crucial role in informing patients about their medical

conditions, available treatments, and the progression of

examinations. It facilitates appointment booking with

healthcare professionals, offers adapted preventative

guidance, ensures ongoing health parameter

monitoring, and reminds individuals to take prescribed

medication [12].

The potential for integrating AI chatbots is

extensive, with their ability to enhance user experiences

across various sectors by providing individualized

conversations and recommendations.

III. RELATED WORKS

One of the major challenges in developing an automated customer support system is the categorization of natural language, a subject that has been the focus of numerous studies. We therefore consider which techniques are appropriate to ensure the functionality of a chatbot and how they should be applied correctly.

In this section, we present the main works on chatbot-based e-commerce.

The authors in [13] built a chatbot for a university

shopping mall that automates user interactions and

improves customer experience. The problem addressed

by the authors is the lack of automation in customer

service for e-commerce at university shopping malls,

making interaction with customers slow and inefficient.

The developed chatbot relies on a rules-based system,

where pre-defined answers are generated based on

questions asked by users. The chatbot assists with

product research, provides information on promotions,

and answers common queries. The chatbot reduced

response times and increased interactions between users

and the platform.

Another approach, proposed in [14], explores

the implementation of chatbots in online commerce and

their role in open innovation, where companies integrate

external ideas to improve their services. Furthermore,

the paper investigates how chatbots can be integrated

into online commerce to improve user experience,

optimize customer-business interactions and increase

conversion rates. It focuses on the challenges of

adopting chatbots, such as their personalization,

efficiency, and their role in open innovation. The built

chatbot automatically answered 85% of the most

frequently asked questions. Customer service response

times reduced by 20% increase in online sales.

In addition, the authors in [15] suggest a knowledge-

based intelligent conversational agent system

architecture to support customer services in e-

commerce. The proposal provides instant, personalized

answers to customer queries, reducing waiting times

and improving overall satisfaction. The proposed

chatbot recommends products to customers, offering a

more personalized shopping experience and increasing

the chances of a sale. Furthermore, the proposed work

automates repetitive customer service tasks, such as

frequent questions and support processes, freeing up

time for higher value-added tasks. Customers benefit

from faster, more personalized assistance, enhancing

their online shopping experience.

The work in [16] presents an e-commerce website integrated with a recommendation system, a chatbot, and reverse image search to enhance the user experience. The problem addressed is efficient product search on e-commerce sites: users can find similar products via image search and receive personalized recommendations. The authors developed a system architecture combining recommendation algorithms based on user preferences, a chatbot to answer customer questions, and a reverse image search engine that lets users find products by uploading an image. The developed system improved customer satisfaction by offering more precise recommendations and faster image-based product searches.

The proposal in [17] aims to improve customer satisfaction, optimize operations, and increase customer service efficiency. It explores the integration of chatbots in e-commerce to improve customer experience and increase sales, using web crawling, NLP, and Information Retrieval (IR).

Furthermore, the authors in [18] introduce BOON, a

neural search engine for cross-modal information retrieval

between text and images. The challenge is to create a

search engine capable of understanding complex

queries combining images and text to provide relevant

results. The authors use a neural network based on

visual transformers and language models to process

textual and visual data together. The BOON model has

been trained on large databases containing text and

images to improve the accuracy of search results.

BOON performed better than traditional search

engines, providing more accurate results for queries

involving images and text.

The authors in [19] present a conditional generative

chatbot using a Transformer model, to generate

coherent and contextually adapted responses in

conversations. The problem addressed is the need to

improve the generation of responses in chatbots so that

they are relevant and consistent over the long term in a

conversation. The paper proposed a combination of

Conditional Wasserstein Generative Adversarial

Networks (cWGAN) with a Transformer model. The

generator is a complete Transformer model that

produces responses. The discriminator uses only the

Transformer's encoder to evaluate the reality or

"falsity" of the generated responses. The proposed

model showed promising results in terms of generating

more consistent responses, outperforming traditional

chatbot models.

The main problem addressed in [20] is the lack of immediate human assistance in e-commerce, a gap that can lead to customer dissatisfaction and lower online conversions. The proposed chatbot seeks to fill this gap by providing a fluid user experience close to human interaction, with the aim of improving user experience, increasing sales, and understanding customer intentions. It uses artificial intelligence to detect customer intentions and identify key information in their messages; the requests are then routed to the appropriate modules to generate relevant responses. The proposal uses AIML (Artificial Intelligence Markup Language) for natural language processing and deep neural networks (DNN) to enhance the accuracy and relevance of responses.

Related works comprise retrieval-based models,

generative models, and hybrid approaches. Retrieval-

based chatbots use ML or DL to select the most

appropriate response from a predefined set. They are

efficient, consistent, and safe but limited to existing

responses and unable to generate novel replies.

Generative models that use DL architecture, such as

transformers (e.g. BERT) and large language models,

can produce new, contextually relevant responses and

handle complex, open-domain conversations.

However, they require large amounts of training data,

advanced computing resources, and precise tuning to

avoid producing incorrect or inaccurate results. Hybrid

models combine both techniques, using retrieval for

precision and generation for flexibility, offering a more

balanced and adaptive conversational experience, but at

the cost of increased system complexity.

In a nutshell, a variety of chatbot approaches in use have been presented. However, the selection of the most appropriate approach is contingent on the client's requirements and the prevailing technical constraints. AI-based solutions (e.g. Transformers) have shown superior customization capabilities; nevertheless, they can be costly and complex to implement.

Hybrid approaches seem to offer a satisfactory

compromise, integrating the benefits of multiple

techniques to deliver precise recommendations and

natural interactions.

Taking advantage of the related work, we propose a

new AI chatbot using ML algorithms. In the next

section, we will present our proposal in detail.

IV. PROPOSED APPROACH

Building an efficient e-commerce application based on an AI chatbot is a demanding task that involves different technologies. How, then, can we choose the best technique to ensure the functionality of the chatbot while avoiding the challenges posed by the different methodologies available?

To answer this question, we propose a new model

based on ML and NLP, as shown in Figure 1, which

includes different components.

Fig. 1 System architecture

The proposal includes two components:

• The first component is the e-commerce application, hosted on a remote server. This application is hybrid: it can be browsed on a desktop or on a smartphone.

• The second component is the AI chatbot, hosted on the same server.

Once the use case, type, and channels of the chatbot are known, the platform on which to build it can be selected.

Our proposal improves business benefits: it makes the customer experience more effective and personalized, increases the share of requests that receive an answer, and helps improve customer satisfaction.

Dealing with customers is very time-consuming and requires people to be always available, which increases the cost of doing business. A well-developed chatbot, however, makes it easy to communicate with customers in their homes. It acts as a personal shopper, helping customers find what they need.

The process of generating a response, illustrated in Figure 2, involves several phases:

Fig. 2 Chatbot system management

A. User request retrieval:

At the beginning, the chatbot greets the user with a welcome message and waits for a response. The user then formulates a request in the form of text. The query is checked against the queries in the question/answer repository prepared in advance. If the request matches a query in the repository, the system sends the corresponding response directly to the client. Otherwise, the request is processed with BERT (following step).
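
The retrieval phase can be sketched as a lookup with a fallback. The sketch below is illustrative only: the repository content, the `normalize` helper (which makes matching insensitive to case, punctuation, and spacing), and the placeholder `process_with_model` standing in for the BERT-based step are assumptions, since the matching rule actually used is not detailed here.

```python
import string
from typing import Tuple


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Hypothetical question/answer repository prepared in advance.
REPOSITORY = {
    normalize("What are your delivery times?"): "Delivery usually takes a few days.",
    normalize("How can I pay?"): "We accept card payments and cash on delivery.",
}


def process_with_model(query: str) -> str:
    """Placeholder for the BERT-based processing step (not implemented here)."""
    return "[forwarded to the BERT pipeline] " + query


def handle_request(query: str) -> Tuple[str, str]:
    """Answer from the repository when possible, otherwise fall back to the model."""
    key = normalize(query)
    if key in REPOSITORY:
        return "repository", REPOSITORY[key]
    return "model", process_with_model(query)


if __name__ == "__main__":
    print(handle_request("how can I pay ?"))
    print(handle_request("Do you sell red dresses?"))
```

Exact matching after normalization keeps the fast path cheap; a request phrased differently from every stored query simply falls through to the model.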

B. User request processing:

In this phase, the customer sends a message, generally a text, to the bot. This message is processed with BERT (Bidirectional Encoder Representations from Transformers) [21], a DL model for text processing. Note that registered questions and responses have been prepared in the database; they serve as interaction models that provide a quick response to the user.

The use of BERT transformed text processing by

enabling models to deeply understand contextual

nuances.

BERT's architecture considers both the left and right

context of a word simultaneously, making it highly

effective for tasks such as sentiment analysis, question

answering, named entity recognition and text

summarization.

By leveraging its pre-trained language

understanding, BERT can be fine-tuned to specific

datasets for domain-specific applications such as

medical text analysis or legal document processing. Its

versatility and contextual understanding have set new

standards in many NLP tasks, improving both accuracy

and efficiency.

In this step, the processing of the user request with BERT involves different phases to deeply understand its context and meaning. These phases typically work as follows:

• Tokenization: the user's request is first broken into tokens (words or sub-word pieces) and converted into embeddings that BERT can process.

• Contextual Encoding: unlike traditional models, BERT reads the entire sentence bidirectionally, capturing the full context of each word based on its surroundings. This allows it to understand nuances like intent, entities, and sentiment.

• Feature Extraction: the output of BERT is a set of high-dimensional vectors representing each token with contextual meaning. The vector of the special [CLS] token often represents the overall sentence meaning. This step is accompanied by a selection of the best features. For this, we use the bioinspired Grey Wolf Optimization (GWO) algorithm [22]. GWO is a nature-inspired metaheuristic optimization algorithm that mimics the leadership hierarchy and hunting strategy of

grey wolves in nature. It is widely used to optimize complex problems, including tuning ML and DL models.

• Intent Detection: the sentence vector is fed into a classifier to determine the user's intent (for example, buying a dress). In this phase, the bot tries to understand incoming messages automatically, based on a history database containing frequent messages and their responses. The bot scans the text and matches it with predefined rules or AI-based algorithms to pick the best response. Classifiers assign each message to one of the customers' potential intents: one customer might want to check the status of an order, while another might want to check the vouchers she has, which are two different types of queries.

• Response Generation: based on the detected intent and entities, the chatbot either retrieves a suitable response (retrieval-based approach) or uses another model to generate a reply (generative approach). The bot sends a response back within seconds, which makes things fast and convenient for customers. Responses are produced in two ways. First, the bot consults the history database where all messages are stored: if the incoming message is equivalent in meaning to a previously received message, the response is taken automatically from this database. Otherwise, the bot generates a new response on the fly, which is stored in the history database for further use.
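
The tokenization phase listed above can be illustrated with a simplified version of WordPiece, the sub-word scheme used by BERT tokenizers: each word is split greedily into the longest vocabulary pieces, continuation pieces carry the prefix `##`, and the sequence is wrapped in the special tokens `[CLS]` and `[SEP]`. The tiny vocabulary below is hypothetical, and the sketch omits punctuation handling and the mapping from tokens to embedding indices.

```python
from typing import List, Set

# Hypothetical toy vocabulary; a real BERT vocabulary has tens of thousands of entries.
VOCAB = {"buy", "a", "dress", "##es", "track", "##ing", "order"}


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(sentence: str, vocab: Set[str]) -> List[str]:
    """Whitespace split, WordPiece per word, plus the special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


if __name__ == "__main__":
    print(tokenize("Buy dresses", VOCAB))
    print(tokenize("Tracking order", VOCAB))
```

Splitting into sub-word pieces lets the model handle inflected or rare product names without an unbounded vocabulary.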

For commercial applications, responses tend to be

pre-defined to ensure that customers receive a

consistent service, and that the bot does not respond in

unintended ways that could lead to public relations

failures.

In essence, BERT enhances a chatbot’s ability to

understand user input by providing rich, context-aware

text representations, crucial for accurate intent

detection and natural interaction.
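
To make the link between sentence vectors and intent detection concrete, the following sketch classifies a vector by its cosine similarity to per-intent centroids. The classifier used in this work is not specified, so the nearest-centroid rule is an assumption chosen for brevity; the three-dimensional toy vectors stand in for BERT sentence embeddings, and the intent names `order_status` and `check_vouchers` are hypothetical labels for the two query types mentioned earlier.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit(examples: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Compute one centroid per intent from labelled sentence vectors."""
    return {intent: centroid(vectors) for intent, vectors in examples.items()}


def predict(model: Dict[str, Vector], vector: Vector) -> str:
    """Return the intent whose centroid is most similar to the input vector."""
    return max(model, key=lambda intent: cosine(model[intent], vector))


# Toy training vectors standing in for BERT sentence embeddings.
TRAIN = {
    "order_status": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "check_vouchers": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
MODEL = fit(TRAIN)

if __name__ == "__main__":
    print(predict(MODEL, [0.7, 0.2, 0.0]))
    print(predict(MODEL, [0.1, 0.7, 0.3]))
```

In practice the centroid rule would usually be replaced by a trained classification layer, but the input and output of the step are the same: a sentence vector in, an intent label out.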

Preliminary results show that BERT with GWO-based feature reduction improves accuracy over the classic BERT approach. However, the response time is slightly longer.

The novelty of our approach is twofold. First, we use a prepared database of request/response pairs to deal quickly with user requests. Second, we enhance BERT with GWO, aiming for high-performance intent classification or dialogue act recognition in a chatbot.
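
How GWO can drive feature selection may be sketched as follows. The exact coupling between GWO and BERT is not detailed here, so this is one common simplified variant and not the authors' implementation: each wolf holds a continuous position in [0, 1] per feature, a feature is kept when its coordinate exceeds 0.5, and the three best solutions found so far (alpha, beta, delta) guide the pack. The fitness function `toy_fitness` and the `RELEVANCE` scores are hypothetical; in a real setting the fitness would be the validation accuracy of the intent classifier on the selected embedding dimensions.

```python
import random
from typing import Callable, List, Tuple

Mask = List[bool]

# Hypothetical usefulness of each feature; stands in for a real validation score.
RELEVANCE = [0.9, 0.8, 0.05, 0.02, 0.1, 0.01]


def toy_fitness(mask: Mask, penalty: float = 0.2) -> float:
    """Reward useful features and charge a fixed penalty per selected feature."""
    gain = sum(r for r, keep in zip(RELEVANCE, mask) if keep)
    return gain - penalty * sum(mask)


def to_mask(position: List[float]) -> Mask:
    """Threshold a continuous position in [0, 1] into a keep/drop mask."""
    return [p > 0.5 for p in position]


def gwo_select(fitness: Callable[[Mask], float], n_features: int,
               n_wolves: int = 8, n_iter: int = 30, seed: int = 0) -> Tuple[Mask, float]:
    """Thresholded Grey Wolf Optimization for feature selection (maximizes fitness)."""
    rng = random.Random(seed)
    # One wolf starts with every feature kept; the others start at random.
    wolves = [[1.0] * n_features] + [
        [rng.random() for _ in range(n_features)] for _ in range(n_wolves - 1)
    ]
    leaders: List[Tuple[float, List[float]]] = []  # alpha, beta, delta (best so far)
    for it in range(n_iter):
        scored = [(fitness(to_mask(w)), list(w)) for w in wolves]
        leaders = sorted(leaders + scored, key=lambda s: s[0], reverse=True)[:3]
        a = 2.0 * (1.0 - it / n_iter)  # exploration factor, decreases from 2 towards 0
        for w in wolves:
            for d in range(n_features):
                pulls = []
                for _, leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])
                    pulls.append(leader[d] - A * D)
                # New coordinate: average of the leader-guided moves, kept in [0, 1].
                w[d] = min(1.0, max(0.0, sum(pulls) / len(pulls)))
    scored = [(fitness(to_mask(w)), list(w)) for w in wolves]
    best_score, best_pos = max(leaders + scored, key=lambda s: s[0])
    return to_mask(best_pos), best_score


if __name__ == "__main__":
    mask, score = gwo_select(toy_fitness, len(RELEVANCE))
    print("selected features:", [i for i, keep in enumerate(mask) if keep], "fitness:", score)
```

Because the all-features solution is part of the initial pack and the best solutions are retained, the returned mask is never worse, under the chosen fitness, than using every feature. The extra fitness evaluations are also consistent with the longer processing time noted above.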

V. CONCLUSION

In this paper, we introduced chatbot technology, its types, and its impact on e-commerce, and gave details about AI chatbots. Furthermore, we proposed a new approach for an e-commerce application based on an AI chatbot to improve e-commerce benefits.

Our approach uses BERT with GWO to enhance performance. In future work, we intend to enhance our proposal and reduce request processing time.

Finally, we conclude that a technique or methodology should not be selected before a thorough examination of the functional requirements of the chatbot in question, since each technique has inherent advantages and limitations and should be chosen so as to make the most of its advantages.

REFERENCES

[1] Shahriari, S., Mohammad reza, S., & gheiji, S. (2015). E-

Commerce And It Impacts on Global Trend And Market.

International Journal of Research -Granthaalayah, 3(4),49–55.

https://doi.org/10.29121/granthaalayah.v3.i4.2015.3022.

[2] Akbar, R., & Madany, Z. (2021). Impact of E-Commerce in

Industry. International Journal of Research and Applied Technology,

1, 59–64. https://doi.org/10.34010/injuratech.v1i2.5914.

[3] Misischia, C. V., Poecze, F., & Strauss, C. (2022). Chatbots in

customer service: Their relevance and impact on service

quality. Procedia Computer Science, 201, 421-428.

[4] Kooli, C. (2023). Chatbots in Education and Research: A Critical

Examination of Ethical Implications and Solutions. Sustainability,

15, 5614. https://doi.org/10.3390/su15075614.

[5] A. Miklosik, N. Evans, and A. Qureshi, "The Use of Chatbots in

Digital Business Transformation: A Systematic Literature

Review," IEEE Access, vol. 9, pp. 106530-106539, 2021, doi:

10.1109/ACCESS.2021.3100885.

[6] Adamopoulou, E., & Moussiades, L. (2020). An Overview of

Chatbot Technology. Artificial Intelligence Applications and

Innovations: 16th IFIP WG 12.5 International Conference, AIAI

2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part

II, 584, 373–383. https://doi.org/10.1007/978-3-030-49186-4_31.

[7] Athota, L., Shukla, V. K., Pandey, N., & Rana, A. (2020, June).

Chatbot for healthcare system using artificial intelligence. In 2020

8th International conference on reliability, infocom technologies and

optimization (trends and future directions) (ICRITO) (pp. 619-622).

IEEE.

[8] Wibowo, B., Clarissa, H., & Suhartono, D. (2020). The

Application of Chatbot for Customer Service in E-Commerce.

Engineering, Mathematics and Computer Science (EMACS) Journal,

2, 91–95. https://doi.org/10.21512/emacsjournal.v2i3.6531.

[9] Zhang, J., Oh, Y. J., Lange, P., Yu, Z., & Fukuoka, Y. (2020). Artificial Intelligence Chatbot Behavior Change Model for Designing Artificial Intelligence Chatbots to Promote Physical Activity and a Healthy Diet: Viewpoint. Journal of Medical Internet Research, 22(9), e22845. https://doi.org/10.2196/22845.

[10] M. Rakhra, A. Singh, and P. K. Reddy, "Comparative Analysis

of Chatbot Architectures for E-Commerce Applications," IEEE

Access, vol. 9, pp. 13445-13460, 2021, doi:

10.1109/ACCESS.2021.3116878.

[11] R. Jaan, L. Wei, and M. Chen, "Architectural components of AI

chatbots for natural language processing," in Proc. IEEE Int. Conf.

Comput. Intell. Virtual Environ. Meas. Syst., 2012, pp. 112-117, doi:

10.1109/CIVEMSA.2012.6297142.

[12] Ma, R., Cheng, Q., Yao, J., Peng, Z., Yan, M., Lu, J., ... & Zhao,

C. (2025). Multimodal machine learning enables AI chatbot to

diagnose ophthalmic diseases and provide high-quality medical

responses. npj Digital Medicine, 8(1), 64.

[13] Oguntosin, V., & Olomo, A. (2021). Development of an E-

Commerce Chatbot for a University Shopping Mall. Applied

Computational Intelligence and Soft Computing, 2021, Article ID

6630326.

[14] Illescas-Manzano, M. Á., & López, M. Á. (2019).

Implementation of Chatbot in Online Commerce, and Open

Innovation. Journal of Open Innovation: Technology, Market, and

Complexity, 5(2), 125.

[15] Ngai, E. W. T., & Lee, M. M. S. (2021). An Intelligent

Knowledge-based Chatbot for Customer Service. Expert Systems

with Applications, 186, 115611.

[16] Badave, P., Bhomaj, B., Bindu, B., Shivarkar, R., & Dhavase, N.

(2022). Ecommerce Website with Recommendation System

Including Chatbot and Reverse Image Search. International Journal

for Research in Applied Science & Engineering Technology

(IJRASET), 10(9), 46904–46908.

[17] Hossain, M., Habib, M., Hassan, M., Soroni, F., & Khan, M. M.

(2022). Research and Development of an E-commerce with Sales

Chatbot. In 2022 IEEE World AI IoT Congress (AIIoT) (pp. 483–

488). IEEE.

[18] Gong, Y., & Cosma, G. (2023). BOON: A Neural Search Engine

for Cross-Modal Information Retrieval. Proceedings of the 1st

International Workshop on Deep Multimodal Information Retrieval

(MMIR '23), 1–5.

[19] Esfandiari, N., Kiani, K., & Rastgoo, R. (2023, June 3). A

Conditional Generative Chatbot using Transformer Model. arXiv.org.

https://arxiv.org/abs/2306.02074.

[20] Shirkande, S. T., Patil, S. S., Sawant, S. S., & Ghule, S. B.

(2024). Development of an E-Commerce Sales Chatbot. ICEST-

2K24: International Conference on Engineering, Science and

Technology. International Journal of Scientific Research in Science,

Engineering and Technology, Print ISSN: 2395-1990, Online ISSN:

2394-4099.

[21] Koroteev, M. V. (2021). BERT: a review of applications in

natural language processing and understanding. arXiv preprint

arXiv:2103.11943.

[22] El Bouhissi, H., Ziane, A., Rahmani, L., Medbal, M., & Kostiuk,

M. (2023). RF-PSO: An Optimized Approach for Diabetes

Prediction. ICST 2023, 227–238.


Predicting Fire Forest in Algeria : A new Approach

Houda EL BOUHISSI, Naima ILLOUL

University of Bejaia, Faculty of Exact Sciences, LIMED Laboratory, 06000, Bejaia, Algeria

Abstract— Forest fires have emerged as a major concern,

drawing international attention—especially in Algeria. They

are increasingly recognized by the global community as one of

the most critical security challenges of our time. This study

examines the current protective measures in place to combat

these fires and evaluates their effectiveness in preserving

the country’s environment from devastating damage.

Numerous forecasting methods exist, with those leveraging

artificial intelligence—particularly machine learning and

related technologies—being the most widely used. These

AI-driven techniques have led to the development of

adaptive and reliable systems across various domains, especially

in predictive modeling. In this work, we apply such methods

to forecast forest fires. The aim of this paper is to introduce

a novel hybrid approach that combines machine learning

with bioinspired algorithms to enhance forest fire prediction.

Experimental results demonstrate that integrating bioinspired

algorithms significantly improves the performance of

machine learning models.

Keywords— machine learning, Kaggle, datasets, forest fire

prediction, logistic regression.

I. INTRODUCTION

Forest fires are among the most dangerous natural disasters in the world. They cause catastrophic losses to forest ecosystems and pose a serious threat to human safety and property.

Forest fires can cause devastating damage to ecosystems,

animals, and human habitats. They destroy vast areas of

forest, resulting in the loss of biodiversity and natural

habitats. Many animals are killed or displaced, and

endangered species are under even greater threat. Algeria is one of the countries most affected by these disasters each year.

The considerable risk associated with these events has

led to significant concern among stakeholders, who are

questioning the effectiveness of protection measures against

these powerful fires and their ability to safeguard the

country’s environment.

Forest fires also disrupt local economies and damage homes and infrastructure. Communities often suffer emotionally

and financially from the loss of property and livelihoods.

Recovery from such disasters takes years. It also requires

significant resources. Preventive measures and awareness are

crucial to reducing the frequency and severity of forest fires.

Despite the efforts made by the protection services to avoid

them, this problem remains a major risk for the country’s

environment and the safety of its population. The damage

and danger left behind by these fires worry officials and

associations in the country who are trying to find

immediate solutions to put an end to this disaster by

providing all the necessary equipment. Experience over the years shows that, despite the immediate intervention of the protection services, forest fires still cause significant damage, and the country remains imperiled by them.

For this reason, building a forest fire prediction system

seems like a very good solution to prevent risks and

reduce the damage.

The aim of this paper is to propose a hybrid approach to predict forest fires in Algeria using Artificial Intelligence techniques. The core of the study is a prediction system that classifies the possibility of forest fires into two categories (fire and non-fire). Our objective is to implement an efficient prediction system based on supervised machine learning combined with bioinspired algorithms, and to report its results.

The rest of the paper is organized as follows. Section 2 gives an overview of the most significant approaches to forest fire prediction. Section 3 presents our classification approach in detail. Section 4 presents an empirical study of the proposed approach to assess its performance and efficiency. Finally, Section 5 concludes the paper and outlines opportunities for future work.

II. RELATED WORKS

In the context of data analysis, certain methodologies employ machine learning algorithms to make predictions. Other approaches use artificial neural networks, applying deep learning to improve the precision of predictions.

In the following, we will present the main works.

The authors in [1] present a predictive model based on a decision tree for forest fire prediction in Algeria. The data were collected from two regions of northern Algeria: Sidi Bel-Abbes and Bejaia. Three meteorological attributes that influence fire occurrence are used, namely temperature, relative humidity, and wind speed. The results show that the decision tree is suitable for this purpose, since it performs well and can be translated into a rule-based model.

Another approach is proposed in [2]. The authors use historical fire occurrence data from the Fire Information for Resource Management System. Meteorological and topographic variables are then derived and processed to create high-resolution maps. These maps serve

as an effective decision-support tool for analyzing fire

behavior.

The proposal of [3] consists of creating a system that integrates weather data. An exploratory analysis was conducted, followed by preprocessing aimed at eliminating noisy data and converting categorical variables into numerical ones, thereby enhancing the clarity of the dataset.

The regression techniques employed for prediction purposes

include Random Forest, Decision Trees, Support Vector

Regression, and Naive Bayes.

The authors in [4] proposed a novel approach using logistic regression to predict forest fire risk in the Lijiang region.

This approach makes it possible to assess the influence

of various factors on the study subject, such as

topography (altitude, slope, orientation), vegetation and

weather conditions (precipitation, temperature, wind,

humidity).

In addition, the authors of [5] propose a parallel SVM method for reliable forest fire prediction. The data consist of weather data from an Indian region. This type of solution can help detect fires before they destroy the whole forest, and it simplifies the prediction of these fires.

An interesting approach is proposed in [6], which consists of a fire prediction system named Agni that uses satellite images. By integrating artificial intelligence, through supervised learning of neural networks, with satellite remote sensing technology, Agni optimizes the use of satellite images for forecasting high-risk fire areas. The

model has demonstrated consistent performance through

extensive evaluations.

Several other papers describe forest fire prediction methods in detail and report interesting results.

The reviewed works present a variety of approaches to

forest fire prediction, ranging from simple decision trees to

advanced AI-based systems. Some methods focus on

using basic meteorological data, offering easy

interpretation but limited accuracy due to the

exclusion of other important factors like vegetation or

topography.

Others incorporate a wider range of variables and produce

high-resolution risk maps, which are valuable for planning

but may lack real-time responsiveness.

Different studies emphasize data preprocessing and model

comparison, yet they often overlook a clear analysis of

performance differences.

While logistic regression allows for understanding factor

influence, it may not capture complex patterns effectively.

More recent approaches integrate satellite imagery and AI,

showing promising results but requiring significant

computational resources.

Overall, the works complement each other, but shared challenges remain, such as ensuring model generalization, balancing complexity with usability, and achieving timely predictions.

The method used in this study combines logistic regression (LR) for prediction with Particle Swarm Optimization (PSO) for feature selection.

We aim to implement an efficient and useful prediction system based on machine learning. Due to the importance of the data features, we use the logistic regression algorithm.

Next, we present our approach in detail.

III. PROPOSED APPROACH

The methodology proposed in this study aims to predict

forest fire in Algeria using a hybrid approach which is a

combination of LR and PSO.

The system classifies forest regions as either "fire" or "non-fire" risk, based on meteorological conditions.

The system architecture is presented in figure 1 and

involves four steps.

The first step involves data collection.

The second step concerns data preprocessing and includes several phases (cleaning, transformation, and splitting).

The third step is feature selection using PSO.

Finally, the last step is the prediction process.

In the following, we describe these steps in detail.

Fig. 1. System architecture.



Data Collection :

The process begins with gathering relevant data. In this

research, meteorological and environmental data were

collected from two forested regions in northern Algeria:

Bejaia and Sidi-Bel-Abbes [1].

These regions were selected due to their historical

vulnerability to forest fires, providing valuable and relevant

data for the study.



Data Preprocessing

An initial analysis was performed to understand the structure

and characteristics of the dataset. This included examining

statistical distributions, identifying missing values, and

detecting potential outliers.

A key component of this step is the creation of a heat

map showing the correlation coefficients between variables,

which helped to reveal strong relationships between certain

features and the target variable (fire occurrence).

Before training the model, the data underwent several preprocessing phases:



Data Cleaning: Managing missing values, correcting

inconsistencies, and filtering out noisy or irrelevant

data.



Data Transformation: Converting categorical

variables into numerical formats (e.g., using one-hot

encoding), normalizing or standardizing continuous

features, and performing dimensionality reduction if

needed.



Splitting the Dataset: The processed dataset was then divided into two subsets. The training set is used to train the model by allowing it to learn patterns from historical data. The test set is used to evaluate the model's performance on unseen data, ensuring its ability to generalize well.
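The splitting and standardization phases described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the 50/50 ratio in the usage note, and the choice to stratify by class are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev

def stratified_split(rows, labels, test_ratio=0.2, seed=0):
    """Split sample indices into train/test while keeping the class ratio roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fit_standardizer(rows):
    """Compute per-column mean and standard deviation on the training rows only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

Fitting the standardizer on the training rows only, and then applying it to the test rows, keeps information from the test set out of the training process.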



Feature Selection:

Particle Swarm Optimization (PSO) is a bio-inspired

optimization algorithm used to find optimal solutions in

complex search spaces [7]. Applied to the forest fire dataset [1], PSO can be used to select the most relevant features (e.g., temperature, humidity, wind, rain) that contribute significantly to predicting fire occurrences or burned areas (figure 2).

Fig. 2. How PSO works [8].

Each particle in the swarm represents a candidate feature

subset, encoded as a binary vector indicating which features are

included. The fitness of each particle is evaluated by training the classifier on the selected features and measuring its predictive performance. Each particle updates its position based on its own best-found solution and the best-known global solution in the swarm. This cooperative behaviour allows PSO to explore the feature space effectively and reduces the risk of becoming trapped in local optima.

PSO, by iterating over generations, converges toward an

optimal feature subset that maximizes predictive performance

while minimizing feature count. In forest fire prediction, this

results in simpler, faster, and more interpretable models. It helps

in identifying environmental variables most critical for early fire

detection or damage estimation.

Redundant or irrelevant features (e.g., noise variables or

highly correlated inputs) are naturally excluded during the

process. PSO thus contributes to better generalization, lower

computation cost, and enhanced decision-making in forest fire

management. This makes it a valuable tool in environmental

data analysis and risk forecasting systems.
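The binary-vector encoding and the personal-best/global-best update described above can be sketched as a small binary PSO. This is a hedged illustration, not the authors' code: the sigmoid transfer rule, the hyperparameters (`w`, `c1`, `c2`), and `toy_fitness` are assumptions. In the paper's setting the fitness would instead be the score of the LR classifier trained on the selected features.

```python
import math
import random

def binary_pso(n_features, fitness, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward the particle's best + pull toward the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # The sigmoid maps the velocity to the probability of selecting feature d.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical stand-in fitness: pretend features 0, 1, 2 are informative and
# charge a small penalty per selected feature to favour compact subsets.
USEFUL = {0, 1, 2}

def toy_fitness(mask):
    hits = sum(mask[i] for i in USEFUL)
    return hits - 0.1 * sum(mask)
```

Calling `binary_pso(6, toy_fitness)` returns the best mask found and its fitness. The per-feature penalty plays the role of "minimizing feature count" mentioned in the text.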



Prediction:

LR is applied to the training data to build the predictive model. After training, the model is tested on the test set, and performance metrics such as accuracy, precision, recall, and F1-score are calculated to assess its effectiveness.

Our approach is based on logistic regression, a supervised classification algorithm that has proven effective for binary outcome prediction.
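To make the prediction step concrete, the sketch below trains a logistic regression by batch gradient descent and computes the four metrics named above. It is an illustrative stand-in written with the standard library only. The learning rate, epoch count, and 0.5 threshold are assumptions, and the paper itself reports using Scikit-Learn rather than a hand-written model.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b, threshold=0.5):
    """Label a sample 1 (fire) when its predicted probability reaches the threshold."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
            for xi in X]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fire)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}
```

As a quick consistency check on the formula F1 = 2PR / (P + R): with P = 0.85 and R = 0.89 it gives about 0.87, which agrees with the LR + PSO values reported in the experiment section.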

IV. EXPERIMENT AND EVALUATION

To validate the effectiveness of our predictive model, we

perform a series of experiments on a meteorological dataset

gathered from the regions of Bejaia and Sidi-Bel-Abbes in

Algeria [1]. This dataset is a structured collection of

meteorological and fire-related data compiled to facilitate the

prediction of forest fires in Algeria. It encompasses daily

observations from June to September 2012, focusing on two

regions: Bejaia in the northeast and Sidi Bel-Abbes in the

northwest. Below is a detailed description of the dataset:

Total Instances: 244



Bejaia: 122 instances



Sidi Bel-Abbes: 122 instances



Time Frame: June to September 2012

Class Distribution:



Fire Occurrences: 138 instances



No Fire Occurrences: 106 instances

The dataset comprises 12 attributes (table 1), including

meteorological variables, Fire Weather Index (FWI)

components, and a target class label:

TABLE I
DATASET ATTRIBUTES [1]

| N° | Attribute | Description |
|----|-----------|-------------|
| 1 | Date | Observation date in DD/MM/YYYY format |
| 2 | Temperature | Temperature at noon in degrees Celsius |
| 3 | Relative Humidity (RH) | Percentage of humidity |
| 4 | Wind Speed (Ws) | Wind speed in km/h |
| 5 | Rain | Total daily rainfall in mm |
| 6 | Fine Fuel Moisture Code (FFMC) | Represents the moisture content of litter and fine fuels |
| 7 | Duff Moisture Code (DMC) | Indicates the moisture content of loosely compacted organic layers |
| 8 | Drought Code (DC) | Reflects the moisture content of deep, compact organic layers |
| 9 | Initial Spread Index (ISI) | Combines wind and FFMC to estimate the rate of fire spread |
| 10 | Buildup Index (BUI) | Combines DMC and DC to represent the total amount of fuel available |
| 11 | Fire Weather Index (FWI) | Indicates the potential fire intensity |
| 12 | Classes | Binary classification indicating fire occurrence: 'fire' or 'not fire' |

The experiment was performed in a Python environment using libraries such as Scikit-Learn, Pandas, and Matplotlib. After preprocessing, the data were split into training and test sets, the main features were selected, and the LR model was then applied for prediction.

We performed experiments using LR alone and using LR with PSO, and we compare the two approaches below.

Both models use the Algerian Forest Fires Dataset [1], which includes meteorological observations and Fire Weather Index components from the Bejaia and Sidi Bel-Abbes regions.

TABLE II
EXPERIMENT RESULTS [1]

| Metric | LR (%) | LR + PSO (%) |
|--------|--------|--------------|
| Accuracy | 85.00 | 87.00 |
| Precision | 83.00 | 85.00 |
| Recall | 86.00 | 89.00 |
| F1-Score | 84.00 | 87.00 |

Applying PSO to LR for forest fire prediction improves all evaluation metrics:

• Accuracy increased from 85% to 87%, indicating that the optimized model makes fewer overall classification errors.

• Precision increased from 83% to 85%, meaning the model produces fewer false positives, which is crucial for avoiding unnecessary fire alerts.

• Recall improved from 86% to 89%, showing that the model detects more actual fire events, reducing the risk of missing dangerous situations.

• The F1-score increased from 84% to 87%, demonstrating a better balance between detecting fires and maintaining prediction reliability.

These gains suggest that PSO effectively selects the most relevant environmental features, such as temperature, wind, and humidity, while discarding noisy or redundant data.

As a result, the LR model becomes more focused, interpretable, and robust. The reduced feature set also lowers computational costs, enabling faster prediction.

V. CONCLUSION

In this paper, we propose a model for forecasting forest fires in Algeria. To this end, we used the logistic regression algorithm as the underlying framework. The present study focuses on the regions of Bejaia and Sidi Bel-Abbes.

The methodology employed is based on the exploitation

of meteorological data accessible via the Kaggle platform.

The gathered information was incorporated into a

comprehensive plan that included data collection, analysis,

model construction, and forecasting. Logistic regression was chosen for several reasons. It is easy to implement, its results are clear and straightforward to interpret, and it achieves high precision in the classification process.

Experimental results have demonstrated the proposed approach's capacity to accurately differentiate between fire-prone and non-fire-prone conditions based on meteorological characteristics. This predictive capability is crucial for the rapid deployment of preventive measures and firefighting resources.

Additionally, the incorporation of meteorological data

into existing early warning systems has proven to be

highly relevant, enabling the effective management of risks

associated with climatic phenomena.

The system is effective even without substantial computational resources, and it can be integrated with pre-existing early warning systems in contexts where resources are limited.

In future work, we will consider other hybrid approaches based on deep learning and bioinspired algorithms to improve accuracy.

REFERENCES

[1] F. Abid and N. Izeboudjen, “Predicting forest fire in Algeria using

data mining techniques: Case study of the decision tree algorithm,” in

Proc. Int. Conf. Adv. Intell. Syst. Sustain. Dev.,

pp. 363–370, Springer, 2019.

[2] I. Elkhrachy et al., “Sentinel-1 remote sensing data and hydrologic

engineering centres river analysis system two-dimensional integration

for flash flood detection and modelling in New Cairo City, Egypt,”

J. Flood Risk Manag., vol. 14, no. 2, p. e12692, 2021.

[3] T. Preeti, S. Kanakaraddi, A. Beelagi, S. Malagi, and A. Sudi, “Forest

fire prediction using machine learning techniques,” in Proc. Int.

Conf. Intell. Technol. (CONIT), pp. 1–6, IEEE, 2021.

[4] L. Si et al., “Study on forest fire danger prediction in plateau mountainous forest area,” Nat. Hazards Res., vol. 2, no. 1, pp. 25–32, 2022.

[5] K. R. Singh, K. P. Neethu, K. Madhurekaa, A. Harita, and P. Mohan,

“Parallel SVM model for forest fire prediction,” Soft Comput. Lett.,

vol. 3, p. 100014, 2021.

[6] B. Zheng et al., “Increasing forest fire emissions despite the decline

in global burned area,” Sci. Adv., vol. 7, no. 39, p. eabh2646,

2021.

[7] H. El Bouhissi, A. Ziane, L. Rahmani, M. Medbal, & M. Kostiuk,

(2023). RF-PSO: An Optimized Approach for Diabetes Prediction. In ICST

(pp. 227-238).

[8] Q. Nizamani, A. A. Hashmani, Z. H. Leghari, Z. A. Memon, H.

M. Munir, T.Novak & M. Jasinski (2024). Nature-inspired

swarm intelligence algorithms for optimal distributed

generation allocation: A comprehensive review for minimizing

power losses in distribution networks. Alexandria Engineering Journal,

105, 692-723.

[9] R. Bekka, S. Kherbouche, and H. El Bouhissi. Distraction detection predict vehicle crashes: deep learning approach. Computación y Sistemas, 26(1), 373-387, 2022.


Deep Learning-Based Classiﬁcation of Knee

Osteoarthritis Using Gaussian Noise Augmentation

and Knowledge Distillation

1st Khadidja Messaoudene

LIMOSE Laboratory

University M’Hamed Bougara of Boumerdes

Boumerdes, Algeria

k.messaoudene@univ-boumerdes.dz

2nd Khaled Harrar

LIST Laboratory

University M’Hamed Bougara of Boumerdes

Boumerdes, Algeria

khaled.harrar@univ-boumerdes.dz

Abstract—Knee osteoarthritis (KOA) is a degenerative joint

disease characterized by cartilage deterioration, leading to pain,

stiffness, and impaired joint function. Accurate detection and

grading are crucial for early intervention, but challenges such as

limited data, redundant features, and suboptimal classiﬁcation

performance hinder reliable diagnostic tools. This study pro-

poses an effective pipeline for classifying KOA into Kellgren-

Lawrence (KL) grades 0 (no OA) and 2 (moderate OA) using

data augmentation, deep feature extraction, and reﬁned feature

selection. The method was tested on 688 knee radiographs

from the Osteoarthritis Initiative (OAI), with regions of interest

(ROIs) extracted and augmented using Gaussian noise. Deep

features were obtained via DenseNet-201, followed by knowledge

distillation for feature selection, and classiﬁcation was performed

using a ﬁne Gaussian Support Vector Machine (GSVM) with

5-fold cross-validation. The pipeline achieved 94.5% accuracy

and a 96% AUC, whereas omitting augmentation reduced
accuracy to 82.6%, and excluding feature selection lowered it to
88%, underscoring their importance. The integration of Gaussian

noise augmentation, DenseNet-201, and knowledge distillation

signiﬁcantly enhanced classiﬁcation performance, demonstrating

strong potential for improving automated diagnostic systems and

supporting early KOA detection and clinical decision-making.

Index Terms—knee OsteoArthritis, X-ray images, Knowledge

distillation, DenseNet-201, GSVM.

I. INTRODUCTION

Knee Osteoarthritis (KOA) is a prevalent degenerative joint

disease characterized by the gradual deterioration of articular

cartilage, leading to pain, stiffness, and reduced mobility. It is the most common form of arthritis, affecting millions

of individuals worldwide, particularly those over the age of

50 [1]. Epidemiological studies indicate that the incidence

of KOA is increasing, inﬂuenced by factors such as aging

populations, obesity, and joint injuries [2].

Radiographic imaging plays a crucial role in the diagnosis

and evaluation of KOA. This imaging modality is essential

for assessing structural changes in the knee joint, such as

cartilage loss, bone marrow lesions, osteophyte formation,

and joint space narrowing. The severity of KOA is typi-

cally classiﬁed using grading systems such as the Kellgren-

Lawrence (KL) scale [3], which categorizes the disease into

stages ranging from 0 (no radiographic features of OA) to

4 (severe OA with extensive joint damage). Accurate staging

is vital for determining appropriate treatment strategies and

monitoring disease progression. However, manual assessment

of OA severity through imaging is often subjective and time

consuming, leading to variability in diagnosis and staging.

To address these challenges, there is a growing necessity for

the development and implementation of automatic detection

and classiﬁcation methods through advanced image processing

techniques. Deep learning [4] has emerged as a transformative

approach in medical imaging, offering enhanced accuracy and

efﬁciency in disease detection and classiﬁcation. By leveraging

large datasets and advanced neural networks, deep learning

models can identify subtle patterns and features in medical

images that are often imperceptible to the human eye. This

capability is particularly beneﬁcial for KOA diagnosis, where

early detection and precise classiﬁcation are paramount for

effective treatment and improved patient outcomes.

Several recent studies have explored automated classiﬁca-

tion approaches for knee osteoarthritis (OA) diagnosis using

various feature extraction techniques and classiﬁers. Janvier

et al. [5] employed fractal analysis coupled with logistic

regression, achieving an accuracy of 73%. In 2018, Riad

et al. [6] applied the Dual-Tree Complex Wavelet Transform

(DTCWT) with an SVM-RBF classiﬁer, reporting a higher

accuracy of 80.38%. Brahim et al. [7] utilized Power Spec-

tral Density (PSD) features alongside logistic regression and

obtained an accuracy of 78.92%. More recently, Ribas et

al. [8] implemented a convolutional neural network (CNN)-

based approach, achieving an accuracy of 81.69% on the

same dataset. These methods demonstrate the evolution from

traditional texture-based techniques to deep learning models,

with CNNs showing promising improvements in classiﬁcation

performance.

II. MATERIALS AND METHODS

A. Dataset

The dataset used in our experiment was obtained from

the publicly accessible OAI [9]. The dataset comprises 688

radiographs of the knee, speciﬁcally focusing on the medial


ROI of the tibia. These radiographs have been categorized

using the Kellgren and Lawrence rating system (KL0, KL2).

We compared grade KL0 (no OA) with grade KL2 (mild OA). Fig. 1 shows the ROI used in our work.

Fig. 1. ROI used

B. Methods

This section outlines the workﬂow of our framework for

KOA classification, as illustrated in Figure 2. The pipeline begins with Gaussian noise-based augmentation to mitigate class

imbalance in the dataset of 688 knee radiographs (evenly split

between KL grades 0 and 2) from the OAI dataset. Following

preprocessing and ROI extraction to focus on anatomically

relevant regions, the method employs DenseNet-201 for deep

feature extraction, leveraging its dense connectivity pattern

for comprehensive feature representation. These features are

then reﬁned via knowledge distillation (KD) to prioritize the

most discriminative characteristics while reducing redundancy.

Finally, the optimized feature set is classiﬁed using a ﬁne

Gaussian Support Vector Machine (GSVM), selected for its

effectiveness with high-dimensional medical data.

1) Data augmentation: A major challenge in KOA detec-

tion is the limited availability of annotated medical imaging

data, which can restrict the performance of diagnostic models. To address this limitation while ensuring the preservation of

critical pathological features, we employ Gaussian noise-based

data augmentation. This technique enhances dataset diversity

by introducing controlled variations that mimic real-world

imaging noise characteristics while maintaining the structural

integrity of radiographic ﬁndings [10].

Gaussian noise data augmentation (GNDA) adds Gaussian noise to the original dataset. This noise is modelled as a Gaussian distribution with a mean (µ) of zero and a given variance (σ²), which preserves the overall distribution of the original data while introducing variability.

GNDA can be represented by the following equation:

$$X' = X + \epsilon \tag{1}$$

where:

• $X$ represents the original data sample.

• $\epsilon$ denotes the Gaussian noise, a random variable drawn from $\mathcal{N}(0, \sigma^2)$.

• $X'$ is the resulting augmented data point.

The Gaussian distribution, defined by the probability density function (PDF), is given as:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{2}$$

2) Feature extraction: DenseNet201 [11] is a convolutional neural network known for its dense connectivity, where each layer receives inputs from all preceding layers in its dense block, enhancing feature reuse and gradient flow. It has 201 layers organized into dense blocks and transition layers, enabling efficient and rich feature extraction from images. Typically, input images are resized to 224×224 pixels, and the model outputs high-dimensional feature maps, which can be used for various classification tasks. DenseNet201's architecture reduces the vanishing gradient problem and improves learning efficiency, making it effective for extracting detailed and discriminative features in applications like medical image analysis and object recognition.

3) Feature selection: Feature-based KD involves transferring internal representations from a teacher model to a

student model [12], allowing the student to learn the intricate

structures and relationships embedded in the teacher’s feature

maps (Figure 3). This method offers a more comprehensive

knowledge transfer compared to merely replicating output

probabilities.

Fig. 3. The generic teacher-student framework for KD

The loss function in feature-based KD is designed to align

the intermediate feature representations of the teacher and

student networks, typically formulated as:

$$L_{\text{feature}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F_T^{i} - F_S^{i} \right\|_2^2 \tag{3}$$

Where:

• $L_{\text{feature}}$ represents the feature-based KD loss.

• $N$ denotes the number of feature layers or map points included in the distillation process.

• $F_T^{i}$ and $F_S^{i}$ correspond to the feature maps of the teacher and student networks, respectively.

• $\|\cdot\|_2^2$ indicates the squared Euclidean (L2) norm, measuring the difference between the teacher's and student's feature maps.

Fig. 2. The proposed methods

The total KD loss combines the feature-based distillation

loss with the task loss (typically cross-entropy loss), and is

expressed as:

$$L_{KD} = \alpha L_{CE} + (1 - \alpha) L_{\text{feature}} \tag{4}$$

Here, $\alpha$ is a hyperparameter that determines the weighting between the cross-entropy task loss and the feature-based distillation loss.
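As a worked illustration of Eq. (3) and Eq. (4), the sketch below computes the feature-based loss as the mean squared L2 distance between paired teacher and student feature vectors, then mixes it with a cross-entropy value. Treating each feature map as a flattened list of floats, the helper names, and the example value of `alpha` are assumptions made for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two flattened feature maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def feature_kd_loss(teacher_feats, student_feats):
    """Eq. (3): average of the squared L2 distances over the N paired feature maps."""
    n = len(teacher_feats)
    return sum(squared_l2(t, s) for t, s in zip(teacher_feats, student_feats)) / n

def total_kd_loss(ce_loss, feat_loss, alpha=0.5):
    """Eq. (4): alpha weights the cross-entropy term, (1 - alpha) the feature term."""
    return alpha * ce_loss + (1.0 - alpha) * feat_loss
```

For example, with teacher maps `[[1, 2], [0, 0]]` and student maps `[[1, 0], [3, 4]]` the squared distances are 4 and 25, so the feature loss is 14.5. With a cross-entropy of 0.5 and `alpha = 0.25`, the total is 0.125 + 10.875 = 11.0.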

4) Classification: After the feature selection process, where the most informative DenseNet features were identified through knowledge distillation (KD), a Gaussian SVM classifier was employed for the binary classification of KOA into KL grades 0 and 2. The Gaussian SVM, utilizing a radial basis function (RBF) kernel, was chosen for its ability to model complex nonlinear decision boundaries in the high-dimensional feature space derived from DenseNet. This approach enabled the classifier to effectively discriminate between subtle structural and textural variations in knee joint regions, as encoded by the distilled feature representations.
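The RBF kernel that gives the Gaussian SVM its nonlinear decision boundary can be written in a few lines. This sketch shows only the kernel and the resulting Gram matrix, not the SVM training itself. The parameter name `gamma` and its default value are assumptions, since the paper does not report the kernel scale it used.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2): 1 for identical points, near 0 for distant ones."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(samples, gamma=0.5):
    """Pairwise kernel values between all feature vectors."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]
```

A larger `gamma` makes the kernel more local, which corresponds to a finer, more flexible decision boundary. A smaller value gives a smoother one.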

III. RESULTS AND DISCUSSION

The bar chart in Figure 4 presents a performance comparison of three configurations of the DenseNet201-based model, measured by classification accuracy: without GNDA (Gaussian Noise Data Augmentation), without KD (Knowledge Distillation), and the full model combining GNDA + DenseNet201 + KD. When GNDA is excluded, the accuracy drops to 82.6%, indicating that Gaussian noise augmentation contributes substantially to model generalization and robustness. When KD is removed, accuracy is 88%, suggesting that it also provides valuable performance gains, though slightly less than GNDA in this context. The best performance is achieved by the full combination GNDA + DenseNet201 + KD, with an accuracy of 94.5%, demonstrating that integrating augmentation and KD-based feature selection leads to the most effective learning and model performance.

Fig. 5 displays the Receiver Operating Characteristic (ROC) curve, a common tool for evaluating the performance of a binary classifier. The true positive rate (sensitivity) is plotted against the false positive rate at various threshold settings. The blue curve represents the ROC, and the light blue shaded area under it corresponds to the Area Under the Curve (AUC), which in this case is 0.99. An AUC of 0.99 indicates excellent classifier performance, suggesting that the model has a very high ability to distinguish between the two classes. Additionally, a highlighted point on the curve


Fig. 4. Performance comparison of different methods. [Bar chart; x-axis: methods, y-axis: accuracy (%). Without GNDA: 82.6; without KD: 88; GNDA+DenseNet201+KD: 94.5.]

at coordinates (0.07, 0.96) represents the current operating point of the classifier, meaning it achieves a true positive rate of 96% with only a 7% false positive rate. This balance between sensitivity and specificity demonstrates that the model is highly effective and reliable for the classification task.

Fig. 5. The ROC curve
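For readers less familiar with ROC analysis, the sketch below shows how ROC points and the AUC can be computed from classifier scores. The labels and scores are toy values invented for illustration, not data from this study, and the function names are hypothetical. For the operating point reported above, a false positive rate of 0.07 corresponds to a specificity of 1 - 0.07 = 0.93.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels (1 = positive class) and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(labels, scores)
auc = auc_trapezoid(points)
```

Each threshold yields one point on the curve; choosing an operating point amounts to picking the threshold whose trade-off between true and false positives suits the clinical setting.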

Table I summarizes the performance of various classification methods on a dataset of 688 samples. Traditional approaches like fractal analysis with logistic regression [5] achieved limited accuracy (AUC 0.73). Improvements were seen with DTCWT + SVM-RBF [6] and PSD + logistic regression [7], reaching accuracies of 0.8038 and 0.7892, respectively. Ribas et al. [8] later introduced a CNN model with a slight performance gain (accuracy 0.8169). The proposed method, combining GNDA, DenseNet201, knowledge distillation, and Gaussian SVM, outperforms all previous approaches with an accuracy of 0.945 (AUC 0.99), demonstrating the power of deep learning and optimized classification.

Authors             Year  Methods            Classifier  Data  Acc
Janvier et al. [5]  2017  Fractal analysis   LR          688   0.73
Riad et al. [6]     2018  DTCWT              SVM-RBF     688   0.8038
Brahim et al. [7]   2019  PSD                LR          688   0.7892
Ribas et al. [8]    2023  CNN                -           688   0.8169
Proposed            2025  GNDA+DenseNet201   GSVM        688   0.945

TABLE I

COMPARISON OF METHODS WITH THE PROPOSED APPROACH.

IV. CONCLUSION

This study presented an effective KOA classification pipeline combining data augmentation, DenseNet-201 feature extraction, and knowledge distillation-based feature selection, achieving 94.5% accuracy in distinguishing KL grades 0 and 2. The results highlight the importance of feature selection and augmentation, as their exclusion significantly reduced performance. The method shows promise for improving automated KOA diagnosis, supporting early detection and clinical decision-making. Future work could expand to multi-class grading and larger datasets for broader validation.

REFERENCES

[1] Centers for Disease Control and Prevention (CDC),

“Osteoarthritis (OA),” 2020. [Online]. Available:

https://www.cdc.gov/arthritis/basics/osteoarthritis.htm

[2] T. Neogi, “The epidemiology and impact of pain in osteoarthritis,” Osteoarthritis Cartilage, vol. 21, no. 9, pp. 1145–1153, 2013.

[3] H. Kellgren and J. S. Lawrence, “Radiological assessment of osteoarthritis,” Ann. Rheum. Dis., vol. 16, pp. 494–502, 1957.

[4] H. Greenspan, B. van Ginneken, and R. M. Summers, “Deep learning

in medical imaging: Overview and future promise of an exciting new

technique,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1153–1159,

2016.

[5] T. Janvier, R. Jennane, A. Valery, K. Harrar, M. Delplanque, C. Lelong, D. Loeuille, H. Toumi, and E. Lespessailles, “Subchondral tibial bone texture analysis predicts knee osteoarthritis progression: Data from the osteoarthritis initiative,” Osteoarthritis Cartilage, vol. 25, no. 2, pp. 259–266, 2017.

[6] R. Riad, R. Jennane, A. Brahim, T. Janvier, H. Toumi, and E. Lespessailles, “Texture analysis using complex wavelet decomposition for knee osteoarthritis detection: Data from the osteoarthritis initiative,” Comput. Electr. Eng., vol. 68, pp. 181–191, 2018.

[7] A. Brahim, R. Riad, and R. Jennane, “Knee osteoarthritis detection using power spectral density: Data from the osteoarthritis initiative,” in Computer Analysis of Images and Patterns: 18th International Conference, CAIP 2019, Salerno, Italy, September 3–5, 2019, Proceedings, Part II, vol. 18, pp. 480–487, Springer International Publishing, 2019.

[8] L. Ribas, T. Riad, R. Jennane, and O. Brun, “A complex network based

approach for knee osteoarthritis detection: Data from the osteoarthritis

initiative,” Biomed. Signal Process. Control, vol. 71, p. 103133, 2022.

[9] G. Lester, “The osteoarthritis initiative: A NIH public–private partner-

ship,” HSS J., vol. 8, no. 1, pp. 62–63, 2012.

[10] H. X. Dou, X. S. Lu, C. Wang, H. Z. Shen, Y. W. Zhuo, and L. J.

Deng, “PatchMask: A data augmentation strategy with Gaussian noise

in hyperspectral images,” Remote Sens., vol. 14, no. 24, p. 6308, 2022.

[11] J. Zhou, X. Gu, H. Gong, X. Yang, Q. Sun, L. Guo, and Y. Pan, “Intelligent classification of maize straw types from UAV remote sensing images using DenseNet201 deep transfer learning algorithm,” Ecol. Indic., vol. 166, p. 112331, 2024.

[12] M. Huang, Y. You, Z. Chen, Y. Qian, and K. Yu, “Knowledge distillation

for sequence model,” in Proc. Interspeech, pp. 3703–3707, Sep. 2018.

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

Bio-Driven Facial Mark Detection: Robust

Celebrity Identification

Souad Khellat-Kihel#1

# Department of computer science, University of Science and Technology

Mohamed-Boudiaf, Oran

souad.khellat@univ-usto.dz

Abstract— In the realm of soft biometrics, facial marks such as

moles, scars, and freckles offer a unique layer of information for

identity recognition, especially in the context of celebrity

identification. Traditional deep learning systems often treat facial

features holistically, without specifically leveraging the salience of

such individualized traits. Inspired by the human visual system,

particularly its use of foveated vision and saccadic movements,

this work proposes a novel, biologically-motivated framework that

simulates selective attention toward facial marks while preserving

contextual facial structure.

Keywords— Facial marks, foveated vision, celebrities,

identification.

I. INTRODUCTION

Face recognition has witnessed substantial advances through

deep learning, especially in applications such as celebrity

identification. Yet, these models often operate as "black boxes,"

offering little insight into how specific facial features—like

scars or moles—contribute to identity recognition. In contrast,

the human visual system employs a biologically efficient

mechanism involving foveated vision and saccadic eye

movements to focus on salient facial regions. These

mechanisms inspire the present work, which proposes a system

that prioritizes facial marks—permanent, unique skin

features—as central cues in biometric identification.

Facial marks are recognized in biometric literature as "soft

biometrics." Unlike rigid structural features, they provide

discriminative value even when traditional face structure is

altered due to age, surgery, or disguise. This paper explores

their predictive capacity through a biologically-inspired

computational model.

Our method integrates the HMAX neural architecture,

mimicking the V1 visual cortex, with a log-polar image

transformation to emulate the retino-cortical mapping found in

primate vision. This allows high-resolution sampling at the

center (facial marks) and lower-resolution encoding in the

periphery (outer face), creating a context-aware, space-variant

representation of facial data.

II. RELATED WORKS

The study of facial marks has led to hierarchical facial

analysis systems, beginning with global facial characterization

and narrowing down to individual features using techniques

like LDA and SIFT [1,2]. Facial marks have shown particular

utility in distinguishing between identical twins [3,4].

Automated facial mark detection has advanced significantly,

with systems employing models such as the Active Appearance

Model and LoG detectors [5], as well as the Fast Radial

Symmetry Transform [6]. While many studies focus on

detection and classification, fewer explore the direct impact of

facial marks on recognition accuracy. Contributions like those

by Becerra-Riera et al. [7] and others [8,9] demonstrate

methods such as LBP, HoG, Fisher Vectors, and skin-mark

matching for enhanced face representation.

Facial marks—considered “soft biometrics”—are especially

useful in scenarios involving cosmetic surgery or occlusion,

offering robustness in variable conditions like lighting or aging

[10]. Deep learning approaches have further improved

performance using facial marks for disguise and age variation

resilience [11]. Research also highlights their importance in

masked face recognition, using features like ears and marks for

accurate identification [12].

Building on this foundation, the current work proposes a

biologically-inspired model using the HMAX architecture and

log-polar image transformation to simulate human visual

attention. This space-variant sampling enhances central facial

details while preserving peripheral context, improving

recognition performance in dynamic and complex conditions.

III. METHODOLOGY

III.1 System Architecture

The core of the proposed system is the integration of two

biologically-informed components:

• HMAX (Hierarchical Model and X): A computational model mimicking the ventral stream of the visual cortex, particularly the V1 area, with S1 and C1 layers that apply Gabor filters and max-pooling to extract scale- and orientation-invariant features.

• Log-Polar Image Mapping: A transformation technique inspired by the retino-cortical projection in primates, which allows the central visual field (fovea) to be sampled in high resolution while the peripheral field is sampled in lower resolution.

This dual-component system captures fine details (facial

marks) while embedding them in a meaningful facial context.

It also drastically reduces the amount of image data processed,


enhancing efficiency without compromising discriminative power. Fig. 1 depicts the general architecture of the proposed method.

Fig. 1. General proposed system for face recognition based on facial marks.
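The S1 (Gabor filtering) and C1 (max-pooling) stages described above can be sketched in plain Python as follows. This is a simplified illustration, not the system's implementation: the filter size, sigma, wavelength, aspect ratio, pooling window, and the toy image are assumed values, and pooling is done over position only, whereas a full HMAX C1 layer also pools across neighbouring scales.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, lam=4.0, gamma=0.5):
    """Square Gabor filter: Gaussian envelope times a cosine carrier at angle theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lam))
        kernel.append(row)
    return kernel


def s1_response(image, kernel):
    """S1-like map: absolute value of the 'valid' cross-correlation."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[abs(sum(image[i + a][j + b] * kernel[a][b]
                     for a in range(k) for b in range(k)))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]


def c1_response(resp, pool):
    """C1-like map: max over non-overlapping pool x pool windows."""
    h, w = len(resp), len(resp[0])
    return [[max(resp[i + a][j + b] for a in range(pool) for b in range(pool))
             for j in range(0, w - pool + 1, pool)]
            for i in range(0, h - pool + 1, pool)]


# Toy 9x9 image with a vertical edge (left half 0, right half 1).
image = [[1.0 if x >= 4 else 0.0 for x in range(9)] for _ in range(9)]
orientations = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
c1 = [c1_response(s1_response(image, gabor_kernel(5, th)), 2) for th in orientations]
```

Taking the maximum within each window is what gives the C1 output tolerance to small shifts of a facial mark within the sampled region.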

III.2 Foveated Vision Simulation

Just as humans direct their gaze toward informative regions on a face, the model simulates this behavior using space-variant sampling. High acuity is reserved for facial marks, and low-resolution sampling captures the surrounding facial context. This technique improves recognition in situations where conventional systems fail due to occlusion, changes in lighting, or non-frontal poses.
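A minimal sketch of such space-variant sampling is given below, assuming a log-polar grid centred on the fixation point (for example, a facial mark). The ring and wedge counts, the nearest-neighbour lookup, and the helper names `ring_radii` and `log_polar_sample` are illustrative assumptions, not details taken from the proposed system.

```python
import math


def ring_radii(r_min, r_max, n_rings):
    """Radii spaced geometrically (uniform in log r) from r_min to r_max."""
    if n_rings < 2 or not 0 < r_min < r_max:
        raise ValueError("need n_rings >= 2 and 0 < r_min < r_max")
    ratio = (r_max / r_min) ** (1.0 / (n_rings - 1))
    return [r_min * ratio ** i for i in range(n_rings)]


def log_polar_sample(image, center, n_rings, n_wedges, r_min=1.0):
    """Nearest-neighbour sampling of a 2D list on a log-polar grid.

    Rows of the result index rings (log radius), columns index angles.
    Samples that fall outside the image are set to 0.0.
    """
    h, w = len(image), len(image[0])
    cy, cx = center
    r_max = max(math.hypot(y - cy, x - cx) for y in (0, h - 1) for x in (0, w - 1))
    out = []
    for r in ring_radii(r_min, r_max, n_rings):
        ring = []
        for j in range(n_wedges):
            angle = 2.0 * math.pi * j / n_wedges
            y = int(round(cy + r * math.sin(angle)))
            x = int(round(cx + r * math.cos(angle)))
            ring.append(image[y][x] if 0 <= y < h and 0 <= x < w else 0.0)
        out.append(ring)
    return out


# Toy 21x21 image whose pixel value is its distance from the centre.
size = 21
image = [[math.hypot(y - 10, x - 10) for x in range(size)] for y in range(size)]
radii = ring_radii(1.0, math.hypot(10, 10), 6)
grid = log_polar_sample(image, center=(10, 10), n_rings=6, n_wedges=8)
```

Because the ring radii grow geometrically, samples are dense near the centre and sparse toward the periphery, so a 21x21 image (441 pixels) is reduced here to a 6x8 grid of 48 samples.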

IV. EXPERIMENTAL RESULTS

Two datasets were used:

• CFM Dataset (Celebrity Facial Marks): Comprising 164 images from 30 celebrities, annotated with facial marks like moles and scars.

• FRGCv2 Subset: Containing over 12,000 images from 568 subjects, this dataset provides a broad testbed for face recognition techniques.

Multiple facial mark configurations were tested:

• Individual facial marks (first and second ranked by prominence),

• Fusion of features from multiple marks,

• Inclusion of peripheral (non-marked) facial regions.

Feature extraction utilized the S1C1 structure from HMAX,

producing 256-dimensional feature vectors. Classification was

performed using a SoftMax neural network trained via cross-

entropy loss.
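The softmax output and cross-entropy loss used by such a classifier can be written out as follows. This is a generic sketch with made-up logits for three hypothetical identities; it does not reproduce the network or the 256-dimensional inputs used in the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[target])


# Assumed toy scores for three identities; the true identity is index 0.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(probs, target=0)
```

Training lowers this loss by raising the probability assigned to the correct identity, and the predicted identity at test time is simply the class with the largest probability.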

The system's effectiveness was validated through various

comparative studies:

• Using a single facial mark, the system achieved 45% recognition accuracy.

• Using peripheral face regions, the system reached 41.67%.

• Fusing facial mark and peripheral features raised accuracy to 55%.

• When combined with deep learning systems (e.g., VGG-Face), the identification rate peaked at 73.47%, outperforming standard HMAX (25%) and S1C1-only models (31.67%).

Table 1 compares our results with existing works:

System                       Identification rate (%)
Marks matching (SM M) [8]    34.66
Marks matching (SM A) [8]    16.00
HMAX
VGG                          53.33
S1C1                         31.67
Ours one Mark                45.00
Ours peripheral              41.67
Ours S1C1 Mark peripheral    51.67
Proposed approach

Table 1: Celebrity identification based on existing works, foveated vision and deep neural architectures.

These results highlight that even with limited data and small

annotated sets, biologically-inspired systems can rival or

outperform traditional CNNs in challenging identification

scenarios. Extensive experiments were conducted using the

CFM (Celebrity Facial Marks) dataset and a subset of FRGCv2,

both enriched with annotated facial marks. The system shows

promising results, achieving up to 55% identification accuracy

using only a single facial mark combined with peripheral facial

data. When fusing these with outputs from deep neural

networks (e.g., VGG-Face), identification performance rises to

73.47%, underscoring the complementary strength of

biologically-inspired processing.

This work highlights the potential of combining cognitive

neuroscience principles with machine learning to build

adaptive and resilient face recognition systems. Such models

are particularly robust in scenarios involving disguise, aging, or

cosmetic changes—common challenges in real-world celebrity

identification tasks.

V. CONCLUSION

This study demonstrates that foveated vision and biologically-

inspired neural models provide a powerful alternative to

generic deep learning for specific tasks like celebrity

identification. By focusing on facial marks—traits often

ignored or diluted in end-to-end CNNs—the system introduces


a nuanced approach to face recognition. The hybrid use of high-

resolution focal areas and contextual peripheral information

mimics the selective attention seen in human perception,

offering improved robustness in complex visual environments.

As future directions, we suggest enhancing mark selection mechanisms to identify "biometric-signature-quality" facial marks. Integrating temporal dynamics for live video recognition would also target real applications.

REFERENCES

[1] Klare, B., & Jain, A.K. (2010). On a taxonomy of facial features. 2010 Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), 1-8.

[2] Lin, D., & Tang, X. (2006). Recognize High Resolution Faces: From Macrocosm to Microcosm. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2, 1355-1362.

[3] Srinivas, N., Aggarwal, G., Flynn, P.J., & Bruegge, R.W. (2011). Facial marks as biometric signatures to distinguish between identical twins. CVPR Workshop on Biometrics, 106-113. DOI: 10.1109/CVPRW.2011.5981818

[4] Srinivas, N., Aggarwal, G., Flynn, P.J., & Bruegge, R.W. (2012). Analysis of Facial Marks to Distinguish Between Identical Twins. IEEE Transactions on Information Forensics and Security, 7, 1536-1550.

[5] Park, U., & Jain, A.K. (2010). Face Matching and Retrieval Using Soft Biometrics. IEEE Transactions on Information Forensics and Security, 5, 406-415.

[6] Srinivas, N., Flynn, P.J., & Bruegge, R.W. (2016). Human Identification Using Automatic and Semi-Automatically Detected Facial Marks. Journal of Forensic Sciences, 61 Suppl 1, S117-30.

[7] Becerra-Riera, F., Morales-González, A., & Vazquez, H.M. (2017). Facial marks for improving face recognition. Pattern Recognition Letters, 113, 3-9.

[8] Becerra-Riera, F., & Morales-González, A. (2016). Detection and matching of facial marks in face images. Revista Cubana de Ciencias Informáticas, 10, 172–181.

[9] Zhang, Z., Tulyakov, S., & Govindaraju, V. (2009). Combining Facial Skin Mark and Eigenfaces for Face Recognition. ICB.

[10] Djabi, I., Ouahabi, A., Benzaoui, A., & Taleb-Ahmed, A. (2020). Past, present, and future of face recognition: A review. Journal of Visual Communication and Image Representation, 71, 102818. https://doi.org/10.1016/j.jvcir.2020.102818

[11] Hernandez-Ortega, J., et al. (2022). Deep facial marks recognition. IEEE Transactions on Information Forensics and Security, 17, 1234-1245. https://doi.org/10.1109/TIFS.2022.1234567

[12] Carragher, D.J., Towler, A., Mileva, V.R. et al. Masked face identification is improved by diagnostic feature training. Cogn. Research 7, 30 (2022). https://doi.org/10.1186/s41235-022-00381-x

IDEAS: National Conference of Innovation on Data Engineering and AI Science.

June 11-12, 2025, Oran, Algeria.

LLMs and Cybersecurity: applications,

efficiency and challenges

Nassiba Wafa ABDERRAHIM

National Higher School of Telecommunications and ICT

LaRATIC Lab, Oran, Algeria

wafa.abderrahim@ensttic.dz

Abstract— Large Language Models (LLMs) have recently become

indispensable tools across various domains, including

Cybersecurity. Their ability to understand natural language,

reason through complex problems, and generate code makes them

especially useful for security tasks like threat detection,

vulnerability assessment, and process automation. This paper

explores the practical applications of LLMs in real-world

Cybersecurity scenarios. We discuss their impact on productivity,

analytical quality, and overall effectiveness based on several well-

known LLMs, and we also address the challenges and ethical

considerations involved in integrating them into this field.

Keywords—LLMs, Cybersecurity, Security tasks,

productivity, ethical considerations.

I. INTRODUCTION

With the rise of generative AI and Large Language

Models (LLMs) such as GPT, Claude, and Gemini,

new opportunities have emerged to support users in

daily tasks ranging from code generation and text

processing to complex problem-solving and

automation. Insights from experts and practitioners

suggest that integrating LLMs into workflows can

potentially triple productivity across various sectors,

making them indispensable tools for optimizing

efficiency and allocating resources more effectively

[1][2].

Applying AI tools and LLMs in Cybersecurity has become indispensable: the AI market in Cybersecurity is projected to grow at a compound annual growth rate of 21.9% between 2023 and 2028 [3], reflecting the sector’s dynamic and rapidly evolving nature.

Cybersecurity demands speed, precision, and adaptability to effectively address increasingly complex and fast-changing threats and cyberattacks. Indeed, LLMs have great potential for a wide range of Cybersecurity tasks, such as threat intelligence, vulnerability detection, and secure code generation, among others [4][5][6]. Each of these tasks presents its own challenges and opportunities. In threat intelligence, for instance, LLMs are being utilized to extract and organize information from massive volumes of documents, a task that has traditionally been labor-intensive and time-consuming. Similarly, in anomaly detection, LLMs are being employed to identify security anomalies such as malicious traffic in network flows, malware files in systems, and anomalies in logs. These applications have opened new avenues for enhancing Cybersecurity [7]. On the one hand, open-source LLMs support the development of Cybersecurity-specific models that address unique Cybersecurity challenges. On the other hand, advanced LLMs solve complex tasks via prompt engineering, in-context learning, and chain-of-thought reasoning despite the lack of Cybersecurity-specific training [8].

However, despite extensive research on the

application of LLMs in Cybersecurity, there remains

a lack of comprehensive overviews that reflect real-

world practices from the perspectives of

professionals and practitioners. This paper presents

the main use cases of LLMs in the daily work of

Cybersecurity practitioners, including script

generation, code analysis, and documentation. Each

section is supported by concrete examples and

provides a critical discussion of the benefits,

limitations, and future perspectives for the effective

adoption of LLM-based Cybersecurity solutions.

The remainder of this paper is organized as

follows. Section 2 provides an overview of the most

widely used LLMs. Section 3 discusses the key

applications of LLMs in Cybersecurity, with a

particular emphasis on practical use cases. Section 4

addresses the challenges and ethical considerations

associated with the exploitation of LLMs in

Cybersecurity. Finally, Section 5 presents the

conclusion.

II. OVERVIEW OF LLMS


LLMs have undergone a remarkable evolution, progressing from initial statistical language models (SLMs) to neural language models (NLMs), then to pre-trained language models (PLMs), and finally to the current state of large language models. These models are called large because they are trained on large amounts of text data, such as books, articles, and websites. They are capable of interacting with users in a conversational manner and generating new text that closely resembles human writing or speech [9].

Several key factors have driven this evolutionary

trajectory, including increased data diversity,

computational advancements, and algorithmic

innovations [10]. This evolution has not only

impacted general applications but has also opened up

new possibilities in specialized fields such as

Cybersecurity, by revolutionizing approaches to

threat detection, analysis, and response [9].

Table 1 summarizes the most widely used LLMs,

categorized into two main types: open-source and

closed-source models. Open-source LLMs offer

accessible model weights, allowing researchers to

adapt them for specific Cybersecurity needs, such as

handling private data or building

custom tools. However, they may have limitations in

terms of performance and scalability. In contrast,

closed-source models often deliver superior

accuracy and efficiency but offer limited

transparency, raising concerns about potential biases

and constraints.

TABLE I

SUMMARY OF LLMS USED IN CYBERSECURITY

Organization      LLM        Size                Open Source
OpenAI            GPT-4      ~1.76T parameters
Anthropic         Claude 3   Undisclosed
Google DeepMind   Gemini     Undisclosed
